linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/3] 2.6.17 radix-tree: updates and lockless
@ 2006-06-20 14:48 Nick Piggin
  2006-06-20 14:48 ` [patch 1/3] radix-tree: direct data Nick Piggin
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 14:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andrew Morton, Paul McKenney
  Cc: Linux Kernel, Nick Piggin, Linux Memory Management

I've finally ported the RCU radix tree over my radix tree direct-data patch
(the latter patch has been in -mm for a while now).

I've also done the last step required for submission, which was to make a
small userspace RCU test harness, and wire up the rtth so that it can handle
multiple threads to test the lockless capability. The RCU test harness uses
an implementation somewhat like Paul's paper's quiescent state bitmask
approach; with infrequent quiescent state updates, performance isn't bad.

This quickly flushed out several obscure bugs just when running on my dual
G5. After fixing those, I racked up about 100 CPU hours of testing on
SUSE's 64-way Altix without problem. Also passes the normal battery of
single threaded rtth tests.

I'd like to hear views regarding merging these patches for 2.6.18. Initially
the lockless code would not come into effect (good - one thing at a time)
until tree_lock can start getting lifted in -mm and 2.6.19.

Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 1/3] radix-tree: direct data
  2006-06-20 14:48 [patch 0/3] 2.6.17 radix-tree: updates and lockless Nick Piggin
@ 2006-06-20 14:48 ` Nick Piggin
  2006-06-20 14:48 ` [patch 2/3] radix-tree: small Nick Piggin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 14:48 UTC (permalink / raw)
  To: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney
  Cc: Linux Kernel, Nick Piggin, Linux Memory Management

From: Nick Piggin <npiggin@suse.de>

The ability to have height 0 radix trees (a direct pointer to the data item
rather than going through a full node->slot) quietly disappeared with
old-2.6-bkcvs commit ffee171812d51652f9ba284302d9e5c5cc14bdfd.  On 64-bit
machines this causes nearly 600 bytes to be used for every <= 4K file in
pagecache.

Re-introduce this feature, root tags stored in spare ->gfp_mask bits.

Simplify radix_tree_delete's complex tag clearing arrangement (which would
become even more complex) by just falling back to tag clearing functions
(the pagecache radix-tree never uses this path anyway, so the icache
savings will mean it's actually a speedup).

On my 4GB G5, this saves 8MB RAM per kernel kernel source+object tree in
pagecache.

Pagecache lookup, insertion, and removal speed for small files will also be
improved.

This makes RCU radix tree harder, but it's worth it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -23,6 +23,9 @@
 #include <linux/preempt.h>
 #include <linux/types.h>
 
+#define RADIX_TREE_MAX_TAGS 2
+
+/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
@@ -45,8 +48,6 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
-#define RADIX_TREE_MAX_TAGS 2
-
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -74,6 +74,11 @@ struct radix_tree_preload {
 };
 DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
+static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
+{
+	return root->gfp_mask & __GFP_BITS_MASK;
+}
+
 /*
  * This assumes that the caller has performed appropriate preallocation, and
  * that the caller has pinned this thread of control to the current CPU.
@@ -82,9 +87,10 @@ static struct radix_tree_node *
 radix_tree_node_alloc(struct radix_tree_root *root)
 {
 	struct radix_tree_node *ret;
+	gfp_t gfp_mask = root_gfp_mask(root);
 
-	ret = kmem_cache_alloc(radix_tree_node_cachep, root->gfp_mask);
-	if (ret == NULL && !(root->gfp_mask & __GFP_WAIT)) {
+	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
 		struct radix_tree_preload *rtp;
 
 		rtp = &__get_cpu_var(radix_tree_preloads);
@@ -152,6 +158,27 @@ static inline int tag_get(struct radix_t
 	return test_bit(offset, node->tags[tag]);
 }
 
+static inline void root_tag_set(struct radix_tree_root *root, unsigned int tag)
+{
+	root->gfp_mask |= (1 << (tag + __GFP_BITS_SHIFT));
+}
+
+
+static inline void root_tag_clear(struct radix_tree_root *root, unsigned int tag)
+{
+	root->gfp_mask &= ~(1 << (tag + __GFP_BITS_SHIFT));
+}
+
+static inline void root_tag_clear_all(struct radix_tree_root *root)
+{
+	root->gfp_mask &= __GFP_BITS_MASK;
+}
+
+static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag)
+{
+	return root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT));
+}
+
 /*
  * Returns 1 if any slot in the node has this tag set.
  * Otherwise returns 0.
@@ -182,7 +209,6 @@ static int radix_tree_extend(struct radi
 {
 	struct radix_tree_node *node;
 	unsigned int height;
-	char tags[RADIX_TREE_MAX_TAGS];
 	int tag;
 
 	/* Figure out what the height should be.  */
@@ -195,16 +221,6 @@ static int radix_tree_extend(struct radi
 		goto out;
 	}
 
-	/*
-	 * Prepare the tag status of the top-level node for propagation
-	 * into the newly-pushed top-level node(s)
-	 */
-	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-		tags[tag] = 0;
-		if (any_tag_set(root->rnode, tag))
-			tags[tag] = 1;
-	}
-
 	do {
 		if (!(node = radix_tree_node_alloc(root)))
 			return -ENOMEM;
@@ -214,7 +230,7 @@ static int radix_tree_extend(struct radi
 
 		/* Propagate the aggregated tag info into the new root */
 		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-			if (tags[tag])
+			if (root_tag_get(root, tag))
 				tag_set(node, tag, 0);
 		}
 
@@ -243,8 +259,7 @@ int radix_tree_insert(struct radix_tree_
 	int error;
 
 	/* Make sure the tree is high enough.  */
-	if ((!index && !root->rnode) ||
-			index > radix_tree_maxindex(root->height)) {
+	if (index > radix_tree_maxindex(root->height)) {
 		error = radix_tree_extend(root, index);
 		if (error)
 			return error;
@@ -255,7 +270,7 @@ int radix_tree_insert(struct radix_tree_
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	offset = 0;			/* uninitialised var warning */
-	do {
+	while (height > 0) {
 		if (slot == NULL) {
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
@@ -273,16 +288,21 @@ int radix_tree_insert(struct radix_tree_
 		slot = node->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	} while (height > 0);
+	}
 
 	if (slot != NULL)
 		return -EEXIST;
 
-	BUG_ON(!node);
-	node->count++;
-	node->slots[offset] = item;
-	BUG_ON(tag_get(node, 0, offset));
-	BUG_ON(tag_get(node, 1, offset));
+	if (node) {
+		node->count++;
+		node->slots[offset] = item;
+		BUG_ON(tag_get(node, 0, offset));
+		BUG_ON(tag_get(node, 1, offset));
+	} else {
+		root->rnode = item;
+		BUG_ON(root_tag_get(root, 0));
+		BUG_ON(root_tag_get(root, 1));
+	}
 
 	return 0;
 }
@@ -295,9 +315,13 @@ static inline void **__lookup_slot(struc
 	struct radix_tree_node **slot;
 
 	height = root->height;
+
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
+	if (height == 0 && root->rnode)
+		return (void **)&root->rnode;
+
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 	slot = &root->rnode;
 
@@ -368,8 +392,8 @@ void *radix_tree_tag_set(struct radix_tr
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
-	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 
 	while (height > 0) {
 		int offset;
@@ -383,6 +407,10 @@ void *radix_tree_tag_set(struct radix_tr
 		height--;
 	}
 
+	/* set the root's tag bit */
+	if (slot && !root_tag_get(root, tag))
+		root_tag_set(root, tag);
+
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_set);
@@ -405,9 +433,8 @@ void *radix_tree_tag_clear(struct radix_
 			unsigned long index, unsigned int tag)
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
-	struct radix_tree_node *slot;
+	struct radix_tree_node *slot = NULL;
 	unsigned int height, shift;
-	void *ret = NULL;
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
@@ -432,20 +459,24 @@ void *radix_tree_tag_clear(struct radix_
 		height--;
 	}
 
-	ret = slot;
-	if (ret == NULL)
+	if (slot == NULL)
 		goto out;
 
-	do {
+	while (pathp->node) {
 		if (!tag_get(pathp->node, tag, pathp->offset))
 			goto out;
 		tag_clear(pathp->node, tag, pathp->offset);
 		if (any_tag_set(pathp->node, tag))
 			goto out;
 		pathp--;
-	} while (pathp->node);
+	}
+
+	/* clear the root's tag bit */
+	if (root_tag_get(root, tag))
+		root_tag_clear(root, tag);
+
 out:
-	return ret;
+	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_clear);
 
@@ -458,9 +489,8 @@ EXPORT_SYMBOL(radix_tree_tag_clear);
  *
  * Return values:
  *
- *  0: tag not present
- *  1: tag present, set
- * -1: tag present, unset
+ *  0: tag not present or not set
+ *  1: tag set
  */
 int radix_tree_tag_get(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag)
@@ -473,6 +503,13 @@ int radix_tree_tag_get(struct radix_tree
 	if (index > radix_tree_maxindex(height))
 		return 0;
 
+	/* check the root's tag bit */
+	if (!root_tag_get(root, tag))
+		return 0;
+
+	if (height == 0)
+		return 1;
+
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
 
@@ -494,7 +531,7 @@ int radix_tree_tag_get(struct radix_tree
 			int ret = tag_get(slot, tag, offset);
 
 			BUG_ON(ret && saw_unset_tag);
-			return ret ? 1 : -1;
+			return ret;
 		}
 		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
@@ -514,8 +551,11 @@ __lookup(struct radix_tree_root *root, v
 	unsigned long i;
 
 	height = root->height;
-	if (height == 0)
+	if (height == 0) {
+		if (root->rnode && index == 0)
+			results[nr_found++] = root->rnode;
 		goto out;
+	}
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
@@ -603,10 +643,16 @@ __lookup_tag(struct radix_tree_root *roo
 	unsigned int height = root->height;
 	struct radix_tree_node *slot;
 
+	if (height == 0) {
+		if (root->rnode && index == 0)
+			results[nr_found++] = root->rnode;
+		goto out;
+	}
+
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
 
-	while (height > 0) {
+	do {
 		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
 
 		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
@@ -637,7 +683,7 @@ __lookup_tag(struct radix_tree_root *roo
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
 		slot = slot->slots[i];
-	}
+	} while (height > 0);
 out:
 	*next_index = index;
 	return nr_found;
@@ -665,6 +711,10 @@ radix_tree_gang_lookup_tag(struct radix_
 	unsigned long cur_index = first_index;
 	unsigned int ret = 0;
 
+	/* check the root's tag bit */
+	if (!root_tag_get(root, tag))
+		return 0;
+
 	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
@@ -689,7 +739,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag
 static inline void radix_tree_shrink(struct radix_tree_root *root)
 {
 	/* try to shrink tree height */
-	while (root->height > 1 &&
+	while (root->height > 0 &&
 			root->rnode->count == 1 &&
 			root->rnode->slots[0]) {
 		struct radix_tree_node *to_free = root->rnode;
@@ -717,12 +767,8 @@ static inline void radix_tree_shrink(str
 void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
-	struct radix_tree_path *orig_pathp;
-	struct radix_tree_node *slot;
+	struct radix_tree_node *slot = NULL;
 	unsigned int height, shift;
-	void *ret = NULL;
-	char tags[RADIX_TREE_MAX_TAGS];
-	int nr_cleared_tags;
 	int tag;
 	int offset;
 
@@ -730,11 +776,17 @@ void *radix_tree_delete(struct radix_tre
 	if (index > radix_tree_maxindex(height))
 		goto out;
 
+	slot = root->rnode;
+	if (height == 0 && root->rnode) {
+		root_tag_clear_all(root);
+		root->rnode = NULL;
+		goto out;
+	}
+
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
-	slot = root->rnode;
 
-	for ( ; height > 0; height--) {
+	do {
 		if (slot == NULL)
 			goto out;
 
@@ -744,44 +796,22 @@ void *radix_tree_delete(struct radix_tre
 		pathp->node = slot;
 		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
-	}
+		height--;
+	} while (height > 0);
 
-	ret = slot;
-	if (ret == NULL)
+	if (slot == NULL)
 		goto out;
 
-	orig_pathp = pathp;
-
 	/*
 	 * Clear all tags associated with the just-deleted item
 	 */
-	nr_cleared_tags = 0;
 	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-		tags[tag] = 1;
-		if (tag_get(pathp->node, tag, pathp->offset)) {
-			tag_clear(pathp->node, tag, pathp->offset);
-			if (!any_tag_set(pathp->node, tag)) {
-				tags[tag] = 0;
-				nr_cleared_tags++;
-			}
-		}
-	}
-
-	for (pathp--; nr_cleared_tags && pathp->node; pathp--) {
-		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-			if (tags[tag])
-				continue;
-
-			tag_clear(pathp->node, tag, pathp->offset);
-			if (any_tag_set(pathp->node, tag)) {
-				tags[tag] = 1;
-				nr_cleared_tags--;
-			}
-		}
+		if (tag_get(pathp->node, tag, pathp->offset))
+			radix_tree_tag_clear(root, index, tag);
 	}
 
 	/* Now free the nodes we do not need anymore */
-	for (pathp = orig_pathp; pathp->node; pathp--) {
+	while (pathp->node) {
 		pathp->node->slots[pathp->offset] = NULL;
 		pathp->node->count--;
 
@@ -793,11 +823,15 @@ void *radix_tree_delete(struct radix_tre
 
 		/* Node with zero slots in use so free it */
 		radix_tree_node_free(pathp->node);
+
+		pathp--;
 	}
-	root->rnode = NULL;
+	root_tag_clear_all(root);
 	root->height = 0;
+	root->rnode = NULL;
+
 out:
-	return ret;
+	return slot;
 }
 EXPORT_SYMBOL(radix_tree_delete);
 
@@ -808,11 +842,7 @@ EXPORT_SYMBOL(radix_tree_delete);
  */
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
 {
-  	struct radix_tree_node *rnode;
-  	rnode = root->rnode;
-  	if (!rnode)
-  		return 0;
-	return any_tag_set(rnode, tag);
+	return root_tag_get(root, tag);
 }
 EXPORT_SYMBOL(radix_tree_tagged);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 2/3] radix-tree: small
  2006-06-20 14:48 [patch 0/3] 2.6.17 radix-tree: updates and lockless Nick Piggin
  2006-06-20 14:48 ` [patch 1/3] radix-tree: direct data Nick Piggin
@ 2006-06-20 14:48 ` Nick Piggin
  2006-06-20 14:48 ` [patch 3/3] radix-tree: RCU lockless readside Nick Piggin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 14:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andrew Morton, Paul McKenney
  Cc: Linux Kernel, Nick Piggin, Linux Memory Management

Reduce radix tree node memory usage by about a factor of 4 for small
files (< 64K). There are pointer traversal and memory usage costs for
large files with dense pagecache.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -33,7 +33,7 @@
 
 
 #ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT	6
+#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 3/3] radix-tree: RCU lockless readside
  2006-06-20 14:48 [patch 0/3] 2.6.17 radix-tree: updates and lockless Nick Piggin
  2006-06-20 14:48 ` [patch 1/3] radix-tree: direct data Nick Piggin
  2006-06-20 14:48 ` [patch 2/3] radix-tree: small Nick Piggin
@ 2006-06-20 14:48 ` Nick Piggin
  2006-06-22  1:49   ` Paul E. McKenney
  2006-06-20 22:08 ` [patch 0/3] 2.6.17 radix-tree: updates and lockless Benjamin Herrenschmidt
  2006-06-20 22:35 ` Andrew Morton
  4 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 14:48 UTC (permalink / raw)
  To: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney
  Cc: Linux Kernel, Nick Piggin, Linux Memory Management

Make radix tree lookups safe to be performed without locks. Readers
are protected against nodes being deleted by using RCU based freeing.
Readers are protected against new node insertion by using memory
barriers to ensure the node itself will be properly written before it
is visible in the radix tree.

Each radix tree node keeps a record of their height (above leaf
nodes). This height does not change after insertion -- when the radix
tree is extended, higher nodes are only inserted in the top. So a
lookup can take the pointer to what is *now* the root node, and
traverse down it even if the tree is concurrently extended and this
node becomes a subtree of a new root.

"Direct" pointers (tree height of 0, where root->rnode points directly
to the data item) are handled by using the low bit of the pointer to
signal whether rnode is a direct pointer or a pointer to a radix tree
node.

When a reader wants to traverse the next branch, they will take a
copy of the pointer. This pointer will be either NULL (and the branch
is empty) or non-NULL (and will point to a valid node).

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -30,6 +30,7 @@
 #include <linux/gfp.h>
 #include <linux/string.h>
 #include <linux/bitops.h>
+#include <linux/rcupdate.h>
 
 
 #ifdef __KERNEL__
@@ -45,7 +46,9 @@
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
 struct radix_tree_node {
+	unsigned int	height;		/* Height from the bottom */
 	unsigned int	count;
+	struct rcu_head	rcu_head;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
 };
@@ -100,13 +103,21 @@ radix_tree_node_alloc(struct radix_tree_
 			rtp->nr--;
 		}
 	}
+	BUG_ON(radix_tree_is_direct_ptr(ret));
 	return ret;
 }
 
+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+	struct radix_tree_node *node =
+			container_of(head, struct radix_tree_node, rcu_head);
+	kmem_cache_free(radix_tree_node_cachep, node);
+}
+
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	kmem_cache_free(radix_tree_node_cachep, node);
+	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
@@ -222,11 +233,12 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
+		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
 			return -ENOMEM;
 
 		/* Increase the height.  */
-		node->slots[0] = root->rnode;
+		node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
 
 		/* Propagate the aggregated tag info into the new root */
 		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -234,9 +246,11 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
+		newheight = root->height+1;
+		node->height = newheight;
 		node->count = 1;
-		root->rnode = node;
-		root->height++;
+		rcu_assign_pointer(root->rnode, node);
+		root->height = newheight;
 	} while (height > root->height);
 out:
 	return 0;
@@ -258,6 +272,8 @@ int radix_tree_insert(struct radix_tree_
 	int offset;
 	int error;
 
+	BUG_ON(radix_tree_is_direct_ptr(item));
+
 	/* Make sure the tree is high enough.  */
 	if (index > radix_tree_maxindex(root->height)) {
 		error = radix_tree_extend(root, index);
@@ -275,11 +291,12 @@ int radix_tree_insert(struct radix_tree_
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
 				return -ENOMEM;
+			slot->height = height;
 			if (node) {
-				node->slots[offset] = slot;
+				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
 			} else
-				root->rnode = slot;
+				rcu_assign_pointer(root->rnode, slot);
 		}
 
 		/* Go a level down */
@@ -295,11 +312,11 @@ int radix_tree_insert(struct radix_tree_
 
 	if (node) {
 		node->count++;
-		node->slots[offset] = item;
+		rcu_assign_pointer(node->slots[offset], item);
 		BUG_ON(tag_get(node, 0, offset));
 		BUG_ON(tag_get(node, 1, offset));
 	} else {
-		root->rnode = item;
+		rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
 		BUG_ON(root_tag_get(root, 0));
 		BUG_ON(root_tag_get(root, 1));
 	}
@@ -308,49 +325,51 @@ int radix_tree_insert(struct radix_tree_
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-static inline void **__lookup_slot(struct radix_tree_root *root,
-				   unsigned long index)
+/**
+ *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the slot corresponding to the position @index in the radix tree
+ *	@root. This is useful for update-if-exists operations.
+ *
+ *	This function cannot be called under rcu_read_lock, it must be
+ *	excluded from writers, as must the returned slot.
+ */
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
 	unsigned int height, shift;
-	struct radix_tree_node **slot;
-
-	height = root->height;
+	struct radix_tree_node *node, **slot;
 
-	if (index > radix_tree_maxindex(height))
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
 		return NULL;
 
-	if (height == 0 && root->rnode)
+	if (radix_tree_is_direct_ptr(node)) {
+		if (index > 0)
+			return NULL;
 		return (void **)&root->rnode;
+	}
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
 
-	while (height > 0) {
-		if (*slot == NULL)
+	do {
+		slot = (struct radix_tree_node **)
+			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+		node = rcu_dereference(*slot);
+		if (node == NULL)
 			return NULL;
 
-		slot = (struct radix_tree_node **)
-			((*slot)->slots +
-				((index >> shift) & RADIX_TREE_MAP_MASK));
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	}
+	} while (height > 0);
 
 	return (void **)slot;
 }
-
-/**
- *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
- *	@root:		radix tree root
- *	@index:		index key
- *
- *	Lookup the slot corresponding to the position @index in the radix tree
- *	@root. This is useful for update-if-exists operations.
- */
-void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
-{
-	return __lookup_slot(root, index);
-}
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
 /**
@@ -359,13 +378,45 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
  *	@index:		index key
  *
  *	Lookup the item at the position @index in the radix tree @root.
+ *
+ *	This function can be called under rcu_read_lock, however the caller
+ *	must manage lifetimes of leaf nodes (eg. RCU may also be used to free
+ *	them safely). No RCU barriers are required to access or modify the
+ *	returned item, however.
  */
 void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
 {
-	void **slot;
+	unsigned int height, shift;
+	struct radix_tree_node *node, **slot;
+
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return NULL;
+
+	if (radix_tree_is_direct_ptr(node)) {
+		if (index > 0)
+			return NULL;
+		return radix_tree_direct_to_ptr(node);
+	}
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return NULL;
 
-	slot = __lookup_slot(root, index);
-	return slot != NULL ? *slot : NULL;
+	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
+
+	do {
+		slot = (struct radix_tree_node **)
+			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+		node = rcu_dereference(*slot);
+		if (node == NULL)
+			return NULL;
+
+		shift -= RADIX_TREE_MAP_SHIFT;
+		height--;
+	} while (height > 0);
+
+	return node;
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
@@ -542,29 +593,25 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift, height;
-	struct radix_tree_node *slot;
 	unsigned long i;
 
-	height = root->height;
-	if (height == 0) {
-		if (root->rnode && index == 0)
-			results[nr_found++] = root->rnode;
+	height = slot->height;
+	if (height == 0)
 		goto out;
-	}
-
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = root->rnode;
 
 	for ( ; height > 1; height--) {
+		struct radix_tree_node *__s;
 
 		for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 				i < RADIX_TREE_MAP_SIZE; i++) {
-			if (slot->slots[i] != NULL)
+			__s = rcu_dereference(slot->slots[i]);
+			if (__s != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
@@ -575,14 +622,16 @@ __lookup(struct radix_tree_root *root, v
 			goto out;
 
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = slot->slots[i];
+		slot = __s;
 	}
 
 	/* Bottom level: grab some items */
 	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
+		struct radix_tree_node *node;
 		index++;
-		if (slot->slots[i]) {
-			results[nr_found++] = slot->slots[i];
+		node = slot->slots[i];
+		if (node) {
+			results[nr_found++] = node;
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -604,28 +653,54 @@ out:
  *	*@results.
  *
  *	The implementation is naive.
+ *
+ *	Like radix_tree_lookup, radix_tree_gang_lookup may be called under
+ *	rcu_read_lock. In this case, rather than the returned results being
+ *	an atomic snapshot of the tree at a single point in time, the semantics
+ *	of an RCU protected gang lookup are as though multiple radix_tree_lookups
+ *	have been issued in individual locks, and results stored in 'results'.
  */
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items)
 {
-	const unsigned long max_index = radix_tree_maxindex(root->height);
+	unsigned long max_index;
+	struct radix_tree_node *node;
 	unsigned long cur_index = first_index;
-	unsigned int ret = 0;
+	unsigned int ret;
+
+	node = rcu_dereference(root->rnode);
+	if (!node)
+		return 0;
+
+	if (radix_tree_is_direct_ptr(node)) {
+		if (first_index > 0)
+			return 0;
+		results[0] = radix_tree_direct_to_ptr(node);
+		ret = 1;
+		goto out;
+	}
 
+	max_index = radix_tree_maxindex(node->height);
+
+	ret = 0;
 	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
 		if (cur_index > max_index)
 			break;
-		nr_found = __lookup(root, results + ret, cur_index,
+		nr_found = __lookup(node, results + ret, cur_index,
 					max_items - ret, &next_index);
 		ret += nr_found;
 		if (next_index == 0)
 			break;
 		cur_index = next_index;
 	}
+
+out:
+	(void)rcu_dereference(results); /* single barrier for all results */
+
 	return ret;
 }
 EXPORT_SYMBOL(radix_tree_gang_lookup);
@@ -635,24 +710,16 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
  * open-coding the search.
  */
 static unsigned int
-__lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index, unsigned int tag)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift;
-	unsigned int height = root->height;
-	struct radix_tree_node *slot;
-
-	if (height == 0) {
-		if (root->rnode && index == 0)
-			results[nr_found++] = root->rnode;
-		goto out;
-	}
+	unsigned int height = slot->height;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	slot = root->rnode;
 
-	do {
+	while (height > 0) {
 		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
 
 		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
@@ -683,7 +750,7 @@ __lookup_tag(struct radix_tree_root *roo
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
 		slot = slot->slots[i];
-	} while (height > 0);
+	}
 out:
 	*next_index = index;
 	return nr_found;
@@ -707,7 +774,8 @@ radix_tree_gang_lookup_tag(struct radix_
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag)
 {
-	const unsigned long max_index = radix_tree_maxindex(root->height);
+	struct radix_tree_node *node;
+	unsigned long max_index;
 	unsigned long cur_index = first_index;
 	unsigned int ret = 0;
 
@@ -715,13 +783,27 @@ radix_tree_gang_lookup_tag(struct radix_
 	if (!root_tag_get(root, tag))
 		return 0;
 
+	node = rcu_dereference(root->rnode);
+	if (!node)
+		return ret;
+
+	if (radix_tree_is_direct_ptr(node)) {
+		if (first_index > 0)
+			return 0;
+		node = radix_tree_direct_to_ptr(node);
+		results[0] = node;
+		return 1;
+	}
+
+	max_index = radix_tree_maxindex(node->height);
+
 	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
 		if (cur_index > max_index)
 			break;
-		nr_found = __lookup_tag(root, results + ret, cur_index,
+		nr_found = __lookup_tag(node, results + ret, cur_index,
 					max_items - ret, &next_index, tag);
 		ret += nr_found;
 		if (next_index == 0)
@@ -743,8 +825,17 @@ static inline void radix_tree_shrink(str
 			root->rnode->count == 1 &&
 			root->rnode->slots[0]) {
 		struct radix_tree_node *to_free = root->rnode;
+		void *newptr;
 
-		root->rnode = to_free->slots[0];
+		/*
+		 * this doesn't need an rcu_assign_pointer, because
+		 * we aren't touching the object that to_free->slots[0]
+		 * points to.
+		 */
+		newptr = to_free->slots[0];
+		if (root->height == 1)
+			newptr = radix_tree_ptr_to_direct(newptr);
+		root->rnode = newptr;
 		root->height--;
 		/* must only free zeroed nodes into the slab */
 		tag_clear(to_free, 0, 0);
@@ -768,6 +859,7 @@ void *radix_tree_delete(struct radix_tre
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
 	struct radix_tree_node *slot = NULL;
+	struct radix_tree_node *to_free;
 	unsigned int height, shift;
 	int tag;
 	int offset;
@@ -778,6 +870,7 @@ void *radix_tree_delete(struct radix_tre
 
 	slot = root->rnode;
 	if (height == 0 && root->rnode) {
+		slot = radix_tree_direct_to_ptr(slot);
 		root_tag_clear_all(root);
 		root->rnode = NULL;
 		goto out;
@@ -810,8 +903,11 @@ void *radix_tree_delete(struct radix_tre
 			radix_tree_tag_clear(root, index, tag);
 	}
 
+	to_free = NULL;
 	/* Now free the nodes we do not need anymore */
 	while (pathp->node) {
+		if (to_free)
+			radix_tree_node_free(to_free);
 		pathp->node->slots[pathp->offset] = NULL;
 		pathp->node->count--;
 
@@ -822,13 +918,15 @@ void *radix_tree_delete(struct radix_tre
 		}
 
 		/* Node with zero slots in use so free it */
-		radix_tree_node_free(pathp->node);
-
+		to_free = pathp->node;
 		pathp--;
+
 	}
 	root_tag_clear_all(root);
 	root->height = 0;
 	root->rnode = NULL;
+	if (to_free)
+		radix_tree_node_free(to_free);
 
 out:
 	return slot;
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -22,6 +22,8 @@
 #include <linux/sched.h>
 #include <linux/preempt.h>
 #include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/rcupdate.h>
 
 #define RADIX_TREE_MAX_TAGS 2
 
@@ -48,6 +50,37 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#define RADIX_TREE_DIRECT_PTR	1
+
+static inline void *radix_tree_ptr_to_direct(void *ptr)
+{
+	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+}
+
+static inline void *radix_tree_direct_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+}
+
+static inline int radix_tree_is_direct_ptr(void *ptr)
+{
+	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+}
+
+static inline void *__radix_tree_deref_slot(void **slot)
+{
+	return rcu_dereference(radix_tree_direct_to_ptr(*slot));
+}
+
+#define radix_tree_deref_slot(slot) __radix_tree_deref_slot((void **)slot)
+
+static inline void radix_tree_replace_slot(void **slot, void *item)
+{
+	BUG_ON(radix_tree_is_direct_ptr(item));
+	rcu_assign_pointer(*slot,
+		(void *)((unsigned long)item | ((unsigned long)*slot & RADIX_TREE_DIRECT_PTR)));
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -228,7 +228,7 @@ int migrate_page_remove_references(struc
 						page_index(page));
 
 	if (!page_mapping(page) || page_count(page) != nr_refs ||
-			*radix_pointer != page) {
+			radix_tree_deref_slot(radix_pointer) != page) {
 		write_unlock_irq(&mapping->tree_lock);
 		return -EAGAIN;
 	}
@@ -249,7 +249,7 @@ int migrate_page_remove_references(struc
 		set_page_private(newpage, page_private(page));
 	}
 
-	*radix_pointer = newpage;
+	radix_tree_replace_slot(radix_pointer, newpage);
 	__put_page(page);
 	write_unlock_irq(&mapping->tree_lock);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-20 14:48 [patch 0/3] 2.6.17 radix-tree: updates and lockless Nick Piggin
                   ` (2 preceding siblings ...)
  2006-06-20 14:48 ` [patch 3/3] radix-tree: RCU lockless readside Nick Piggin
@ 2006-06-20 22:08 ` Benjamin Herrenschmidt
  2006-06-20 22:35 ` Andrew Morton
  4 siblings, 0 replies; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-20 22:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Paul McKenney, Linux Kernel, Linux Memory Management

On Tue, 2006-06-20 at 16:48 +0200, Nick Piggin wrote:
> I've finally ported the RCU radix tree over my radix tree direct-data patch
> (the latter patch has been in -mm for a while now).
> 
> I've also done the last step required for submission, which was to make a
> small userspace RCU test harness, and wire up the rtth so that it can handle
> multiple threads to test the lockless capability. The RCU test harness uses
> an implementation somewhat like Paul's paper's quiescent state bitmask
> approach; with infrequent quiescent state updates, performance isn't bad.
> 
> This quickly flushed out several obscure bugs just when running on my dual
> G5. After fixing those, I racked up about 100 CPU hours of testing on
> SUSE's 64-way Altix without problem. Also passes the normal battery of
> single threaded rtth tests.
> 
> I'd like to hear views regarding merging these patches for 2.6.18. Initially
> the lockless code would not come into effect (good - one thing at a time)
> until tree_lock can start getting lifted in -mm and 2.6.19.

I'm all about it. As I said earlier, I discovered that pp64 has been
abusing radix tree expecting them to work lockless in it's interrupt
management for ages (at least insert vs. search). Fortunatley, the race
is rare enough that it might never have been happening in practice but
still... It would kill me to have to add a big global spinlock on
interrupt handling to fix that so yeah, go for it !

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-20 14:48 [patch 0/3] 2.6.17 radix-tree: updates and lockless Nick Piggin
                   ` (3 preceding siblings ...)
  2006-06-20 22:08 ` [patch 0/3] 2.6.17 radix-tree: updates and lockless Benjamin Herrenschmidt
@ 2006-06-20 22:35 ` Andrew Morton
  2006-06-20 23:09   ` Benjamin Herrenschmidt
  4 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2006-06-20 22:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: benh, Paul.McKenney, linux-kernel, linux-mm

Nick Piggin <npiggin@suse.de> wrote:
>
> I've finally ported the RCU radix tree over my radix tree direct-data patch
> (the latter patch has been in -mm for a while now).

Yes, radix-tree-direct-data.patch and radix-tree-small.patch are for-2.6.18.

> I've also done the last step required for submission, which was to make a
> small userspace RCU test harness, and wire up the rtth so that it can handle
> multiple threads to test the lockless capability. The RCU test harness uses
> an implementation somewhat like Paul's paper's quiescent state bitmask
> approach; with infrequent quiescent state updates, performance isn't bad.
> 
> This quickly flushed out several obscure bugs just when running on my dual
> G5. After fixing those, I racked up about 100 CPU hours of testing on
> SUSE's 64-way Altix without problem. Also passes the normal battery of
> single threaded rtth tests.
> 
> I'd like to hear views regarding merging these patches for 2.6.18. Initially
> the lockless code would not come into effect (good - one thing at a time)
> until tree_lock can start getting lifted in -mm and 2.6.19.

For 2.6.18 we obviously need to fix the tree_lock box-killer as #1
priority.  And whatever we do there needs to be backportable to 2.6.17. 
Depending upon Dave's testing results that'll be either covert-to-spinlock
or disable-rwlock-debugging-if-CONFIG_DEBUG_SPINLOCK.  Or something else. 
We'll see.

So given those complexities, and the lack of a _user_ of
radix-tree-rcu-lockless-readside.patch, it doesn't look like 2.6.18 stuff
at this time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-20 22:35 ` Andrew Morton
@ 2006-06-20 23:09   ` Benjamin Herrenschmidt
  2006-06-20 23:30     ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-20 23:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, Paul.McKenney, linux-kernel, linux-mm

On Tue, 2006-06-20 at 15:35 -0700, Andrew Morton wrote:

> So given those complexities, and the lack of a _user_ of
> radix-tree-rcu-lockless-readside.patch, it doesn't look like 2.6.18 stuff
> at this time.

So what should I do ? leave the bug in ppc64 or kill it's scalability
when taking interrupts ? You have one user already, me. From what Nick
says, the patch has been beaten up pretty heavily and seems stable....

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-20 23:09   ` Benjamin Herrenschmidt
@ 2006-06-20 23:30     ` Andrew Morton
  2006-06-20 23:50       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2006-06-20 23:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: npiggin, Paul.McKenney, linux-kernel, linux-mm

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> On Tue, 2006-06-20 at 15:35 -0700, Andrew Morton wrote:
> 
> > So given those complexities, and the lack of a _user_ of
> > radix-tree-rcu-lockless-readside.patch, it doesn't look like 2.6.18 stuff
> > at this time.
> 
> So what should I do ?

panic!

> leave the bug in ppc64 or kill it's scalability
> when taking interrupts ? You have one user already, me.

I didn't know that 30 minutes ago ;)

> From what Nick
> says, the patch has been beaten up pretty heavily and seems stable....

Well as I say, the tree_lock crash is way more important.  We need to work
out what we're going to do then get that fixed, backport the fix to -stable
then rebase the radix-tree patches on top and get
radix-tree-rcu-lockless-readside.patch tested and reviewed.

I guess we can do all that in time for -rc1, but not knowing _how_ we'll be
fixing the tree_lock crash is holding things up.

Paul, if you could take a close look at the RCU aspects of this work it'd
help, thanks.


btw guys, theory has it that code which was submitted post-2.6.n is too
late for 2.6.n+1..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-20 23:30     ` Andrew Morton
@ 2006-06-20 23:50       ` Benjamin Herrenschmidt
  2006-06-21  0:34         ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-20 23:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: npiggin, Paul.McKenney, linux-kernel, linux-mm

On Tue, 2006-06-20 at 16:30 -0700, Andrew Morton wrote:

> > leave the bug in ppc64 or kill it's scalability
> > when taking interrupts ? You have one user already, me.
> 
> I didn't know that 30 minutes ago ;)

Heh, I though I wrote that when I originally asked Nick to bring back
his patch up to date :) Bah, anyway, you know now. 

> > From what Nick
> > says, the patch has been beaten up pretty heavily and seems stable....
> 
> Well as I say, the tree_lock crash is way more important.  We need to work
> out what we're going to do then get that fixed, backport the fix to -stable
> then rebase the radix-tree patches on top and get
> radix-tree-rcu-lockless-readside.patch tested and reviewed.
> 
> I guess we can do all that in time for -rc1, but not knowing _how_ we'll be
> fixing the tree_lock crash is holding things up.

Ok.

> Paul, if you could take a close look at the RCU aspects of this work it'd
> help, thanks.
> 
> btw guys, theory has it that code which was submitted post-2.6.n is too
> late for 2.6.n+1..

Yes but the lockless radix tree patch was floating around a long time
ago :)

Anyway, I can drop a spinlock in (in fact I have) the ppc64 irq code for
now but that sucks, thus we should really seriously consider having the
lockless tree in 2.6.18 or I might have to look into doing an alternate
implementation specifically in arch code... or find some other way of
doing the inverse mapping there...

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-20 23:50       ` Benjamin Herrenschmidt
@ 2006-06-21  0:34         ` Christoph Lameter
  2006-06-21  0:47           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2006-06-21  0:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, npiggin, Paul.McKenney, linux-kernel, linux-mm

On Wed, 21 Jun 2006, Benjamin Herrenschmidt wrote:

> Anyway, I can drop a spinlock in (in fact I have) the ppc64 irq code for
> now but that sucks, thus we should really seriously consider having the
> lockless tree in 2.6.18 or I might have to look into doing an alternate
> implementation specifically in arch code... or find some other way of
> doing the inverse mapping there...

How many interrupts do you have to ? I would expect a simple table 
lookup would be fine to get from the virtual to the real interrupt.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-21  0:34         ` Christoph Lameter
@ 2006-06-21  0:47           ` Benjamin Herrenschmidt
  2006-06-21  1:07             ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-21  0:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, npiggin, Paul.McKenney, linux-kernel, linux-mm

On Tue, 2006-06-20 at 17:34 -0700, Christoph Lameter wrote:
> On Wed, 21 Jun 2006, Benjamin Herrenschmidt wrote:
> 
> > Anyway, I can drop a spinlock in (in fact I have) the ppc64 irq code for
> > now but that sucks, thus we should really seriously consider having the
> > lockless tree in 2.6.18 or I might have to look into doing an alternate
> > implementation specifically in arch code... or find some other way of
> > doing the inverse mapping there...
> 
> How many interrupts do you have to ? I would expect a simple table 
> lookup would be fine to get from the virtual to the real interrupt.

No, our hardware interrupt numbers are an encoded form containing the
geographical location of the device :) so they are 24 bits at least (and
we have a platform coming where they can be 64 bits).

The current code has a radix tree locally into the xics interrupt
controller used on IBM pSeries machines.

My rewritten powerpc irq core & remapper allows for different
reverse-map implementation based on the controller, you can ask for no
inverse map (in case you can fit your virtual irq directly in your hw
controller, some can, or give it to your hypervisor and get it back when
an interrupt happens), legacy (0..15 only, they are reserved unless you
are a i8259 to avoid problems with stupid drivers), linear map (a table)
or tree map (a radix tree).

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-21  0:47           ` Benjamin Herrenschmidt
@ 2006-06-21  1:07             ` Christoph Lameter
  2006-06-21  1:33               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2006-06-21  1:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, npiggin, Paul.McKenney, linux-kernel, linux-mm

On Wed, 21 Jun 2006, Benjamin Herrenschmidt wrote:

> No, our hardware interrupt numbers are an encoded form containing the
> geographical location of the device :) so they are 24 bits at least (and
> we have a platform coming where they can be 64 bits).

PICs with build in GPSses? And I thought we had weird hardware....

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/3] 2.6.17 radix-tree: updates and lockless
  2006-06-21  1:07             ` Christoph Lameter
@ 2006-06-21  1:33               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2006-06-21  1:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, npiggin, Paul.McKenney, linux-kernel, linux-mm

On Tue, 2006-06-20 at 18:07 -0700, Christoph Lameter wrote:
> On Wed, 21 Jun 2006, Benjamin Herrenschmidt wrote:
> 
> > No, our hardware interrupt numbers are an encoded form containing the
> > geographical location of the device :) so they are 24 bits at least (and
> > we have a platform coming where they can be 64 bits).
> 
> PICs with build in GPSses? And I thought we had weird hardware....

hehehe :) Well, domain/bus/slot number if you prefer but yeah, a GPS
would be much more cool !

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-20 14:48 ` [patch 3/3] radix-tree: RCU lockless readside Nick Piggin
@ 2006-06-22  1:49   ` Paul E. McKenney
  2006-06-22 15:45     ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2006-06-22  1:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney,
	Linux Kernel, Linux Memory Management

On Tue, Jun 20, 2006 at 04:48:48PM +0200, Nick Piggin wrote:

Pretty close, but some questions and comments interspersed.
Search for empty lines to find them.

Rough notes from review process at the end, in case anyone has
insomnia or something.

						Thanx, Paul

> Make radix tree lookups safe to be performed without locks. Readers
> are protected against nodes being deleted by using RCU based freeing.
> Readers are protected against new node insertion by using memory
> barriers to ensure the node itself will be properly written before it
> is visible in the radix tree.
> 
> Each radix tree node keeps a record of their height (above leaf
> nodes). This height does not change after insertion -- when the radix
> tree is extended, higher nodes are only inserted in the top. So a
> lookup can take the pointer to what is *now* the root node, and
> traverse down it even if the tree is concurrently extended and this
> node becomes a subtree of a new root.
> 
> "Direct" pointers (tree height of 0, where root->rnode points directly
> to the data item) are handled by using the low bit of the pointer to
> signal whether rnode is a direct pointer or a pointer to a radix tree
> node.
> 
> When a reader wants to traverse the next branch, they will take a
> copy of the pointer. This pointer will be either NULL (and the branch
> is empty) or non-NULL (and will point to a valid node).
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> Index: linux-2.6/lib/radix-tree.c
> ===================================================================
> --- linux-2.6.orig/lib/radix-tree.c
> +++ linux-2.6/lib/radix-tree.c
> @@ -30,6 +30,7 @@
>  #include <linux/gfp.h>
>  #include <linux/string.h>
>  #include <linux/bitops.h>
> +#include <linux/rcupdate.h>
>  
>  
>  #ifdef __KERNEL__
> @@ -45,7 +46,9 @@
>  	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
>  
>  struct radix_tree_node {
> +	unsigned int	height;		/* Height from the bottom */
>  	unsigned int	count;
> +	struct rcu_head	rcu_head;
>  	void		*slots[RADIX_TREE_MAP_SIZE];
>  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
>  };
> @@ -100,13 +103,21 @@ radix_tree_node_alloc(struct radix_tree_
>  			rtp->nr--;
>  		}
>  	}
> +	BUG_ON(radix_tree_is_direct_ptr(ret));
>  	return ret;
>  }
>  
> +static void radix_tree_node_rcu_free(struct rcu_head *head)
> +{
> +	struct radix_tree_node *node =
> +			container_of(head, struct radix_tree_node, rcu_head);
> +	kmem_cache_free(radix_tree_node_cachep, node);
> +}
> +
>  static inline void
>  radix_tree_node_free(struct radix_tree_node *node)
>  {
> -	kmem_cache_free(radix_tree_node_cachep, node);
> +	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
>  }
>  
>  /*
> @@ -222,11 +233,12 @@ static int radix_tree_extend(struct radi
>  	}
>  
>  	do {
> +		unsigned int newheight;
>  		if (!(node = radix_tree_node_alloc(root)))
>  			return -ENOMEM;
>  
>  		/* Increase the height.  */
> -		node->slots[0] = root->rnode;
> +		node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
>  
>  		/* Propagate the aggregated tag info into the new root */
>  		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> @@ -234,9 +246,11 @@ static int radix_tree_extend(struct radi
>  				tag_set(node, tag, 0);
>  		}
>  
> +		newheight = root->height+1;
> +		node->height = newheight;
>  		node->count = 1;
> -		root->rnode = node;
> -		root->height++;
> +		rcu_assign_pointer(root->rnode, node);
> +		root->height = newheight;
>  	} while (height > root->height);
>  out:
>  	return 0;
> @@ -258,6 +272,8 @@ int radix_tree_insert(struct radix_tree_
>  	int offset;
>  	int error;
>  
> +	BUG_ON(radix_tree_is_direct_ptr(item));
> +
>  	/* Make sure the tree is high enough.  */
>  	if (index > radix_tree_maxindex(root->height)) {
>  		error = radix_tree_extend(root, index);
> @@ -275,11 +291,12 @@ int radix_tree_insert(struct radix_tree_
>  			/* Have to add a child node.  */
>  			if (!(slot = radix_tree_node_alloc(root)))
>  				return -ENOMEM;
> +			slot->height = height;
>  			if (node) {
> -				node->slots[offset] = slot;
> +				rcu_assign_pointer(node->slots[offset], slot);
>  				node->count++;
>  			} else
> -				root->rnode = slot;
> +				rcu_assign_pointer(root->rnode, slot);
>  		}
>  
>  		/* Go a level down */
> @@ -295,11 +312,11 @@ int radix_tree_insert(struct radix_tree_
>  
>  	if (node) {
>  		node->count++;
> -		node->slots[offset] = item;
> +		rcu_assign_pointer(node->slots[offset], item);
>  		BUG_ON(tag_get(node, 0, offset));
>  		BUG_ON(tag_get(node, 1, offset));
>  	} else {
> -		root->rnode = item;
> +		rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
>  		BUG_ON(root_tag_get(root, 0));
>  		BUG_ON(root_tag_get(root, 1));
>  	}
> @@ -308,49 +325,51 @@ int radix_tree_insert(struct radix_tree_
>  }
>  EXPORT_SYMBOL(radix_tree_insert);
>  
> -static inline void **__lookup_slot(struct radix_tree_root *root,
> -				   unsigned long index)
> +/**
> + *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
> + *	@root:		radix tree root
> + *	@index:		index key
> + *
> + *	Lookup the slot corresponding to the position @index in the radix tree
> + *	@root. This is useful for update-if-exists operations.
> + *
> + *	This function cannot be called under rcu_read_lock, it must be
> + *	excluded from writers, as must the returned slot.
> + */
> +void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
>  {
>  	unsigned int height, shift;
> -	struct radix_tree_node **slot;
> -
> -	height = root->height;
> +	struct radix_tree_node *node, **slot;
>  
> -	if (index > radix_tree_maxindex(height))
> +	node = rcu_dereference(root->rnode);

If writers are excluded, why is the rcu_dereference() required?

> +	if (node == NULL)
>  		return NULL;
>  
> -	if (height == 0 && root->rnode)
> +	if (radix_tree_is_direct_ptr(node)) {
> +		if (index > 0)
> +			return NULL;
>  		return (void **)&root->rnode;
> +	}
> +
> +	height = node->height;
> +	if (index > radix_tree_maxindex(height))
> +		return NULL;
>  
>  	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> -	slot = &root->rnode;
>  
> -	while (height > 0) {
> -		if (*slot == NULL)
> +	do {
> +		slot = (struct radix_tree_node **)
> +			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
> +		node = rcu_dereference(*slot);

If writers are excluded, why is the rcu_dereference() required?

> +		if (node == NULL)
>  			return NULL;
>  
> -		slot = (struct radix_tree_node **)
> -			((*slot)->slots +
> -				((index >> shift) & RADIX_TREE_MAP_MASK));
>  		shift -= RADIX_TREE_MAP_SHIFT;
>  		height--;
> -	}
> +	} while (height > 0);
>  
>  	return (void **)slot;
>  }
> -
> -/**
> - *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
> - *	@root:		radix tree root
> - *	@index:		index key
> - *
> - *	Lookup the slot corresponding to the position @index in the radix tree
> - *	@root. This is useful for update-if-exists operations.
> - */
> -void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
> -{
> -	return __lookup_slot(root, index);
> -}
>  EXPORT_SYMBOL(radix_tree_lookup_slot);
>  
>  /**
> @@ -359,13 +378,45 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
>   *	@index:		index key
>   *
>   *	Lookup the item at the position @index in the radix tree @root.
> + *
> + *	This function can be called under rcu_read_lock, however the caller
> + *	must manage lifetimes of leaf nodes (eg. RCU may also be used to free
> + *	them safely). No RCU barriers are required to access or modify the
> + *	returned item, however.
>   */
>  void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
>  {
> -	void **slot;
> +	unsigned int height, shift;
> +	struct radix_tree_node *node, **slot;
> +
> +	node = rcu_dereference(root->rnode);
> +	if (node == NULL)
> +		return NULL;
> +
> +	if (radix_tree_is_direct_ptr(node)) {
> +		if (index > 0)
> +			return NULL;
> +		return radix_tree_direct_to_ptr(node);
> +	}
> +
> +	height = node->height;
> +	if (index > radix_tree_maxindex(height))
> +		return NULL;
>  
> -	slot = __lookup_slot(root, index);
> -	return slot != NULL ? *slot : NULL;
> +	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> +
> +	do {
> +		slot = (struct radix_tree_node **)
> +			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
> +		node = rcu_dereference(*slot);
> +		if (node == NULL)
> +			return NULL;
> +
> +		shift -= RADIX_TREE_MAP_SHIFT;
> +		height--;
> +	} while (height > 0);
> +
> +	return node;
>  }
>  EXPORT_SYMBOL(radix_tree_lookup);
>  

radix_tree_tag_set() needs a comment saying that updates must be
excluded.  Might be obvious to some, but...

Ditto for radix_tree_tag_clear().

radix_tree_tag_get() either needs a comment saying that updates must
be excluded, or needs rcu_dereference() around the access of root->rnode
near line 565.  Yes, this is test scaffolding, but...

> @@ -542,29 +593,25 @@ EXPORT_SYMBOL(radix_tree_tag_get);
>  #endif
>  
>  static unsigned int
> -__lookup(struct radix_tree_root *root, void **results, unsigned long index,
> +__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
>  	unsigned int max_items, unsigned long *next_index)
>  {

Not clear on why common code isn't shared more widely among the various
lookups...  Not -exactly- the same, I know, but...

>  	unsigned int nr_found = 0;
>  	unsigned int shift, height;
> -	struct radix_tree_node *slot;
>  	unsigned long i;
>  
> -	height = root->height;
> -	if (height == 0) {
> -		if (root->rnode && index == 0)
> -			results[nr_found++] = root->rnode;
> +	height = slot->height;
> +	if (height == 0)
>  		goto out;
> -	}
> -
>  	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> -	slot = root->rnode;
>  
>  	for ( ; height > 1; height--) {
> +		struct radix_tree_node *__s;
>  
>  		for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
>  				i < RADIX_TREE_MAP_SIZE; i++) {
> -			if (slot->slots[i] != NULL)
> +			__s = rcu_dereference(slot->slots[i]);
> +			if (__s != NULL)
>  				break;
>  			index &= ~((1UL << shift) - 1);
>  			index += 1UL << shift;
> @@ -575,14 +622,16 @@ __lookup(struct radix_tree_root *root, v
>  			goto out;
>  
>  		shift -= RADIX_TREE_MAP_SHIFT;
> -		slot = slot->slots[i];
> +		slot = __s;
>  	}
>  
>  	/* Bottom level: grab some items */
>  	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
> +		struct radix_tree_node *node;
>  		index++;
> -		if (slot->slots[i]) {
> -			results[nr_found++] = slot->slots[i];
> +		node = slot->slots[i];
> +		if (node) {
> +			results[nr_found++] = node;
>  			if (nr_found == max_items)
>  				goto out;
>  		}
> @@ -604,28 +653,54 @@ out:
>   *	*@results.
>   *
>   *	The implementation is naive.
> + *
> + *	Like radix_tree_lookup, radix_tree_gang_lookup may be called under
> + *	rcu_read_lock. In this case, rather than the returned results being
> + *	an atomic snapshot of the tree at a single point in time, the semantics
> + *	of an RCU protected gang lookup are as though multiple radix_tree_lookups
> + *	have been issued in individual locks, and results stored in 'results'.
>   */
>  unsigned int
>  radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
>  			unsigned long first_index, unsigned int max_items)
>  {
> -	const unsigned long max_index = radix_tree_maxindex(root->height);
> +	unsigned long max_index;
> +	struct radix_tree_node *node;
>  	unsigned long cur_index = first_index;
> -	unsigned int ret = 0;
> +	unsigned int ret;
> +
> +	node = rcu_dereference(root->rnode);
> +	if (!node)
> +		return 0;
> +
> +	if (radix_tree_is_direct_ptr(node)) {
> +		if (first_index > 0)
> +			return 0;
> +		results[0] = radix_tree_direct_to_ptr(node);
> +		ret = 1;
> +		goto out;
> +	}
>  
> +	max_index = radix_tree_maxindex(node->height);
> +
> +	ret = 0;
>  	while (ret < max_items) {
>  		unsigned int nr_found;
>  		unsigned long next_index;	/* Index of next search */
>  
>  		if (cur_index > max_index)
>  			break;
> -		nr_found = __lookup(root, results + ret, cur_index,
> +		nr_found = __lookup(node, results + ret, cur_index,
>  					max_items - ret, &next_index);
>  		ret += nr_found;
>  		if (next_index == 0)
>  			break;
>  		cur_index = next_index;
>  	}
> +
> +out:
> +	(void)rcu_dereference(results); /* single barrier for all results */
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(radix_tree_gang_lookup);
> @@ -635,24 +710,16 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
>   * open-coding the search.
>   */
>  static unsigned int
> -__lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
> +__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
>  	unsigned int max_items, unsigned long *next_index, unsigned int tag)
>  {
>  	unsigned int nr_found = 0;
>  	unsigned int shift;
> -	unsigned int height = root->height;
> -	struct radix_tree_node *slot;
> -
> -	if (height == 0) {
> -		if (root->rnode && index == 0)
> -			results[nr_found++] = root->rnode;
> -		goto out;
> -	}
> +	unsigned int height = slot->height;
>  
>  	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> -	slot = root->rnode;
>  
> -	do {
> +	while (height > 0) {
>  		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
>  
>  		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
> @@ -683,7 +750,7 @@ __lookup_tag(struct radix_tree_root *roo
>  		}
>  		shift -= RADIX_TREE_MAP_SHIFT;
>  		slot = slot->slots[i];
> -	} while (height > 0);
> +	}
>  out:
>  	*next_index = index;
>  	return nr_found;
> @@ -707,7 +774,8 @@ radix_tree_gang_lookup_tag(struct radix_
>  		unsigned long first_index, unsigned int max_items,
>  		unsigned int tag)

Need header comment saying that pointers returned in results array
must be rcu_dereferenced() by RCU callers before use.

But would be better to just rcu_dereference() them in __lookup_tag():745.
This is zero cost on all but Alpha, and keeps the RCU effects more
localized.

Or must radix_tree_gang_lookup_tag() exclude updates?  If so, need to
say so in header comment.  Also, in this case, the rcu_dereference()
calls are unneeded.

>  {
> -	const unsigned long max_index = radix_tree_maxindex(root->height);
> +	struct radix_tree_node *node;
> +	unsigned long max_index;
>  	unsigned long cur_index = first_index;
>  	unsigned int ret = 0;
>  
> @@ -715,13 +783,27 @@ radix_tree_gang_lookup_tag(struct radix_
>  	if (!root_tag_get(root, tag))
>  		return 0;
>  
> +	node = rcu_dereference(root->rnode);
> +	if (!node)
> +		return ret;
> +
> +	if (radix_tree_is_direct_ptr(node)) {
> +		if (first_index > 0)
> +			return 0;
> +		node = radix_tree_direct_to_ptr(node);
> +		results[0] = node;
> +		return 1;
> +	}
> +
> +	max_index = radix_tree_maxindex(node->height);
> +
>  	while (ret < max_items) {
>  		unsigned int nr_found;
>  		unsigned long next_index;	/* Index of next search */
>  
>  		if (cur_index > max_index)
>  			break;
> -		nr_found = __lookup_tag(root, results + ret, cur_index,
> +		nr_found = __lookup_tag(node, results + ret, cur_index,
>  					max_items - ret, &next_index, tag);
>  		ret += nr_found;
>  		if (next_index == 0)
> @@ -743,8 +825,17 @@ static inline void radix_tree_shrink(str
>  			root->rnode->count == 1 &&
>  			root->rnode->slots[0]) {
>  		struct radix_tree_node *to_free = root->rnode;
> +		void *newptr;
>  
> -		root->rnode = to_free->slots[0];
> +		/*
> +		 * this doesn't need an rcu_assign_pointer, because
> +		 * we aren't touching the object that to_free->slots[0]
> +		 * points to.
> +		 */

I found this comment to be confusing.  Suggest something like the following:

		/*
		 * We don't need rcu_assign_pointer(), since we are
		 * simply moving the node from one part of the tree
		 * to another.  If it was safe to dereference a pointer
		 * to it before, it is still safe to dereference a
		 * pointer to it afterwards.
		 */

A few more lines, but well worth it IMHO.  Someone with a clearer
understanding of the code might be able to create a more concise
comment that still gets the idea across clearly.  ;-)

> +		newptr = to_free->slots[0];
> +		if (root->height == 1)
> +			newptr = radix_tree_ptr_to_direct(newptr);
> +		root->rnode = newptr;
>  		root->height--;
>  		/* must only free zeroed nodes into the slab */
>  		tag_clear(to_free, 0, 0);
> @@ -768,6 +859,7 @@ void *radix_tree_delete(struct radix_tre

Need header comment in radix_tree_delete() saying that if the caller
also does lock-free searches, that it is the caller's responsibility
to wait a grace period (e.g., call_rcu() or synchronize_rcu()) before
freeing or re-using the removed item.  Otherwise, people will naturally
assume that radix_tree_delete() took care of this detail for them.

For that matter, should radix_tree_delete() do a synchronize_rcu()?
My belief is "no", since radix_tree_delete() has no way of knowing whether
this particular tree is searched lock-free.  One option would be to have
some sort of option bit in the tree telling radix_tree_delete() to do
a synchronize_rcu().  It does not make sense for radix_tree_delete()
to do call_rcu(), since it has no idea what needs to be done to free
the item removed -- it could well have arbitrary frilly data structures
linked to it that also need to be freed.

Also should add comment saying that caller is responsible for excluding
other updates to this tree.  Obvious perhaps, but better safe than sorry.

>  {
>  	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
>  	struct radix_tree_node *slot = NULL;
> +	struct radix_tree_node *to_free;
>  	unsigned int height, shift;
>  	int tag;
>  	int offset;
> @@ -778,6 +870,7 @@ void *radix_tree_delete(struct radix_tre
>  
>  	slot = root->rnode;
>  	if (height == 0 && root->rnode) {
> +		slot = radix_tree_direct_to_ptr(slot);
>  		root_tag_clear_all(root);
>  		root->rnode = NULL;
>  		goto out;
> @@ -810,8 +903,11 @@ void *radix_tree_delete(struct radix_tre
>  			radix_tree_tag_clear(root, index, tag);
>  	}
>  
> +	to_free = NULL;
>  	/* Now free the nodes we do not need anymore */
>  	while (pathp->node) {
> +		if (to_free)
> +			radix_tree_node_free(to_free);
>  		pathp->node->slots[pathp->offset] = NULL;
>  		pathp->node->count--;
>  
> @@ -822,13 +918,15 @@ void *radix_tree_delete(struct radix_tre
>  		}
>  
>  		/* Node with zero slots in use so free it */
> -		radix_tree_node_free(pathp->node);
> -
> +		to_free = pathp->node;
>  		pathp--;
> +
>  	}
>  	root_tag_clear_all(root);
>  	root->height = 0;
>  	root->rnode = NULL;
> +	if (to_free)
> +		radix_tree_node_free(to_free);
>  
>  out:
>  	return slot;
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -22,6 +22,8 @@
>  #include <linux/sched.h>
>  #include <linux/preempt.h>
>  #include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <linux/rcupdate.h>
>  
>  #define RADIX_TREE_MAX_TAGS 2
>  
> @@ -48,6 +50,37 @@ do {									\
>  	(root)->rnode = NULL;						\
>  } while (0)
>  
> +#define RADIX_TREE_DIRECT_PTR	1
> +
> +static inline void *radix_tree_ptr_to_direct(void *ptr)
> +{
> +	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
> +}
> +
> +static inline void *radix_tree_direct_to_ptr(void *ptr)
> +{
> +	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
> +}
> +
> +static inline int radix_tree_is_direct_ptr(void *ptr)
> +{
> +	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
> +}
> +
> +static inline void *__radix_tree_deref_slot(void **slot)
> +{
> +	return rcu_dereference(radix_tree_direct_to_ptr(*slot));
> +}
> +
> +#define radix_tree_deref_slot(slot) __radix_tree_deref_slot((void **)slot)
> +
> +static inline void radix_tree_replace_slot(void **slot, void *item)
> +{
> +	BUG_ON(radix_tree_is_direct_ptr(item));
> +	rcu_assign_pointer(*slot,
> +		(void *)((unsigned long)item | ((unsigned long)*slot & RADIX_TREE_DIRECT_PTR)));
> +}
> +
>  int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
>  void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
>  void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -228,7 +228,7 @@ int migrate_page_remove_references(struc
>  						page_index(page));
>  
>  	if (!page_mapping(page) || page_count(page) != nr_refs ||
> -			*radix_pointer != page) {
> +			radix_tree_deref_slot(radix_pointer) != page) {
>  		write_unlock_irq(&mapping->tree_lock);
>  		return -EAGAIN;
>  	}
> @@ -249,7 +249,7 @@ int migrate_page_remove_references(struc
>  		set_page_private(newpage, page_private(page));
>  	}
>  
> -	*radix_pointer = newpage;
> +	radix_tree_replace_slot(radix_pointer, newpage);
>  	__put_page(page);
>  	write_unlock_irq(&mapping->tree_lock);
>  

Rough notes, FYA:

o	Don't the items placed into the radix tree need to be protected
	by RCU?  If not, how does the radix tree avoid handing a pointer
	to something that has recently been removed, that the caller to
	radix_tree_delete() might have already freed?

	If so, what about nfs_inode_add_request(), which adds struct
	nfs_page to a radix tree?  This structure does not appear to
	have any sort of RCU protection:

	o	nfs_inode_remove_request() invokes radix_tree_delete(),
		drops the spinlock (optionally invoking nfs_end_data_update()),
		calls nfs_clear_request() and then nfs_release_request().

	o	nfs_clear_request() invokes page_cache_release(), which
		calls put_page, which usually calls __page_cache_release(),
		which does stuff under a spinlock that cannot include
		waiting for a grace period to elapse.  But I suppose it
		could set a flag or something...

		o	__ClearPageLRU(): just clears LRU bit.

		o	del_page_from_lru(): just deletes from lru
			list and clears page-active bits if they were
			set.

	But it seems that all radix-tree operations are under a
	spinlock, so this works after all.

	Other calls to radix_tree_delete():

	o	__remove_from_page_cache() is called under
		->tree_lock in all cases, as are other non-initialization
		uses of ->page_tree.

	And that appears to be it.  So where is Ben Herrenschmidt's
	lock-free use of radix-tree lookup?  No matter -- Ben just needs
	to make sure to RCU-protect whatever the heck he is putting into
	the radix tree if he is calling the lookup functions without
	locking.  ;-)

	Probably worth a comment to radix_tree_delete() saying that the
	caller must wait for a grace period before freeing/reusing
	the item removed from the tree!  (But only if searching done
	without locking.)

o	radix_tree_shrink() -- what is to_free?  freelist?  If so, protected
	by what lock?  OK -- the reason that rcu_assign_pointer() can be
	omitted is that we are simply moving the element unchanged from
	one location in the tree to another -- if it was safe to access
	before, it is safe to access now, no additional memory barriers
	required.

	And to_free is just a local variable -- must be blind today.  :-/

	Need to make the comment more clear.

o	RCU-protected fields:

	o	radix_tree_root->rnode
	
		o	rcu_assign_pointer() in radix_tree_extend():252.

		o	rcu_assign_pointer() in radix_tree_insert():299.

		o	rcu_assign_pointer() in radix_tree_insert():319.

		o	assignment in radix_tree_extend():245 OK
			(newly allocated).

		o	access in radix_tree_insert():284 OK, as
			presumably other writes are excluded at this
			point -- we are inserting, after all...

		o	rcu_dereference() in radix_tree_lookup_slot():344.
			But comment says that writers must be excluded,
			so not needed. @@@

		o	access in radix_tree_lookup_slot():351
			(returning a pointer to the value -- but
			the caller needs to use rcu_deference() on
			returned pointer unless the caller holds a
			lock over all searches -- currently, this
			is the case, but need a comment...  Which is
			already present: "it must be excluded from
			writers, as must the returned slot".)  OK.

		o	rcu_dereference() in radix_tree_lookup():392

		o	Address arithmetic in radix_tree_lookup():410.  OK.

		o	rcu_dereference() in radix_tree_lookup():411.

		o	access in radix_tree_tag_set():446.  @@@ Need
			comment saying that other updates must be excluded.
			Given that other updates are excluded, no need
			for rcu_dereference().

		o	access in radix_tree_tag_clear():496.  @@@ Need
			comment saying that other updates must be excluded.
			Given that other updates are excluded, no need
			for rcu_dereference().

		o	access in radix_tree_tag_get():565.  @@@ Need
			comment saying that other updates must be excluded.
			Given that other updates are excluded, no need
			for rcu_dereference().  OTOH, this is only for
			the test harness -- but -someone- will figure out
			a need for it, so might as well get it right.

			Alternatively, add an rcu_dereference().

		o	rcu_dereference() in radix_tree_gang_lookup():672.

		o	rcu_dereference() in radix_tree_gang_lookup_tag():786.

		o	access in radix_tree_shrink():825.  Presumably
			updates are excluded.  Static function, so OK.

			Ditto on line 827.  Ditto for line 838 (and
			rcu_assign_pointer() not needed because element
			is being moved unchanged rather than inserted).

		o	assignment in radix_tree_delete():871.  OK since
			updates presumably excluded, but should so indicate
			in header comment.

			Ditto on line 872, 875, 915, and 927.

		o	access in radix_tree_extend():230.  OK since
			updates presumably excluded, static function,
			so no need for header comment.

	o	radix_tree_node->slots[i]

		o	rcu_assign_pointer() in radix_tree_insert():296.

		o	assignment in radix_tree_insert():305.  OK, as
			we are presumably excluding updates.

		o	rcu_assign_pointer() in radix_tree_insert():315.

		o	access in radix_tree_shrink():826.  Presumably
			updates are excluded.  Static function, so OK.

		o	assignment in radix_tree_extend():241.  OK, since
			structure newly allocated and not yet accessible
			to other CPUs.

		o	address calculation in radix_tree_lookup_slot():362 OK.

		o	rcu_dereference() in radix_tree_lookup_slot():363.
			Writers excluded, so why needed?  @@@

		o	access in radix_tree_tag_set():455.  Already
			covered in comments for ->rnode.

		o	access in radix_tree_tag_clear():507. Already covered
			in comments for ->rnode.

		o	access in radix_tree_tag_get():587. Already covered
			in comments for ->rnode.

		o	rcu_dereference() in __lookup():613.

		o	access in __lookup():632.  OK, due to earlier
			rcu_dereference() -but- caller needs to
			rcu_dereference() the returned pointers if relying
			on RCU to protect elements.  No, the needed
			rcu_dereference() is done at the end of
			radix_tree_gang_lookup(), so OK.

			@@@ Not clear on why __lookup() isn't common code
			for radix_tree_lookup() or radix_tree_lookup_slot()...

		o	access in __lookup_tag():727.  OK, pointer comparison.

			Ditto on line 744.

		o	access in __lookup_tag():745.  OK, but caller needs
			to do rcu_dereference() before using pointer...

			@@@ radix_tree_gang_lookup_tag() needs header comment
			calling for rcu_dereference() of elements of results
			array, or need rcu_dereference() on this access.

			Or must radix_tree_gang_lookup_tag() be protected
			by lock?  If so, need to say so!

		o	access in __lookup_tag():752.  Need rcu_dereference()
			unless radix_tree_gang_lookup() is to exclude
			updates (in this latter case, the header comment
			needs to say this!).  @@@

		o	access in radix_tree_shrink():835.  OK, as we
			presumably are excluding updates.

		o	assignment in radix_tree_shrink():843: Ditto.

		o	access in radix_tree_delete():890. OK, as we are
			presumably excluding updates.

		o	assignment in radix_tree_delete():911.  Ditto.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-22  1:49   ` Paul E. McKenney
@ 2006-06-22 15:45     ` Nick Piggin
  2006-06-22 16:30       ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-22 15:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney,
	Linux Kernel, Linux Memory Management

On Wed, Jun 21, 2006 at 06:49:49PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 20, 2006 at 04:48:48PM +0200, Nick Piggin wrote:
> 
> Pretty close, but some questions and comments interspersed.
> Search for empty lines to find them.
> 
> Rough notes from review process at the end, in case anyone has
> insomnia or something.

Wow ;)
Thanks Paul.

> > +void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
> >  {
> >  	unsigned int height, shift;
> > -	struct radix_tree_node **slot;
> > -
> > -	height = root->height;
> > +	struct radix_tree_node *node, **slot;
> >  
> > -	if (index > radix_tree_maxindex(height))
> > +	node = rcu_dereference(root->rnode);
> 
> If writers are excluded, why is the rcu_dereference() required?

Good point, in earlier versions, radix_tree_lookup_slot was RCU-able,
but the direct-data patch ended that idea. I'll go through and clean
these out.

> radix_tree_tag_set() needs a comment saying that updates must be
> excluded.  Might be obvious to some, but...
> 
> Ditto for radix_tree_tag_clear().
> 
> radix_tree_tag_get() either needs a comment saying that updates must
> be excluded, or needs rcu_dereference() around the access of root->rnode
> near line 565.  Yes, this is test scaffolding, but...

Indeed. I mainly intended to document APIs that _are_ RCU-able,
but documented the negative for lookup_slot because it is similar
to lookup... but that's messy too.

I'll probably put a little table in radix-tree.h to summarise the
API synchronisation requirements, OK?

> 
> > @@ -542,29 +593,25 @@ EXPORT_SYMBOL(radix_tree_tag_get);
> >  #endif
> >  
> >  static unsigned int
> > -__lookup(struct radix_tree_root *root, void **results, unsigned long index,
> > +__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
> >  	unsigned int max_items, unsigned long *next_index)
> >  {
> 
> Not clear on why common code isn't shared more widely among the various
> lookups...  Not -exactly- the same, I know, but...

Yeah it can get a bit tricky. It's gets a small cleanup now and again,
which is probably not a bad pace for our pagecache data structure ;)

> >  radix_tree_gang_lookup(struct radix_tree_root *root, void **results,

...

> > +out:
> > +	(void)rcu_dereference(results); /* single barrier for all results */

...

> > @@ -707,7 +774,8 @@ radix_tree_gang_lookup_tag(struct radix_
> >  		unsigned long first_index, unsigned int max_items,
> >  		unsigned int tag)
> 
> Need header comment saying that pointers returned in results array
> must be rcu_dereferenced() by RCU callers before use.
> 
> But would be better to just rcu_dereference() them in __lookup_tag():745.
> This is zero cost on all but Alpha, and keeps the RCU effects more
> localized.

Does the single rcu_dereference in radix_tree_gang_lookup look OK?

> 
> Or must radix_tree_gang_lookup_tag() exclude updates?  If so, need to
> say so in header comment.  Also, in this case, the rcu_dereference()
> calls are unneeded.

Ah indeed, that's confusing. Yes, the lookup_tag must exclude updates.
I guess I got too mechanical in my conversion... however, tag lookups
can be RCUified without a great deal of trouble, so I might take this
opportunity.

> > @@ -743,8 +825,17 @@ static inline void radix_tree_shrink(str
> >  			root->rnode->count == 1 &&
> >  			root->rnode->slots[0]) {
> >  		struct radix_tree_node *to_free = root->rnode;
> > +		void *newptr;
> >  
> > -		root->rnode = to_free->slots[0];
> > +		/*
> > +		 * this doesn't need an rcu_assign_pointer, because
> > +		 * we aren't touching the object that to_free->slots[0]
> > +		 * points to.
> > +		 */
> 
> I found this comment to be confusing.  Suggest something like the following:
> 
> 		/*
> 		 * We don't need rcu_assign_pointer(), since we are
> 		 * simply moving the node from one part of the tree
> 		 * to another.  If it was safe to dereference a pointer
> 		 * to it before, it is still safe to dereference a
> 		 * pointer to it afterwards.
> 		 */
> 
> A few more lines, but well worth it IMHO.  Someone with a clearer
> understanding of the code might be able to create a more concise
> comment that still gets the idea across clearly.  ;-)

That's pretty good, I just made a couple of tiny changes, what do you think?
                /*
                 * We don't need rcu_assign_pointer(), since we are simply
                 * moving the node from one part of the tree to another. If
                 * it was safe to dereference the old pointer to it
                 * (to_free->slots[0]), it will be safe to dereference the new
                 * one (root->rnode).
                 */

> 
> > +		newptr = to_free->slots[0];
> > +		if (root->height == 1)
> > +			newptr = radix_tree_ptr_to_direct(newptr);
> > +		root->rnode = newptr;
> >  		root->height--;
> >  		/* must only free zeroed nodes into the slab */
> >  		tag_clear(to_free, 0, 0);
> > @@ -768,6 +859,7 @@ void *radix_tree_delete(struct radix_tre
> 
> Need header comment in radix_tree_delete() saying that if the caller
> also does lock-free searches, that it is the caller's responsibility
> to wait a grace period (e.g., call_rcu() or synchronize_rcu()) before
> freeing or re-using the removed item.  Otherwise, people will naturally
> assume that radix_tree_delete() took care of this detail for them.
> 

I've tried to get that message across in the radix_tree_lookup_slot
comment, if they're using RCU lookups. Enough? I guess I'll add it
to the locking summary too.

> For that matter, should radix_tree_delete() do a synchronize_rcu()?
> My belief is "no", since radix_tree_delete() has no way of knowing whether
> this particular tree is searched lock-free.  One option would be to have
> some sort of option bit in the tree telling radix_tree_delete() to do
> a synchronize_rcu().  It does not make sense for radix_tree_delete()
> to do call_rcu(), since it has no idea what needs to be done to free
> the item removed -- it could well have arbitrary frilly data structures
> linked to it that also need to be freed.

Agreed, I think "no". In the current (locked) radix-tree, the caller is
still responsible to manage lifetimes of the data (leaf) nodes.

> 
> Also should add comment saying that caller is responsible for excluding
> other updates to this tree.  Obvious perhaps, but better safe than sorry.
> 

Yeah, I'll try to make all this a bit clearer in the locking summary.

> Rough notes, FYA:
> 
> o	Don't the items placed into the radix tree need to be protected
> 	by RCU?  If not, how does the radix tree avoid handing a pointer
> 	to something that has recently been removed, that the caller to
> 	radix_tree_delete() might have already freed?

Yes, they'll need to be protected by something. In the lockless pagecache,
they are never freed, so that isn't an issue. But other users (Ben's
irq patch perhaps, unless it uses the slot pointers directly) will have to
be careful.

> 
> 	If so, what about nfs_inode_add_request(), which adds struct
> 	nfs_page to a radix tree?  This structure does not appear to
> 	have any sort of RCU protection:

Well if they retain their existing locking, there shouldn't be a problem.
(as you say below...)

> 	But it seems that all radix-tree operations are under a
> 	spinlock, so this works after all.
> 
> 	Other calls to radix_tree_delete():
> 
> 	o	__remove_from_page_cache() is called under
> 		->tree_lock in all cases, as are other non-initialization
> 		uses of ->page_tree.
> 
> 	And that appears to be it.  So where is Ben Herrenschmidt's
> 	lock-free use of radix-tree lookup?  No matter -- Ben just needs
> 	to make sure to RCU-protect whatever the heck he is putting into
> 	the radix tree if he is calling the lookup functions without
> 	locking.  ;-)
> 
> 	Probably worth a comment to radix_tree_delete() saying that the
> 	caller must wait for a grace period before freeing/reusing
> 	the item removed from the tree!  (But only if searching done
> 	without locking.)
> 
> o	radix_tree_shrink() -- what is to_free?  freelist?  If so, protected
> 	by what lock?  OK -- the reason that rcu_assign_pointer() can be
> 	omitted is that we are simply moving the element unchanged from
> 	one location in the tree to another -- if it was safe to access
> 	before, it is safe to access now, no additional memory barriers
> 	required.
> 
> 	And to_free is just a local variable -- must be blind today.  :-/
> 
> 	Need to make the comment more clear.

Yeah, to_free is just a redundant node that is being chopped off the top
of the tree. It does go through a grace period of course, because lockless
lookups can still be referencing it.

> 
> o	RCU-protected fields:

...

Thanks very much Paul. Very thorough review.

I'll send out an incremental diff with changes.

Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-22 15:45     ` Nick Piggin
@ 2006-06-22 16:30       ` Paul E. McKenney
  2006-06-22 16:55         ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2006-06-22 16:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney,
	Linux Kernel, Linux Memory Management

On Thu, Jun 22, 2006 at 05:45:18PM +0200, Nick Piggin wrote:
> On Wed, Jun 21, 2006 at 06:49:49PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 20, 2006 at 04:48:48PM +0200, Nick Piggin wrote:
> > 
> > Pretty close, but some questions and comments interspersed.
> > Search for empty lines to find them.
> > 
> > Rough notes from review process at the end, in case anyone has
> > insomnia or something.
> 
> Wow ;)
> Thanks Paul.

This process -really- needs to be mechanized.  ;-)  Some people are
working on it...

> > > +void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
> > >  {
> > >  	unsigned int height, shift;
> > > -	struct radix_tree_node **slot;
> > > -
> > > -	height = root->height;
> > > +	struct radix_tree_node *node, **slot;
> > >  
> > > -	if (index > radix_tree_maxindex(height))
> > > +	node = rcu_dereference(root->rnode);
> > 
> > If writers are excluded, why is the rcu_dereference() required?
> 
> Good point, in earlier versions, radix_tree_lookup_slot was RCU-able,
> but the direct-data patch ended that idea. I'll go through and clean
> these out.

Fair enough!

> > radix_tree_tag_set() needs a comment saying that updates must be
> > excluded.  Might be obvious to some, but...
> > 
> > Ditto for radix_tree_tag_clear().
> > 
> > radix_tree_tag_get() either needs a comment saying that updates must
> > be excluded, or needs rcu_dereference() around the access of root->rnode
> > near line 565.  Yes, this is test scaffolding, but...
> 
> Indeed. I mainly intended to document APIs that _are_ RCU-able,
> but documented the negative for lookup_slot because it is similar
> to lookup... but that's messy too.
> 
> I'll probably put a little table in radix-tree.h to summarise the
> API synchronisation requirements, OK?

Makes sense to me -- except will that feed into the docbook stuff?
It seems to me to be really important to get these sorts of requirements
included in the docbook stuff.  I have had too many people show me
code that assumed that RCU somehow synchronizes updates, so it is
good to call out these requirements early and often.

> > > @@ -542,29 +593,25 @@ EXPORT_SYMBOL(radix_tree_tag_get);
> > >  #endif
> > >  
> > >  static unsigned int
> > > -__lookup(struct radix_tree_root *root, void **results, unsigned long index,
> > > +__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
> > >  	unsigned int max_items, unsigned long *next_index)
> > >  {
> > 
> > Not clear on why common code isn't shared more widely among the various
> > lookups...  Not -exactly- the same, I know, but...
> 
> Yeah it can get a bit tricky. It's gets a small cleanup now and again,
> which is probably not a bad pace for our pagecache data structure ;)

;-)

> > >  radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
> 
> ...
> 
> > > +out:
> > > +	(void)rcu_dereference(results); /* single barrier for all results */
> 
> ...
> 
> > > @@ -707,7 +774,8 @@ radix_tree_gang_lookup_tag(struct radix_
> > >  		unsigned long first_index, unsigned int max_items,
> > >  		unsigned int tag)
> > 
> > Need header comment saying that pointers returned in results array
> > must be rcu_dereferenced() by RCU callers before use.
> > 
> > But would be better to just rcu_dereference() them in __lookup_tag():745.
> > This is zero cost on all but Alpha, and keeps the RCU effects more
> > localized.
> 
> Does the single rcu_dereference in radix_tree_gang_lookup look OK?

Well, it does put a memory barrier in the right place on Alpha, but the
intent would be more clear to me if the rcu_dereference() were on the
assignments to each element of the results array.  And there would be
no additional overhead on most architectures.

So I would much prefer the rcu_dereference() be on the assignment to
the results array.

> > Or must radix_tree_gang_lookup_tag() exclude updates?  If so, need to
> > say so in header comment.  Also, in this case, the rcu_dereference()
> > calls are unneeded.
> 
> Ah indeed, that's confusing. Yes, the lookup_tag must exclude updates.
> I guess I got too mechanical in my conversion... however, tag lookups
> can be RCUified without a great deal of trouble, so I might take this
> opportunity.

The tag lookups would then find anything that (1) had been tagged in a
prior operation and (2) had not been deleted in the meantime, right?
And the caller could hold a lock across both the tagging and tag
lookup if greater certainty was desired.  I could imagine this sort
of semantic being useful for deferred operations on ranges of memory,
where new additions would have the operation implicit in creation and any
deletions would no longer need the operation to be performed (or might
be performed as part of deletion operation), but have not actually used
this sort of thing myself.

So I must defer to people who have used tagging and tagged lookups
in anger.

> > > @@ -743,8 +825,17 @@ static inline void radix_tree_shrink(str
> > >  			root->rnode->count == 1 &&
> > >  			root->rnode->slots[0]) {
> > >  		struct radix_tree_node *to_free = root->rnode;
> > > +		void *newptr;
> > >  
> > > -		root->rnode = to_free->slots[0];
> > > +		/*
> > > +		 * this doesn't need an rcu_assign_pointer, because
> > > +		 * we aren't touching the object that to_free->slots[0]
> > > +		 * points to.
> > > +		 */
> > 
> > I found this comment to be confusing.  Suggest something like the following:
> > 
> > 		/*
> > 		 * We don't need rcu_assign_pointer(), since we are
> > 		 * simply moving the node from one part of the tree
> > 		 * to another.  If it was safe to dereference a pointer
> > 		 * to it before, it is still safe to dereference a
> > 		 * pointer to it afterwards.
> > 		 */
> > 
> > A few more lines, but well worth it IMHO.  Someone with a clearer
> > understanding of the code might be able to create a more concise
> > comment that still gets the idea across clearly.  ;-)
> 
> That's pretty good, I just made a couple of tiny changes, what do you think?
>                 /*
>                  * We don't need rcu_assign_pointer(), since we are simply
>                  * moving the node from one part of the tree to another. If
>                  * it was safe to dereference the old pointer to it
>                  * (to_free->slots[0]), it will be safe to dereference the new
>                  * one (root->rnode).
>                  */

Looks good to me!

> > > +		newptr = to_free->slots[0];
> > > +		if (root->height == 1)
> > > +			newptr = radix_tree_ptr_to_direct(newptr);
> > > +		root->rnode = newptr;
> > >  		root->height--;
> > >  		/* must only free zeroed nodes into the slab */
> > >  		tag_clear(to_free, 0, 0);
> > > @@ -768,6 +859,7 @@ void *radix_tree_delete(struct radix_tre
> > 
> > Need header comment in radix_tree_delete() saying that if the caller
> > also does lock-free searches, that it is the caller's responsibility
> > to wait a grace period (e.g., call_rcu() or synchronize_rcu()) before
> > freeing or re-using the removed item.  Otherwise, people will naturally
> > assume that radix_tree_delete() took care of this detail for them.
> 
> I've tried to get that message across in the radix_tree_lookup_slot
> comment, if they're using RCU lookups. Enough? I guess I'll add it
> to the locking summary too.

This is what I saw in the radix_tree_lookup_slot() comment:

+ *   radix_tree_lookup_slot    -    lookup a slot in a radix tree
+ *   @root:          radix tree root
+ *   @index:         index key
+ *
+ *   Lookup the slot corresponding to the position @index in the radix tree
+ *   @root. This is useful for update-if-exists operations.
+ *
+ *   This function cannot be called under rcu_read_lock, it must be
+ *   excluded from writers, as must the returned slot.

This comment does not make the RCU-protection point clear to me.
The constraint is that if you use RCU radix-tree lookups, then you must
use synchronize_rcu() or call_rcu() when freeing any elements removed
from the radix tree via radix_tree_delete().

Or am I missing something here?

> > For that matter, should radix_tree_delete() do a synchronize_rcu()?
> > My belief is "no", since radix_tree_delete() has no way of knowing whether
> > this particular tree is searched lock-free.  One option would be to have
> > some sort of option bit in the tree telling radix_tree_delete() to do
> > a synchronize_rcu().  It does not make sense for radix_tree_delete()
> > to do call_rcu(), since it has no idea what needs to be done to free
> > the item removed -- it could well have arbitrary frilly data structures
> > linked to it that also need to be freed.
> 
> Agreed, I think "no". In the current (locked) radix-tree, the caller is
> still responsible to manage lifetimes of the data (leaf) nodes.

Sounds good -- if people need additional functionality, it can always be
added later.

> > Also should add comment saying that caller is responsible for excluding
> > other updates to this tree.  Obvious perhaps, but better safe than sorry.
> 
> Yeah, I'll try to make all this a bit clearer in the locking summary.

Sounds good!

> > Rough notes, FYA:

You went through all these as well?  Hope you have recovered from the
bout of insomnia!  ;-)

> > o	Don't the items placed into the radix tree need to be protected
> > 	by RCU?  If not, how does the radix tree avoid handing a pointer
> > 	to something that has recently been removed, that the caller to
> > 	radix_tree_delete() might have already freed?
> 
> Yes, they'll need to be protected by something. In the lockless pagecache,
> they are never freed, so that isn't an issue. But other users (Ben's
> irq patch perhaps, unless it uses the slot pointers directly) will have to
> be careful.

Ah!  If they are never freed, there is still a need to take care when
reusing them.  One approach is to prevent them from being reused until
after a grace period has elapsed, another is to use revalidation checks
after lookup.  Either way, one needs to allow for the fact that a
lookup might hand you something that has just been deleted.

> > 	If so, what about nfs_inode_add_request(), which adds struct
> > 	nfs_page to a radix tree?  This structure does not appear to
> > 	have any sort of RCU protection:
> 
> Well if they retain their existing locking, there shouldn't be a problem.
> (as you say below...)

Agreed!

> > 	But it seems that all radix-tree operations are under a
> > 	spinlock, so this works after all.
> > 
> > 	Other calls to radix_tree_delete():
> > 
> > 	o	__remove_from_page_cache() is called under
> > 		->tree_lock in all cases, as are other non-initialization
> > 		uses of ->page_tree.
> > 
> > 	And that appears to be it.  So where is Ben Herrenschmidt's
> > 	lock-free use of radix-tree lookup?  No matter -- Ben just needs
> > 	to make sure to RCU-protect whatever the heck he is putting into
> > 	the radix tree if he is calling the lookup functions without
> > 	locking.  ;-)
> > 
> > 	Probably worth a comment to radix_tree_delete() saying that the
> > 	caller must wait for a grace period before freeing/reusing
> > 	the item removed from the tree!  (But only if searching done
> > 	without locking.)
> > 
> > o	radix_tree_shrink() -- what is to_free?  freelist?  If so, protected
> > 	by what lock?  OK -- the reason that rcu_assign_pointer() can be
> > 	omitted is that we are simply moving the element unchanged from
> > 	one location in the tree to another -- if it was safe to access
> > 	before, it is safe to access now, no additional memory barriers
> > 	required.
> > 
> > 	And to_free is just a local variable -- must be blind today.  :-/
> > 
> > 	Need to make the comment more clear.
> 
> Yeah, to_free is just a redundant node that is being chopped off the top
> of the tree. It does go through a grace period of course, because lockless
> lookups can still be referencing it.

Yep -- chopping out a no-longer-needed level of the tree.  Made sense
after a couple of scans over the code.

> > o	RCU-protected fields:
> 
> ...
> 
> Thanks very much Paul. Very thorough review.
> 
> I'll send out an incremental diff with changes.

Looking forward to it!  Maybe not as anxiously as Ben Herrenschmidt, but
so it goes.  ;-)

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-22 16:30       ` Paul E. McKenney
@ 2006-06-22 16:55         ` Nick Piggin
  2006-06-22 17:40           ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-22 16:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney,
	Linux Kernel, Linux Memory Management

On Thu, Jun 22, 2006 at 09:30:32AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 22, 2006 at 05:45:18PM +0200, Nick Piggin wrote:
> > 
> > I'll probably put a little table in radix-tree.h to summarise the
> > API synchronisation requirements, OK?
> 
> Makes sense to me -- except will that feed into the docbook stuff?
> It seems to me to be really important to get these sorts of requirements
> included in the docbook stuff.  I have had too many people show me
> code that assumed that RCU somehow synchronizes updates, so it is
> good to call out these requirements early and often.

I'm not a docbook expert, but that's a good point. Will RFComments
on the comments when I'm done ;)

> > 
> > Does the single rcu_dereference in radix_tree_gang_lookup look OK?
> 
> Well, it does put a memory barrier in the right place on Alpha, but the
> intent would be more clear to me if the rcu_dereference() were on the
> assignments to each element of the results array.  And there would be
> no additional overhead on most architectures.
> 
> So I would much prefer the rcu_dereference() be on the assignment to
> the results array.

No problem, will change.

> > Ah indeed, that's confusing. Yes, the lookup_tag must exclude updates.
> > I guess I got too mechanical in my conversion... however, tag lookups
> > can be RCUified without a great deal of trouble, so I might take this
> > opportunity.
> 
> The tag lookups would then find anything that (1) had been tagged in a
> prior operation and (2) had not been deleted in the meantime, right?

Yes. Where "prior" is only really prior as guaranteed by some
synchronising (or otherwise dependent) operation. But I don't
need to tell you that ;)

> And the caller could hold a lock across both the tagging and tag
> lookup if greater certainty was desired.  I could imagine this sort
> of semantic being useful for deferred operations on ranges of memory,
> where new additions would have the operation implicit in creation and any
> deletions would no longer need the operation to be performed (or might
> be performed as part of deletion operation), but have not actually used
> this sort of thing myself.
> 
> So I must defer to people who have used tagging and tagged lookups
> in anger.

We use tagged lookups for writeout and synching -- slow IO related
which is why I had not converted it over to lockless. But it could
use lockless tagged lookups: eg. so long as sync catches all the
pages that we *know* to be dirty at the time of the sync, that's
fine.

> > I've tried to get that message across in the radix_tree_lookup_slot
> > comment, if they're using RCU lookups. Enough? I guess I'll add it
> > to the locking summary too.
> 
> This is what I saw in the radix_tree_lookup_slot() comment:
> 
> + *   radix_tree_lookup_slot    -    lookup a slot in a radix tree
> + *   @root:          radix tree root
> + *   @index:         index key
> + *
> + *   Lookup the slot corresponding to the position @index in the radix tree
> + *   @root. This is useful for update-if-exists operations.
> + *
> + *   This function cannot be called under rcu_read_lock, it must be
> + *   excluded from writers, as must the returned slot.
> 
> This comment does not make the RCU-protection point clear to me.
> The constraint is that if you use RCU radix-tree lookups, then you must
> use synchronize_rcu() or call_rcu() when freeing any elements removed
> from the radix tree via radix_tree_delete().
> 
> Or am I missing something here?

Well the pagecache uses pointers to struct page. struct page is never
freed, so we can forget the whole thing ;)
(mem unplug may want to free them, so in that case they would have to
synchronize_rcu).

But I guess for less specialised users, RCU would be the usual way to
go... ah I see, I must have got out of synch somewhere. I have comments
along these lines for radix_tree_lookup_slot:

+ *     This function can be called under rcu_read_lock, however it is the
+ *     duty of the caller to manage the lifetimes of the leaf nodes (ie.
+ *     they would usually be RCU protected as well). Also, dereferencing
+ *     the slot pointer would require rcu_dereference, and modifying it
+ *     would require rcu_assign_pointer.



> > > Rough notes, FYA:
> 
> You went through all these as well?  Hope you have recovered from the
> bout of insomnia!  ;-)

Pretty well gone through them.

> 
> > > o	Don't the items placed into the radix tree need to be protected
> > > 	by RCU?  If not, how does the radix tree avoid handing a pointer
> > > 	to something that has recently been removed, that the caller to
> > > 	radix_tree_delete() might have already freed?
> > 
> > Yes, they'll need to be protected by something. In the lockless pagecache,
> > they are never freed, so that isn't an issue. But other users (Ben's
> > irq patch perhaps, unless it uses the slot pointers directly) will have to
> > be careful.
> 
> Ah!  If they are never freed, there is still a need to take care when
> reusing them.  One approach is to prevent them from being reused until
> after a grace period has elapsed, another is to use revalidation checks
> after lookup.  Either way, one needs to allow for the fact that a
> lookup might hand you something that has just been deleted.

Yep, validation checks are done after lookup. The core of it, the
``page_cache_get_speculative'' function isn't that big.

> > I'll send out an incremental diff with changes.
> 
> Looking forward to it!  Maybe not as anxiously as Ben Herrenschmidt, but
> so it goes.  ;-)

Well I couldn't ask you to spend any more time on it, but if you
get interested and take a peek, that'll be a bonus for me ;)

Thanks again

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-22 16:55         ` Nick Piggin
@ 2006-06-22 17:40           ` Paul E. McKenney
  2006-06-22 18:11             ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2006-06-22 17:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney,
	Linux Kernel, Linux Memory Management

On Thu, Jun 22, 2006 at 06:55:51PM +0200, Nick Piggin wrote:
> On Thu, Jun 22, 2006 at 09:30:32AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 22, 2006 at 05:45:18PM +0200, Nick Piggin wrote:
> > > 
> > > I'll probably put a little table in radix-tree.h to summarise the
> > > API synchronisation requirements, OK?
> > 
> > Makes sense to me -- except will that feed into the docbook stuff?
> > It seems to me to be really important to get these sorts of requirements
> > included in the docbook stuff.  I have had too many people show me
> > code that assumed that RCU somehow synchronizes updates, so it is
> > good to call out these requirements early and often.
> 
> I'm not a docbook expert, but that's a good point. Will RFComments
> on the comments when I'm done ;)

Nor am I, but the trick seems to be to put the right verbiage and
format into the function comment header.

> > > Does the single rcu_dereference in radix_tree_gang_lookup look OK?
> > 
> > Well, it does put a memory barrier in the right place on Alpha, but the
> > intent would be more clear to me if the rcu_dereference() were on the
> > assignments to each element of the results array.  And there would be
> > no additional overhead on most architectures.
> > 
> > So I would much prefer the rcu_dereference() be on the assignment to
> > the results array.
> 
> No problem, will change.

Thank you!

> > > Ah indeed, that's confusing. Yes, the lookup_tag must exclude updates.
> > > I guess I got too mechanical in my conversion... however, tag lookups
> > > can be RCUified without a great deal of trouble, so I might take this
> > > opportunity.
> > 
> > The tag lookups would then find anything that (1) had been tagged in a
> > prior operation and (2) had not been deleted in the meantime, right?
> 
> Yes. Where "prior" is only really prior as guaranteed by some
> synchronising (or otherwise dependent) operation. But I don't
> need to tell you that ;)

We can all use reminding from time to time...  ;-)

> > And the caller could hold a lock across both the tagging and tag
> > lookup if greater certainty was desired.  I could imagine this sort
> > of semantic being useful for deferred operations on ranges of memory,
> > where new additions would have the operation implicit in creation and any
> > deletions would no longer need the operation to be performed (or might
> > be performed as part of deletion operation), but have not actually used
> > this sort of thing myself.
> > 
> > So I must defer to people who have used tagging and tagged lookups
> > in anger.
> 
> We use tagged lookups for writeout and synching -- slow IO related
> which is why I had not converted it over to lockless. But it could
> use lockless tagged lookups: eg. so long as sync catches all the
> pages that we *know* to be dirty at the time of the sync, that's
> fine.

Fair enough!

> > > I've tried to get that message across in the radix_tree_lookup_slot
> > > comment, if they're using RCU lookups. Enough? I guess I'll add it
> > > to the locking summary too.
> > 
> > This is what I saw in the radix_tree_lookup_slot() comment:
> > 
> > + *   radix_tree_lookup_slot    -    lookup a slot in a radix tree
> > + *   @root:          radix tree root
> > + *   @index:         index key
> > + *
> > + *   Lookup the slot corresponding to the position @index in the radix tree
> > + *   @root. This is useful for update-if-exists operations.
> > + *
> > + *   This function cannot be called under rcu_read_lock, it must be
> > + *   excluded from writers, as must the returned slot.
> > 
> > This comment does not make the RCU-protection point clear to me.
> > The constraint is that if you use RCU radix-tree lookups, then you must
> > use synchronize_rcu() or call_rcu() when freeing any elements removed
> > from the radix tree via radix_tree_delete().
> > 
> > Or am I missing something here?
> 
> Well the pagecache uses pointers to struct page. struct page is never
> freed, so we can forget the whole thing ;)
> (mem unplug may want to free them, so in that case they would have to
> synchronize_rcu).

OK, but they still need to be careful -- otherwise, they can end up
re-using a page quickly, and then someone can search with the old
identity, but end up with a reference to the page, but with its new
identity.  Sort of a reverse form of identity theft, I guess.

The usual approach to handle this is to either:

1.	wait a grace period before reusing a page after it has been
	deleted.

2.	recheck the page's identity after each lookup.

>From your comments below, looks like you do #2.

> But I guess for less specialised users, RCU would be the usual way to
> go... ah I see, I must have got out of synch somewhere. I have comments
> along these lines for radix_tree_lookup_slot:
> 
> + *     This function can be called under rcu_read_lock, however it is the
> + *     duty of the caller to manage the lifetimes of the leaf nodes (ie.
> + *     they would usually be RCU protected as well). Also, dereferencing
> + *     the slot pointer would require rcu_dereference, and modifying it
> + *     would require rcu_assign_pointer.
> 
> > > > Rough notes, FYA:
> > 
> > You went through all these as well?  Hope you have recovered from the
> > bout of insomnia!  ;-)
> 
> Pretty well gone through them.

;-)

> > > > o	Don't the items placed into the radix tree need to be protected
> > > > 	by RCU?  If not, how does the radix tree avoid handing a pointer
> > > > 	to something that has recently been removed, that the caller to
> > > > 	radix_tree_delete() might have already freed?
> > > 
> > > Yes, they'll need to be protected by something. In the lockless pagecache,
> > > they are never freed, so that isn't an issue. But other users (Ben's
> > > irq patch perhaps, unless it uses the slot pointers directly) will have to
> > > be careful.
> > 
> > Ah!  If they are never freed, there is still a need to take care when
> > reusing them.  One approach is to prevent them from being reused until
> > after a grace period has elapsed, another is to use revalidation checks
> > after lookup.  Either way, one needs to allow for the fact that a
> > lookup might hand you something that has just been deleted.
> 
> Yep, validation checks are done after lookup. The core of it, the
> ``page_cache_get_speculative'' function isn't that big.

Good enough!

							Thanx, Paul

> > > I'll send out an incremental diff with changes.
> > 
> > Looking forward to it!  Maybe not as anxiously as Ben Herrenschmidt, but
> > so it goes.  ;-)
> 
> Well I couldn't ask you to spend any more time on it, but if you
> get interested and take a peek, that'll be a bonus for me ;)
> 
> Thanks again
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-22 17:40           ` Paul E. McKenney
@ 2006-06-22 18:11             ` Nick Piggin
  2006-06-23  7:09               ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-22 18:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Benjamin Herrenschmidt, Paul McKenney,
	Linux Kernel, Linux Memory Management

On Thu, Jun 22, 2006 at 10:40:57AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 22, 2006 at 06:55:51PM +0200, Nick Piggin wrote:
> > 
> > No problem, will change.
> 
> Thank you!

OK, and with that I believe I've covered all your concerns.

Attached is the incremental patch (plus a little bit of fuzz
that's gone to Andrew). The big items are:

- documentation, clarification of comments
- tag lookups are now RCU safe (tested with harness)
- cleanups of various misuses of rcu_ API that Paul spotted
- thought I might put in a copyright -- is this OK?

Andrew, please apply.

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -1,6 +1,7 @@
 /*
  * Copyright (C) 2001 Momchil Velikov
  * Portions Copyright (C) 2001 Christoph Hellwig
+ * Copyright (C) 2006 Nick Piggin
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as
@@ -25,6 +26,33 @@
 #include <linux/kernel.h>
 #include <linux/rcupdate.h>
 
+/*
+ * A direct pointer (root->rnode pointing directly to a data item,
+ * rather than another radix_tree_node) is signalled by the low bit
+ * set in the root->rnode pointer.
+ *
+ * In this case root->height is also NULL, but the direct pointer tests are
+ * needed for RCU lookups when root->height is unreliable.
+ */
+#define RADIX_TREE_DIRECT_PTR	1
+
+static inline void *radix_tree_ptr_to_direct(void *ptr)
+{
+	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+}
+
+static inline void *radix_tree_direct_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+}
+
+static inline int radix_tree_is_direct_ptr(void *ptr)
+{
+	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+}
+
+/*** radix-tree API starts here ***/
+
 #define RADIX_TREE_MAX_TAGS 2
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
@@ -50,29 +78,62 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
-#define RADIX_TREE_DIRECT_PTR	1
-
-static inline void *radix_tree_ptr_to_direct(void *ptr)
-{
-	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
-}
-
-static inline void *radix_tree_direct_to_ptr(void *ptr)
-{
-	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
-}
-
-static inline int radix_tree_is_direct_ptr(void *ptr)
-{
-	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
-}
+/**
+ * Radix-tree synchronization
+ *
+ * The radix-tree API requires that users provide all synchronisation (with
+ * specific exceptions, noted below).
+ *
+ * Synchronization of access to the data items being stored in the tree, and
+ * management of their lifetimes must be completely managed by API users.
+ *
+ * For API usage, in general,
+ * - any function _modifying_ the the tree or tags (inserting or deleting
+ *   items, setting or clearing tags must exclude other modifications, and
+ *   exclude any functions reading the tree.
+ * - any function _reading_ the the tree or tags (looking up items or tags,
+ *   gang lookups) must exclude modifications to the tree, but may occur
+ *   concurrently with other readers.
+ *
+ * The notable exceptions to this rule are the following functions:
+ * radix_tree_lookup
+ * radix_tree_tag_get
+ * radix_tree_gang_lookup
+ * radix_tree_gang_lookup_tag
+ * radix_tree_tagged
+ *
+ * The first 4 functions are able to be called locklessly, using RCU. The
+ * caller must ensure calls to these functions are made within rcu_read_lock()
+ * regions. Other readers (lock-free or otherwise) and modifications may be
+ * running concurrently.
+ *
+ * It is still required that the caller manage the synchronization and lifetimes
+ * of the items. So if RCU lock-free lookups are used, typically this would mean
+ * that the items have their own locks, or are amenable to lock-free access; and
+ * that the items are freed by RCU (or only freed after having been deleted from
+ * the radix tree *and* a synchronize_rcu() grace period).
+ *
+ * (Note, rcu_assign_pointer and rcu_dereference are not needed to control
+ * access to data items when inserting into or looking up from the radix tree)
+ *
+ * radix_tree_tagged is able to be called without locking or RCU.
+ */
 
+/**
+ * radix_tree_deref_slot	- dereference a slot
+ * @pslot:	pointer to slot, returned by radix_tree_lookup_slot
+ * @returns:	item that was stored in that slot.
+ */
 static inline void *radix_tree_deref_slot(void *pslot)
 {
 	void *slot = *(void **)pslot;
 	return rcu_dereference(radix_tree_direct_to_ptr(slot));
 }
-
+/**
+ * radix_tree_replace_slot	- replace item in a slot
+ * @pslot:	pointer to slot, returned by radix_tree_lookup_slot
+ * @item:	new item to store in the slot.
+ */
 static inline void radix_tree_replace_slot(void *pslot, void *item)
 {
 	void *slot = *(void **)pslot;
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -2,6 +2,7 @@
  * Copyright (C) 2001 Momchil Velikov
  * Portions Copyright (C) 2001 Christoph Hellwig
  * Copyright (C) 2005 SGI, Christoph Lameter <clameter@sgi.com>
+ * Copyright (C) 2006 Nick Piggin
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as
@@ -341,7 +342,7 @@ void **radix_tree_lookup_slot(struct rad
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
 
-	node = rcu_dereference(root->rnode);
+	node = root->rnode;
 	if (node == NULL)
 		return NULL;
 
@@ -360,7 +361,7 @@ void **radix_tree_lookup_slot(struct rad
 	do {
 		slot = (struct radix_tree_node **)
 			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
-		node = rcu_dereference(*slot);
+		node = *slot;
 		if (node == NULL)
 			return NULL;
 
@@ -547,27 +548,30 @@ int radix_tree_tag_get(struct radix_tree
 			unsigned long index, unsigned int tag)
 {
 	unsigned int height, shift;
-	struct radix_tree_node *slot;
+	struct radix_tree_node *node;
 	int saw_unset_tag = 0;
 
-	height = root->height;
-	if (index > radix_tree_maxindex(height))
-		return 0;
-
 	/* check the root's tag bit */
 	if (!root_tag_get(root, tag))
 		return 0;
 
-	if (height == 0)
-		return 1;
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return 0;
+
+	if (radix_tree_is_direct_ptr(node))
+		return (index == 0);
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return 0;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	slot = root->rnode;
 
 	for ( ; ; ) {
 		int offset;
 
-		if (slot == NULL)
+		if (node == NULL)
 			return 0;
 
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
@@ -576,15 +580,15 @@ int radix_tree_tag_get(struct radix_tree
 		 * This is just a debug check.  Later, we can bale as soon as
 		 * we see an unset tag.
 		 */
-		if (!tag_get(slot, tag, offset))
+		if (!tag_get(node, tag, offset))
 			saw_unset_tag = 1;
 		if (height == 1) {
-			int ret = tag_get(slot, tag, offset);
+			int ret = tag_get(node, tag, offset);
 
 			BUG_ON(ret && saw_unset_tag);
 			return ret;
 		}
-		slot = slot->slots[offset];
+		node = rcu_dereference(node->slots[offset]);
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
@@ -606,12 +610,9 @@ __lookup(struct radix_tree_node *slot, v
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	for ( ; height > 1; height--) {
-		struct radix_tree_node *__s;
-
 		i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 		for (;;) {
-			__s = rcu_dereference(slot->slots[i]);
-			if (__s != NULL)
+			if (slot->slots[i] != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
@@ -623,7 +624,7 @@ __lookup(struct radix_tree_node *slot, v
 		}
 
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = __s;
+		slot = rcu_dereference(slot->slots[i]);
 	}
 
 	/* Bottom level: grab some items */
@@ -632,7 +633,7 @@ __lookup(struct radix_tree_node *slot, v
 		index++;
 		node = slot->slots[i];
 		if (node) {
-			results[nr_found++] = node;
+			results[nr_found++] = rcu_dereference(node);
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -677,9 +678,9 @@ radix_tree_gang_lookup(struct radix_tree
 	if (radix_tree_is_direct_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		results[0] = radix_tree_direct_to_ptr(node);
-		ret = 1;
-		goto out;
+		node = radix_tree_direct_to_ptr(node);
+		results[0] = rcu_dereference(node);
+		return 1;
 	}
 
 	max_index = radix_tree_maxindex(node->height);
@@ -699,9 +700,6 @@ radix_tree_gang_lookup(struct radix_tree
 		cur_index = next_index;
 	}
 
-out:
-	(void)rcu_dereference(results); /* single barrier for all results */
-
 	return ret;
 }
 EXPORT_SYMBOL(radix_tree_gang_lookup);
@@ -715,26 +713,27 @@ __lookup_tag(struct radix_tree_node *slo
 	unsigned int max_items, unsigned long *next_index, unsigned int tag)
 {
 	unsigned int nr_found = 0;
-	unsigned int shift;
-	unsigned int height = slot->height;
+	unsigned int shift, height;
 
-	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	height = slot->height;
+	if (height == 0)
+		goto out;
+	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	while (height > 0) {
-		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
+		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 
-		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
-			if (tag_get(slot, tag, i)) {
-				BUG_ON(slot->slots[i] == NULL);
+		for (;;) {
+			if (tag_get(slot, tag, i))
 				break;
-			}
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
 			if (index == 0)
 				goto out;	/* 32-bit wraparound */
+			i++;
+			if (i == RADIX_TREE_MAP_SIZE)
+				goto out;
 		}
-		if (i == RADIX_TREE_MAP_SIZE)
-			goto out;
 		height--;
 		if (height == 0) {	/* Bottom level: grab some items */
 			unsigned long j = index & RADIX_TREE_MAP_MASK;
@@ -742,15 +741,19 @@ __lookup_tag(struct radix_tree_node *slo
 			for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
 				index++;
 				if (tag_get(slot, tag, j)) {
-					BUG_ON(slot->slots[j] == NULL);
-					results[nr_found++] = slot->slots[j];
-					if (nr_found == max_items)
-						goto out;
+					struct radix_tree_node *node = slot->slots[j];
+					if (node) {
+						results[nr_found++] = rcu_dereference(node);
+						if (nr_found == max_items)
+							goto out;
+					}
 				}
 			}
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = slot->slots[i];
+		slot = rcu_dereference(slot->slots[i]);
+		if (slot == NULL);
+			break;
 	}
 out:
 	*next_index = index;
@@ -778,7 +781,7 @@ radix_tree_gang_lookup_tag(struct radix_
 	struct radix_tree_node *node;
 	unsigned long max_index;
 	unsigned long cur_index = first_index;
-	unsigned int ret = 0;
+	unsigned int ret;
 
 	/* check the root's tag bit */
 	if (!root_tag_get(root, tag))
@@ -786,18 +789,19 @@ radix_tree_gang_lookup_tag(struct radix_
 
 	node = rcu_dereference(root->rnode);
 	if (!node)
-		return ret;
+		return 0;
 
 	if (radix_tree_is_direct_ptr(node)) {
 		if (first_index > 0)
 			return 0;
 		node = radix_tree_direct_to_ptr(node);
-		results[0] = node;
+		results[0] = rcu_dereference(node);
 		return 1;
 	}
 
 	max_index = radix_tree_maxindex(node->height);
 
+	ret = 0;
 	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
@@ -811,6 +815,7 @@ radix_tree_gang_lookup_tag(struct radix_
 			break;
 		cur_index = next_index;
 	}
+
 	return ret;
 }
 EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
@@ -829,9 +834,11 @@ static inline void radix_tree_shrink(str
 		void *newptr;
 
 		/*
-		 * this doesn't need an rcu_assign_pointer, because
-		 * we aren't touching the object that to_free->slots[0]
-		 * points to.
+		 * We don't need rcu_assign_pointer(), since we are simply
+		 * moving the node from one part of the tree to another. If
+		 * it was safe to dereference the old pointer to it
+		 * (to_free->slots[0]), it will be safe to dereference the new
+		 * one (root->rnode).
 		 */
 		newptr = to_free->slots[0];
 		if (root->height == 1)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-22 18:11             ` Nick Piggin
@ 2006-06-23  7:09               ` Andrew Morton
  2006-06-23  8:03                 ` Nick Piggin
  2006-06-23  8:39                 ` Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2006-06-23  7:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: paulmck, benh, Paul.McKenney, linux-kernel, linux-mm

On Thu, 22 Jun 2006 20:11:12 +0200
Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Jun 22, 2006 at 10:40:57AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 22, 2006 at 06:55:51PM +0200, Nick Piggin wrote:
> > > 
> > > No problem, will change.
> > 
> > Thank you!
> 
> OK, and with that I believe I've covered all your concerns.
> 
> Attached is the incremental patch (plus a little bit of fuzz
> that's gone to Andrew). The big items are:
> 
> - documentation, clarification of comments
> - tag lookups are now RCU safe (tested with harness)
> - cleanups of various misuses of rcu_ API that Paul spotted
> - thought I might put in a copyright -- is this OK?
> 
> Andrew, please apply.

Freeing unused kernel memory: 316k freed
Write protecting the kernel read-only data: 384k
No module found in object
No module found in object
No module found in object
No module found in object
input: AT Translated Set 2 keyboard as /class/input/input0
No module found in object
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
BUG: NMI Watchdog detected LOCKUP on CPU0, eip c0264345, registers:
CPU:    0
EIP is at radix_tree_gang_lookup_tag+0x105/0x1a0
eax: ffffffff   ebx: 00000040   ecx: ffffffc0   edx: 00000007
esi: e701e9d8   edi: 000001c0   ebp: e6fbddd8   esp: e6fbdda8
ds: 007b   es: 007b   ss: 0068
Process fsck.ext3 (pid: 1565, ti=e6fbc000 task=c1fbcb90 task.ti=e6fbc000)
Stack: e77f2dc4 e701e9d8 e701e9d8 00000002 00000fff 00000000 e701e8c8 e6fbde60 
       0000000e c1c6c52c e6fbde60 c1c6c538 e6fbde00 c014b68f c1c6c52c e6fbde60 
       00000000 0000000e 00000000 e6fbde58 00000000 00000001 e6fbde20 c0155631 
Call Trace:
Code: 89 fa 8d 4c 09 fa d3 e3 d3 ea 89 d9 83 e2 3f f7 d9 eb 13 8d 76 00 89 f8 89 df 21 c8 01 c7 74 26 42 83 fa 40 74 95 0f a3 16 19 c0 <85> c0 74 e7 83 7d dc 01 74 3a 31 f6 89 75 f0 e9 6e ff ff ff c7 
console shuts up ...


Not sure why, either.  It all looks like an equivalent transformation to
me.

fwiw, here's what I tested:


From: Nick Piggin <npiggin@suse.de>

- documentation, clarification of comments
- tag lookups are now RCU safe (tested with harness)
- cleanups of various misuses of rcu_ API that Paul spotted
- thought I might put in a copyright -- is this OK?

Cc: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 include/linux/radix-tree.h |   95 +++++++++++++++++++++++++++-----
 lib/radix-tree.c           |  101 ++++++++++++++++++-----------------
 2 files changed, 132 insertions(+), 64 deletions(-)

diff -puN include/linux/radix-tree.h~adix-tree-rcu-lockless-readside-update include/linux/radix-tree.h
--- a/include/linux/radix-tree.h~adix-tree-rcu-lockless-readside-update
+++ a/include/linux/radix-tree.h
@@ -1,6 +1,7 @@
 /*
  * Copyright (C) 2001 Momchil Velikov
  * Portions Copyright (C) 2001 Christoph Hellwig
+ * Copyright (C) 2006 Nick Piggin
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as
@@ -25,6 +26,33 @@
 #include <linux/kernel.h>
 #include <linux/rcupdate.h>
 
+/*
+ * A direct pointer (root->rnode pointing directly to a data item,
+ * rather than another radix_tree_node) is signalled by the low bit
+ * set in the root->rnode pointer.
+ *
+ * In this case root->height is also NULL, but the direct pointer tests are
+ * needed for RCU lookups when root->height is unreliable.
+ */
+#define RADIX_TREE_DIRECT_PTR	1
+
+static inline void *radix_tree_ptr_to_direct(void *ptr)
+{
+	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+}
+
+static inline void *radix_tree_direct_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+}
+
+static inline int radix_tree_is_direct_ptr(void *ptr)
+{
+	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+}
+
+/*** radix-tree API starts here ***/
+
 #define RADIX_TREE_MAX_TAGS 2
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
@@ -50,29 +78,62 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
-#define RADIX_TREE_DIRECT_PTR	1
-
-static inline void *radix_tree_ptr_to_direct(void *ptr)
-{
-	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
-}
-
-static inline void *radix_tree_direct_to_ptr(void *ptr)
-{
-	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
-}
-
-static inline int radix_tree_is_direct_ptr(void *ptr)
-{
-	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
-}
+/**
+ * Radix-tree synchronization
+ *
+ * The radix-tree API requires that users provide all synchronisation (with
+ * specific exceptions, noted below).
+ *
+ * Synchronization of access to the data items being stored in the tree, and
+ * management of their lifetimes must be completely managed by API users.
+ *
+ * For API usage, in general,
+ * - any function _modifying_ the the tree or tags (inserting or deleting
+ *   items, setting or clearing tags must exclude other modifications, and
+ *   exclude any functions reading the tree.
+ * - any function _reading_ the the tree or tags (looking up items or tags,
+ *   gang lookups) must exclude modifications to the tree, but may occur
+ *   concurrently with other readers.
+ *
+ * The notable exceptions to this rule are the following functions:
+ * radix_tree_lookup
+ * radix_tree_tag_get
+ * radix_tree_gang_lookup
+ * radix_tree_gang_lookup_tag
+ * radix_tree_tagged
+ *
+ * The first 4 functions are able to be called locklessly, using RCU. The
+ * caller must ensure calls to these functions are made within rcu_read_lock()
+ * regions. Other readers (lock-free or otherwise) and modifications may be
+ * running concurrently.
+ *
+ * It is still required that the caller manage the synchronization and lifetimes
+ * of the items. So if RCU lock-free lookups are used, typically this would mean
+ * that the items have their own locks, or are amenable to lock-free access; and
+ * that the items are freed by RCU (or only freed after having been deleted from
+ * the radix tree *and* a synchronize_rcu() grace period).
+ *
+ * (Note, rcu_assign_pointer and rcu_dereference are not needed to control
+ * access to data items when inserting into or looking up from the radix tree)
+ *
+ * radix_tree_tagged is able to be called without locking or RCU.
+ */
 
+/**
+ * radix_tree_deref_slot	- dereference a slot
+ * @pslot:	pointer to slot, returned by radix_tree_lookup_slot
+ * @returns:	item that was stored in that slot.
+ */
 static inline void *radix_tree_deref_slot(void *pslot)
 {
 	void *slot = *(void **)pslot;
 	return rcu_dereference(radix_tree_direct_to_ptr(slot));
 }
-
+/**
+ * radix_tree_replace_slot	- replace item in a slot
+ * @pslot:	pointer to slot, returned by radix_tree_lookup_slot
+ * @item:	new item to store in the slot.
+ */
 static inline void radix_tree_replace_slot(void *pslot, void *item)
 {
 	void *slot = *(void **)pslot;
diff -puN lib/radix-tree.c~adix-tree-rcu-lockless-readside-update lib/radix-tree.c
--- a/lib/radix-tree.c~adix-tree-rcu-lockless-readside-update
+++ a/lib/radix-tree.c
@@ -2,6 +2,7 @@
  * Copyright (C) 2001 Momchil Velikov
  * Portions Copyright (C) 2001 Christoph Hellwig
  * Copyright (C) 2005 SGI, Christoph Lameter <clameter@sgi.com>
+ * Copyright (C) 2006 Nick Piggin
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as
@@ -341,7 +342,7 @@ void **radix_tree_lookup_slot(struct rad
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
 
-	node = rcu_dereference(root->rnode);
+	node = root->rnode;
 	if (node == NULL)
 		return NULL;
 
@@ -360,7 +361,7 @@ void **radix_tree_lookup_slot(struct rad
 	do {
 		slot = (struct radix_tree_node **)
 			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
-		node = rcu_dereference(*slot);
+		node = *slot;
 		if (node == NULL)
 			return NULL;
 
@@ -546,27 +547,30 @@ int radix_tree_tag_get(struct radix_tree
 			unsigned long index, unsigned int tag)
 {
 	unsigned int height, shift;
-	struct radix_tree_node *slot;
+	struct radix_tree_node *node;
 	int saw_unset_tag = 0;
 
-	height = root->height;
-	if (index > radix_tree_maxindex(height))
-		return 0;
-
 	/* check the root's tag bit */
 	if (!root_tag_get(root, tag))
 		return 0;
 
-	if (height == 0)
-		return 1;
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return 0;
+
+	if (radix_tree_is_direct_ptr(node))
+		return (index == 0);
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return 0;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	slot = root->rnode;
 
 	for ( ; ; ) {
 		int offset;
 
-		if (slot == NULL)
+		if (node == NULL)
 			return 0;
 
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
@@ -575,15 +579,15 @@ int radix_tree_tag_get(struct radix_tree
 		 * This is just a debug check.  Later, we can bale as soon as
 		 * we see an unset tag.
 		 */
-		if (!tag_get(slot, tag, offset))
+		if (!tag_get(node, tag, offset))
 			saw_unset_tag = 1;
 		if (height == 1) {
-			int ret = tag_get(slot, tag, offset);
+			int ret = tag_get(node, tag, offset);
 
 			BUG_ON(ret && saw_unset_tag);
 			return ret;
 		}
-		slot = slot->slots[offset];
+		node = rcu_dereference(node->slots[offset]);
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
@@ -605,12 +609,9 @@ __lookup(struct radix_tree_node *slot, v
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	for ( ; height > 1; height--) {
-		struct radix_tree_node *__s;
-
 		i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 		for (;;) {
-			__s = rcu_dereference(slot->slots[i]);
-			if (__s != NULL)
+			if (slot->slots[i] != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
@@ -622,7 +623,7 @@ __lookup(struct radix_tree_node *slot, v
 		}
 
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = __s;
+		slot = rcu_dereference(slot->slots[i]);
 	}
 
 	/* Bottom level: grab some items */
@@ -631,7 +632,7 @@ __lookup(struct radix_tree_node *slot, v
 		index++;
 		node = slot->slots[i];
 		if (node) {
-			results[nr_found++] = node;
+			results[nr_found++] = rcu_dereference(node);
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -676,9 +677,9 @@ radix_tree_gang_lookup(struct radix_tree
 	if (radix_tree_is_direct_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		results[0] = radix_tree_direct_to_ptr(node);
-		ret = 1;
-		goto out;
+		node = radix_tree_direct_to_ptr(node);
+		results[0] = rcu_dereference(node);
+		return 1;
 	}
 
 	max_index = radix_tree_maxindex(node->height);
@@ -698,9 +699,6 @@ radix_tree_gang_lookup(struct radix_tree
 		cur_index = next_index;
 	}
 
-out:
-	(void)rcu_dereference(results); /* single barrier for all results */
-
 	return ret;
 }
 EXPORT_SYMBOL(radix_tree_gang_lookup);
@@ -714,26 +712,27 @@ __lookup_tag(struct radix_tree_node *slo
 	unsigned int max_items, unsigned long *next_index, unsigned int tag)
 {
 	unsigned int nr_found = 0;
-	unsigned int shift;
-	unsigned int height = slot->height;
+	unsigned int shift, height;
 
-	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	height = slot->height;
+	if (height == 0)
+		goto out;
+	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	while (height > 0) {
-		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
+		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 
-		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
-			if (tag_get(slot, tag, i)) {
-				BUG_ON(slot->slots[i] == NULL);
+		for (;;) {
+			if (tag_get(slot, tag, i))
 				break;
-			}
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
 			if (index == 0)
 				goto out;	/* 32-bit wraparound */
+			i++;
+			if (i == RADIX_TREE_MAP_SIZE)
+				goto out;
 		}
-		if (i == RADIX_TREE_MAP_SIZE)
-			goto out;
 		height--;
 		if (height == 0) {	/* Bottom level: grab some items */
 			unsigned long j = index & RADIX_TREE_MAP_MASK;
@@ -741,15 +740,19 @@ __lookup_tag(struct radix_tree_node *slo
 			for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
 				index++;
 				if (tag_get(slot, tag, j)) {
-					BUG_ON(slot->slots[j] == NULL);
-					results[nr_found++] = slot->slots[j];
-					if (nr_found == max_items)
-						goto out;
+					struct radix_tree_node *node = slot->slots[j];
+					if (node) {
+						results[nr_found++] = rcu_dereference(node);
+						if (nr_found == max_items)
+							goto out;
+					}
 				}
 			}
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = slot->slots[i];
+		slot = rcu_dereference(slot->slots[i]);
+		if (slot == NULL);
+			break;
 	}
 out:
 	*next_index = index;
@@ -777,7 +780,7 @@ radix_tree_gang_lookup_tag(struct radix_
 	struct radix_tree_node *node;
 	unsigned long max_index;
 	unsigned long cur_index = first_index;
-	unsigned int ret = 0;
+	unsigned int ret;
 
 	/* check the root's tag bit */
 	if (!root_tag_get(root, tag))
@@ -785,18 +788,19 @@ radix_tree_gang_lookup_tag(struct radix_
 
 	node = rcu_dereference(root->rnode);
 	if (!node)
-		return ret;
+		return 0;
 
 	if (radix_tree_is_direct_ptr(node)) {
 		if (first_index > 0)
 			return 0;
 		node = radix_tree_direct_to_ptr(node);
-		results[0] = node;
+		results[0] = rcu_dereference(node);
 		return 1;
 	}
 
 	max_index = radix_tree_maxindex(node->height);
 
+	ret = 0;
 	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
@@ -810,6 +814,7 @@ radix_tree_gang_lookup_tag(struct radix_
 			break;
 		cur_index = next_index;
 	}
+
 	return ret;
 }
 EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
@@ -828,9 +833,11 @@ static inline void radix_tree_shrink(str
 		void *newptr;
 
 		/*
-		 * this doesn't need an rcu_assign_pointer, because
-		 * we aren't touching the object that to_free->slots[0]
-		 * points to.
+		 * We don't need rcu_assign_pointer(), since we are simply
+		 * moving the node from one part of the tree to another. If
+		 * it was safe to dereference the old pointer to it
+		 * (to_free->slots[0]), it will be safe to dereference the new
+		 * one (root->rnode).
 		 */
 		newptr = to_free->slots[0];
 		if (root->height == 1)
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-23  7:09               ` Andrew Morton
@ 2006-06-23  8:03                 ` Nick Piggin
  2006-06-23  8:39                 ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-23  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, paulmck, benh, Paul.McKenney, linux-kernel, linux-mm

Andrew Morton wrote:
> On Thu, 22 Jun 2006 20:11:12 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> 
>>On Thu, Jun 22, 2006 at 10:40:57AM -0700, Paul E. McKenney wrote:
>>
>>>On Thu, Jun 22, 2006 at 06:55:51PM +0200, Nick Piggin wrote:
>>>
>>>>No problem, will change.
>>>
>>>Thank you!
>>
>>OK, and with that I believe I've covered all your concerns.
>>
>>Attached is the incremental patch (plus a little bit of fuzz
>>that's gone to Andrew). The big items are:
>>
>>- documentation, clarification of comments
>>- tag lookups are now RCU safe (tested with harness)
>>- cleanups of various misuses of rcu_ API that Paul spotted
>>- thought I might put in a copyright -- is this OK?
>>
>>Andrew, please apply.
> 
> 
> Freeing unused kernel memory: 316k freed
> Write protecting the kernel read-only data: 384k
> No module found in object
> No module found in object
> No module found in object
> No module found in object
> input: AT Translated Set 2 keyboard as /class/input/input0
> No module found in object
> EXT3-fs: INFO: recovery required on readonly filesystem.
> EXT3-fs: write access will be enabled during recovery.
> kjournald starting.  Commit interval 5 seconds
> EXT3-fs: recovery complete.
> EXT3-fs: mounted filesystem with ordered data mode.
> BUG: NMI Watchdog detected LOCKUP on CPU0, eip c0264345, registers:
> CPU:    0
> EIP is at radix_tree_gang_lookup_tag+0x105/0x1a0
> eax: ffffffff   ebx: 00000040   ecx: ffffffc0   edx: 00000007
> esi: e701e9d8   edi: 000001c0   ebp: e6fbddd8   esp: e6fbdda8
> ds: 007b   es: 007b   ss: 0068
> Process fsck.ext3 (pid: 1565, ti=e6fbc000 task=c1fbcb90 task.ti=e6fbc000)
> Stack: e77f2dc4 e701e9d8 e701e9d8 00000002 00000fff 00000000 e701e8c8 e6fbde60 
>        0000000e c1c6c52c e6fbde60 c1c6c538 e6fbde00 c014b68f c1c6c52c e6fbde60 
>        00000000 0000000e 00000000 e6fbde58 00000000 00000001 e6fbde20 c0155631 
> Call Trace:
> Code: 89 fa 8d 4c 09 fa d3 e3 d3 ea 89 d9 83 e2 3f f7 d9 eb 13 8d 76 00 89 f8 89 df 21 c8 01 c7 74 26 42 83 fa 40 74 95 0f a3 16 19 c0 <85> c0 74 e7 83 7d dc 01 74 3a 31 f6 89 75 f0 e9 6e ff ff ff c7 
> console shuts up ...
> 
> 
> Not sure why, either.  It all looks like an equivalent transformation to
> me.

Ahh crap, sorry.

I'll see if I can work it out. Will make another good addition to
rtth (which I'm going to have to sort out and get synched up with
you soon).

> 
> fwiw, here's what I tested:

Thanks.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-23  7:09               ` Andrew Morton
  2006-06-23  8:03                 ` Nick Piggin
@ 2006-06-23  8:39                 ` Nick Piggin
  2006-06-23  8:41                   ` Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-23  8:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, paulmck, benh, Paul.McKenney, linux-kernel, linux-mm

Andrew Morton wrote:
> On Thu, 22 Jun 2006 20:11:12 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> 
>>On Thu, Jun 22, 2006 at 10:40:57AM -0700, Paul E. McKenney wrote:
>>
>>>On Thu, Jun 22, 2006 at 06:55:51PM +0200, Nick Piggin wrote:
>>>
>>>>No problem, will change.
>>>
>>>Thank you!
>>
>>OK, and with that I believe I've covered all your concerns.
>>
>>Attached is the incremental patch (plus a little bit of fuzz
>>that's gone to Andrew). The big items are:
>>
>>- documentation, clarification of comments
>>- tag lookups are now RCU safe (tested with harness)
>>- cleanups of various misuses of rcu_ API that Paul spotted
>>- thought I might put in a copyright -- is this OK?
>>
>>Andrew, please apply.
> 
> 
> Freeing unused kernel memory: 316k freed
> Write protecting the kernel read-only data: 384k
> No module found in object
> No module found in object
> No module found in object
> No module found in object
> input: AT Translated Set 2 keyboard as /class/input/input0
> No module found in object
> EXT3-fs: INFO: recovery required on readonly filesystem.
> EXT3-fs: write access will be enabled during recovery.
> kjournald starting.  Commit interval 5 seconds
> EXT3-fs: recovery complete.
> EXT3-fs: mounted filesystem with ordered data mode.
> BUG: NMI Watchdog detected LOCKUP on CPU0, eip c0264345, registers:
> CPU:    0
> EIP is at radix_tree_gang_lookup_tag+0x105/0x1a0
> eax: ffffffff   ebx: 00000040   ecx: ffffffc0   edx: 00000007
> esi: e701e9d8   edi: 000001c0   ebp: e6fbddd8   esp: e6fbdda8
> ds: 007b   es: 007b   ss: 0068
> Process fsck.ext3 (pid: 1565, ti=e6fbc000 task=c1fbcb90 task.ti=e6fbc000)
> Stack: e77f2dc4 e701e9d8 e701e9d8 00000002 00000fff 00000000 e701e8c8 e6fbde60 
>        0000000e c1c6c52c e6fbde60 c1c6c538 e6fbde00 c014b68f c1c6c52c e6fbde60 
>        00000000 0000000e 00000000 e6fbde58 00000000 00000001 e6fbde20 c0155631 
> Call Trace:
> Code: 89 fa 8d 4c 09 fa d3 e3 d3 ea 89 d9 83 e2 3f f7 d9 eb 13 8d 76 00 89 f8 89 df 21 c8 01 c7 74 26 42 83 fa 40 74 95 0f a3 16 19 c0 <85> c0 74 e7 83 7d dc 01 74 3a 31 f6 89 75 f0 e9 6e ff ff ff c7 
> console shuts up ...
> 
> 
> Not sure why, either.  It all looks like an equivalent transformation to
> me.
> 
> fwiw, here's what I tested:


Arggh, line 755 has an extra semicolon. I caught and fixed this in the rtth
tree, but obviously forgot to transfer it over.

> @@ -741,15 +740,19 @@ __lookup_tag(struct radix_tree_node *slo
>  			for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
>  				index++;
>  				if (tag_get(slot, tag, j)) {
> -					BUG_ON(slot->slots[j] == NULL);
> -					results[nr_found++] = slot->slots[j];
> -					if (nr_found == max_items)
> -						goto out;
> +					struct radix_tree_node *node = slot->slots[j];
> +					if (node) {
> +						results[nr_found++] = rcu_dereference(node);
> +						if (nr_found == max_items)
> +							goto out;
> +					}
>  				}
>  			}
>  		}
>  		shift -= RADIX_TREE_MAP_SHIFT;
> -		slot = slot->slots[i];
> +		slot = rcu_dereference(slot->slots[i]);
> +		if (slot == NULL);
> +			break;
>  	}

                          ^^^^^^^^

Up there.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 3/3] radix-tree: RCU lockless readside
  2006-06-23  8:39                 ` Nick Piggin
@ 2006-06-23  8:41                   ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-23  8:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, paulmck, benh, Paul.McKenney, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 325 bytes --]

Nick Piggin wrote:

>>          shift -= RADIX_TREE_MAP_SHIFT;
>> -        slot = slot->slots[i];
>> +        slot = rcu_dereference(slot->slots[i]);
>> +        if (slot == NULL);
>> +            break;
>>      }
> 
> 
>                          ^^^^^^^^
> 
> Up there.
> 

And here's the patch.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: radix-tree-paul-review-fix.patch --]
[-- Type: text/plain, Size: 376 bytes --]

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -752,7 +752,7 @@ __lookup_tag(struct radix_tree_node *slo
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
 		slot = rcu_dereference(slot->slots[i]);
-		if (slot == NULL);
+		if (slot == NULL)
 			break;
 	}
 out:

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2006-06-23  8:41 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-20 14:48 [patch 0/3] 2.6.17 radix-tree: updates and lockless Nick Piggin
2006-06-20 14:48 ` [patch 1/3] radix-tree: direct data Nick Piggin
2006-06-20 14:48 ` [patch 2/3] radix-tree: small Nick Piggin
2006-06-20 14:48 ` [patch 3/3] radix-tree: RCU lockless readside Nick Piggin
2006-06-22  1:49   ` Paul E. McKenney
2006-06-22 15:45     ` Nick Piggin
2006-06-22 16:30       ` Paul E. McKenney
2006-06-22 16:55         ` Nick Piggin
2006-06-22 17:40           ` Paul E. McKenney
2006-06-22 18:11             ` Nick Piggin
2006-06-23  7:09               ` Andrew Morton
2006-06-23  8:03                 ` Nick Piggin
2006-06-23  8:39                 ` Nick Piggin
2006-06-23  8:41                   ` Nick Piggin
2006-06-20 22:08 ` [patch 0/3] 2.6.17 radix-tree: updates and lockless Benjamin Herrenschmidt
2006-06-20 22:35 ` Andrew Morton
2006-06-20 23:09   ` Benjamin Herrenschmidt
2006-06-20 23:30     ` Andrew Morton
2006-06-20 23:50       ` Benjamin Herrenschmidt
2006-06-21  0:34         ` Christoph Lameter
2006-06-21  0:47           ` Benjamin Herrenschmidt
2006-06-21  1:07             ` Christoph Lameter
2006-06-21  1:33               ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).