* [RFC] lockless radix tree readside
@ 2005-12-06  1:40 Nick Piggin
  2005-12-06  3:11   ` David S. Miller, Nick Piggin
  2005-12-06 15:53 ` Joe Seigh
  0 siblings, 2 replies; 8+ messages in thread
From: Nick Piggin @ 2005-12-06  1:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Linux Memory Management,
	Paul McKenney, WU Fengguang

[-- Attachment #1: Type: text/plain, Size: 706 bytes --]

The following patch against recent -mm kernels implements lockless
radix tree lookups using RCU. No users of this new facility yet,
but it is a requirement for lockless pagecache.

I have recently added (what I think are) the missing rcu_dereference
calls needed on Alpha, and the implementation now has no known bugs.
(Actually, that's not quite true: the new capabilities in the lookup
APIs still need commenting.)

I realise that radix-tree.c isn't a trivial bit of code so I don't
expect reviews to be forthcoming, but if anyone had some spare time
to glance over it that would be great.

Is my given detail of the implementation clear? Sufficient? Would
diagrams be helpful?

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: radix-tree-lockless-readside.patch --]
[-- Type: text/plain, Size: 8902 bytes --]

Make radix tree lookups safe to be performed without locks. Readers
are protected against nodes being deleted by using RCU based freeing.
Readers are protected against new node insertion by using memory
barriers to ensure the node itself will be properly written before it
is visible in the radix tree.

Each radix tree node keeps a record of its height (above leaf
nodes). This height does not change after insertion -- when the radix
tree is extended, new nodes are only inserted at the top. So a
lookup can take the pointer to what is *now* the root node, and
traverse down it even if the tree is concurrently extended and this
node becomes a subtree of a new root.

When a reader wants to traverse the next branch, they will take a
copy of the pointer. This pointer will be either NULL (and the branch
is empty) or non-NULL (and will point to a valid node).

Also introduce a lockless radix_tree_gang_lookup_slot, which will be
used by a future patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -30,6 +30,7 @@
 #include <linux/gfp.h>
 #include <linux/string.h>
 #include <linux/bitops.h>
+#include <linux/rcupdate.h>
 
 
 #ifdef __KERNEL__
@@ -46,7 +47,9 @@
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
 struct radix_tree_node {
+	unsigned int	height;		/* Height from the bottom */
 	unsigned int	count;
+	struct rcu_head	rcu_head;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
 };
@@ -98,10 +101,17 @@ radix_tree_node_alloc(struct radix_tree_
 	return ret;
 }
 
+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+	struct radix_tree_node *node =
+			container_of(head, struct radix_tree_node, rcu_head);
+	kmem_cache_free(radix_tree_node_cachep, node);
+}
+
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	kmem_cache_free(radix_tree_node_cachep, node);
+	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
@@ -204,6 +214,7 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
+		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
 			return -ENOMEM;
 
@@ -216,9 +227,11 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
+		newheight = root->height+1;
+		node->height = newheight;
 		node->count = 1;
-		root->rnode = node;
-		root->height++;
+		rcu_assign_pointer(root->rnode, node);
+		root->height = newheight;
 	} while (height > root->height);
 out:
 	return 0;
@@ -258,11 +271,12 @@ int radix_tree_insert(struct radix_tree_
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
 				return -ENOMEM;
+			slot->height = height;
 			if (node) {
-				node->slots[offset] = slot;
+				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
 			} else
-				root->rnode = slot;
+				rcu_assign_pointer(root->rnode, slot);
 		}
 
 		/* Go a level down */
@@ -278,7 +292,7 @@ int radix_tree_insert(struct radix_tree_
 
 	BUG_ON(!node);
 	node->count++;
-	node->slots[offset] = item;
+	rcu_assign_pointer(node->slots[offset], item);
 	BUG_ON(tag_get(node, 0, offset));
 	BUG_ON(tag_get(node, 1, offset));
 
@@ -290,25 +304,29 @@ static inline void **__lookup_slot(struc
 				   unsigned long index)
 {
 	unsigned int height, shift;
-	struct radix_tree_node **slot;
+	struct radix_tree_node *node, **slot;
 
-	height = root->height;
+	/* Must take a copy now because root->rnode may change */
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return NULL;
+
+	height = node->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
 
-	while (height > 0) {
-		if (*slot == NULL)
+	do {
+		slot = (struct radix_tree_node **)
+			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+		node = rcu_dereference(*slot);
+		if (node == NULL)
 			return NULL;
 
-		slot = (struct radix_tree_node **)
-			((*slot)->slots +
-				((index >> shift) & RADIX_TREE_MAP_MASK));
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	}
+	} while (height > 0);
 
 	return (void **)slot;
 }
@@ -339,7 +357,7 @@ void *radix_tree_lookup(struct radix_tre
 	void **slot;
 
 	slot = __lookup_slot(root, index);
-	return slot != NULL ? *slot : NULL;
+	return slot != NULL ? rcu_dereference(*slot) : NULL;
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
@@ -501,26 +519,27 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift, height;
-	struct radix_tree_node *slot;
+	struct radix_tree_node *slot, *__s;
 	unsigned long i;
 
-	height = root->height;
-	if (height == 0)
+	slot = rcu_dereference(root->rnode);
+	if (!slot || slot->height == 0)
 		goto out;
 
+	height = slot->height;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = root->rnode;
 
 	for ( ; height > 1; height--) {
 
 		for (i = (index >> shift) & RADIX_TREE_MAP_MASK ;
 				i < RADIX_TREE_MAP_SIZE; i++) {
-			if (slot->slots[i] != NULL)
+			__s = rcu_dereference(slot->slots[i]);
+			if (__s != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
 			index += 1UL << shift;
@@ -531,14 +550,14 @@ __lookup(struct radix_tree_root *root, v
 			goto out;
 
 		shift -= RADIX_TREE_MAP_SHIFT;
-		slot = slot->slots[i];
+		slot = __s;
 	}
 
 	/* Bottom level: grab some items */
 	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
 		index++;
 		if (slot->slots[i]) {
-			results[nr_found++] = slot->slots[i];
+			results[nr_found++] = &slot->slots[i];
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -570,6 +589,43 @@ radix_tree_gang_lookup(struct radix_tree
 	unsigned int ret = 0;
 
 	while (ret < max_items) {
+		unsigned int nr_found, i;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		nr_found = __lookup(root, (void ***)results + ret, cur_index,
+					max_items - ret, &next_index);
+		for (i = 0; i < nr_found; i++)
+			results[ret + i] = *(((void ***)results)[ret + i]);
+		ret += nr_found;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ *	radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *
+ *	Same as radix_tree_gang_lookup, but returns an array of pointers
+ *	(slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items)
+{
+	const unsigned long max_index = radix_tree_maxindex(root->height);
+	unsigned long cur_index = first_index;
+	unsigned int ret = 0;
+
+	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
@@ -584,7 +640,8 @@ radix_tree_gang_lookup(struct radix_tree
 	}
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
 
 /*
  * FIXME: the two tag_get()s here should use find_next_bit() instead of
@@ -689,6 +746,11 @@ static inline void radix_tree_shrink(str
 			root->rnode->slots[0]) {
 		struct radix_tree_node *to_free = root->rnode;
 
+		/*
+		 * this doesn't need an rcu_assign_pointer, because
+		 * we aren't touching the object that to_free->slots[0]
+		 * points to.
+		 */
 		root->rnode = to_free->slots[0];
 		root->height--;
 		/* must only free zeroed nodes into the slab */
@@ -802,7 +864,7 @@ EXPORT_SYMBOL(radix_tree_delete);
 int radix_tree_tagged(struct radix_tree_root *root, int tag)
 {
   	struct radix_tree_node *rnode;
-  	rnode = root->rnode;
+  	rnode = rcu_dereference(root->rnode);
   	if (!rnode)
   		return 0;
 	return tag_get_any_node(rnode, tag);
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -51,6 +51,9 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,

* Re: [RFC] lockless radix tree readside
  2005-12-06  1:40 [RFC] lockless radix tree readside Nick Piggin
@ 2005-12-06  3:11   ` David S. Miller, Nick Piggin
  2005-12-06 15:53 ` Joe Seigh
  1 sibling, 0 replies; 8+ messages in thread
From: David S. Miller @ 2005-12-06  3:11 UTC (permalink / raw)
  To: nickpiggin; +Cc: Linux-Kernel, linux-mm, paul.mckenney, wfg

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Tue, 06 Dec 2005 12:40:56 +1100

> I realise that radix-tree.c isn't a trivial bit of code so I don't
> expect reviews to be forthcoming, but if anyone had some spare time
> to glance over it that would be great.

I went over this a few times and didn't find any obvious
problems with the RCU aspect of this.

> Is my given detail of the implementation clear? Sufficient? Would
> diagrams be helpful?

If I were to suggest an ascii diagram for a comment, it would be
one which would show the height invariant this patch takes advantage
of.
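
A rough sketch of that invariant, written as a single-threaded
userspace model rather than kernel code. The names MAP_SHIFT,
struct node, tree_extend() and lookup_from() are invented for
illustration, the fan-out is shrunk to 4, and there is no RCU or
barrier handling here. It only shows that a node's height stays fixed
when the tree grows, so a walk that starts from an older root pointer
still finds the item.

```c
#include <assert.h>
#include <stdlib.h>

/*
 *   before extend:          after extend:
 *
 *   root -> A (height 1)    root -> B (height 2)
 *                                    |
 *                                  slot 0
 *                                    v
 *                                   A (height 1, unchanged)
 */

#define MAP_SHIFT 2			/* hypothetical: 4 slots per node */
#define MAP_SIZE  (1UL << MAP_SHIFT)
#define MAP_MASK  (MAP_SIZE - 1)

struct node {
	unsigned int height;		/* set once, when the node is created */
	void *slots[MAP_SIZE];
};

struct tree {
	struct node *rnode;
};

static struct node *node_alloc(unsigned int height)
{
	struct node *n = calloc(1, sizeof(*n));

	assert(n != NULL);
	n->height = height;
	return n;
}

/* Grow by one level: the new node goes on top, the old root into slot 0. */
static void tree_extend(struct tree *t)
{
	struct node *old = t->rnode;
	struct node *top = node_alloc(old->height + 1);

	top->slots[0] = old;
	t->rnode = top;
}

/* Walk down using the starting node's own height (assumed >= 1). */
static void *lookup_from(struct node *n, unsigned long index)
{
	unsigned int height = n->height;

	if (index >> (height * MAP_SHIFT))
		return NULL;		/* index is outside this subtree */
	while (height > 1) {
		n = n->slots[(index >> ((height - 1) * MAP_SHIFT)) & MAP_MASK];
		if (n == NULL)
			return NULL;
		height--;
	}
	return n->slots[index & MAP_MASK];
}

/* Returns 1 if both the old and the new root reach the same item. */
static int demo_height_invariant(void)
{
	static int item = 42;
	struct tree t = { .rnode = node_alloc(1) };
	struct node *snapshot;
	int ok;

	t.rnode->slots[3] = &item;
	snapshot = t.rnode;		/* a reader samples the root here */
	tree_extend(&t);		/* then a writer grows the tree */

	ok = snapshot->height == 1 && t.rnode->height == 2 &&
	     lookup_from(snapshot, 3) == &item &&
	     lookup_from(t.rnode, 3) == &item;

	free(t.rnode);
	free(snapshot);
	return ok;
}
```

The stale snapshot can only miss indices that need the taller tree; it
never walks with the wrong shift, because the shift comes from the
node itself.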

Nice work.

* Re: [RFC] lockless radix tree readside
  2005-12-06  3:11   ` David S. Miller, Nick Piggin
@ 2005-12-06  5:48     ` Nick Piggin
  -1 siblings, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2005-12-06  5:48 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux-Kernel, linux-mm, paul.mckenney, wfg

David S. Miller wrote:
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Tue, 06 Dec 2005 12:40:56 +1100
> 
> 
>>I realise that radix-tree.c isn't a trivial bit of code so I don't
>>expect reviews to be forthcoming, but if anyone had some spare time
>>to glance over it that would be great.
> 
> 
> I went over this a few times and didn't find any obvious
> problems with the RCU aspect of this.
> 

Thanks!

> 
>>Is my given detail of the implementation clear? Sufficient? Would
>>diagrams be helpful?
> 
> 
> If I were to suggest an ascii diagram for a comment, it would be
> one which would show the height invariant this patch takes advantage
> of.
> 

I'll see if I can make something reasonably descriptive. And possibly
another diagram to show the node insertion concurrency cases vs lookup.
These things are the main concepts to understand, so I agree diagrams
might be helpful.

Nick

-- 
SUSE Labs, Novell Inc.

* Re: [RFC] lockless radix tree readside
  2005-12-06  1:40 [RFC] lockless radix tree readside Nick Piggin
  2005-12-06  3:11   ` David S. Miller, Nick Piggin
@ 2005-12-06 15:53 ` Joe Seigh
  2005-12-06 22:36     ` Nick Piggin
  1 sibling, 1 reply; 8+ messages in thread
From: Joe Seigh @ 2005-12-06 15:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

Nick Piggin wrote:
> The following patch against recent -mm kernels implements lockless
> radix tree lookups using RCU. No users of this new facility yet,
> but it is a requirement for lockless pagecache.
> 
> I have recently added (what I think are) the missing rcu_dereference
> calls needed on Alpha, and the implementation now has no known bugs.
> (actually that's wrong: the new capabilities in the lookup APIs need
> commenting)
> 
> I realise that radix-tree.c isn't a trivial bit of code so I don't
> expect reviews to be forthcoming, but if anyone had some spare time
> to glance over it that would be great.
> 
> Is my given detail of the implementation clear? Sufficient? Would
> diagrams be helpful?
> 

Well, I don't have a kernel development setup so I can't comment on
the specific patch but I have done some minor experimentation with reader
lock-free b-trees, specifically insert, delete, and rotate (no actual
balancing heuristics though) so I can comment on what some of the 
general issues are.

You need to have a serialization point in your tree modifications so
the change becomes atomically visible to threads reading the tree.
This is important for the semantics of your data structure.  It's not
good to have a node become temporarily invisible to readers if the
tree operation involved moving a node or subtree around with more than
a single link modification.  So you will likely find yourself needing to use
COW (copy on write) or PCOW (partial copy on write), particularly on
deletes of non leaf nodes. PCOW is naturally better, especially if you
can minimize the number of nodes that have to be copied.

So that's probably what you want to have in your documentation; what
the serialization points are, your COW or PCOW mechanism, and how
they preserve semantics.

Also, I assume you're returning lookups by value and not by reference,
unless they're refcounted (in which case, since you're using RCU, the
refcount can be incremented safely as long as it is not zero).

--
Joe Seigh

* Re: [RFC] lockless radix tree readside
  2005-12-06 15:53 ` Joe Seigh
@ 2005-12-06 22:36     ` Nick Piggin
  0 siblings, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2005-12-06 22:36 UTC (permalink / raw)
  To: Joe Seigh; +Cc: linux-kernel, linux-mm

Joe Seigh wrote:

> Well, I don't have a kernel development setup so I can't comment on
> the specific patch but I have done some minor experimentation with reader
> lock-free b-trees, specifically insert, delete, and rotate (no actual
> balancing heuristics though) so I can comment on what some of the 
> general issues are.
> 
> You need to have a serialization point in your tree modifications so
> the change becomes atomically visible to threads reading the tree.

Yes, that is the memory barrier in rcu_assign_pointer.
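
As a userspace analogy only (this is not how the kernel macros are
written), the publish/read pairing can be sketched with C11 atomics.
The names demo_node, demo_slot, demo_publish() and demo_read() are
made up for this example. A release store plays the role of
rcu_assign_pointer(), and an acquire load is used as a conservative
stand-in for the dependency ordering that rcu_dereference() provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
	unsigned int height;
	int payload;
};

/* Stands in for one slot of the tree; starts out NULL (empty branch). */
static _Atomic(struct demo_node *) demo_slot;

/* Writer: fully initialise the node, then make it reachable. */
static void demo_publish(struct demo_node *n, unsigned int height, int payload)
{
	n->height = height;
	n->payload = payload;
	/* Release: the two stores above are ordered before the pointer store. */
	atomic_store_explicit(&demo_slot, n, memory_order_release);
}

/* Reader: take one copy of the pointer, then use only that copy. */
static int demo_read(int *payload_out)
{
	struct demo_node *n =
		atomic_load_explicit(&demo_slot, memory_order_acquire);

	if (n == NULL)
		return 0;		/* branch is empty */
	*payload_out = n->payload;	/* sees the initialised node */
	return 1;
}
```

The single pointer store is the point at which the new node becomes
visible: a reader sees either NULL or a node whose fields were written
before it was published.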

> This is important for the semantics of your data structure.  It's not
> good to have a node become temporarily invisible to readers if the
> tree operation involved moving a node or subtree around with more than
> a single link modification.  So you will likely find yourself needing to use
> COW (copy on write) or PCOW (partial copy on write), particularly on
> deletes of non leaf nodes. PCOW is naturally better, especially if you
> can minimize the number of nodes that have to be copied.
> 

Fortunately the radix tree never needs to do anything like this.
It doesn't move nodes or subtrees - the only modification operations
needed are to insert and delete items (ignoring the tag operations,
which are done under lock).

> So that's probably what you want to have in your documentation; what
> the serialization points are, your COW or PCOW mechanism, and how
> they preserve semantics.
> 
> Also, I assume you're returning lookups by value and not by reference,
> unless they're refcounted (in which case, since you're using RCU, the
> refcount can be incremented safely as long as it is not zero).
> 

It can return either. It is up to the reader to do the right thing
in either case (which will need a note in the API comments).
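
For the refcounted case, one possible shape of "the right thing" is an
increment that refuses to resurrect an object whose count has already
hit zero. The sketch below uses C11 atomics in userspace; demo_obj and
demo_get_unless_zero() are hypothetical names, and it assumes that
something else (RCU, in the kernel) keeps the object's memory valid
while the reader is looking at it. The kernel's atomic_inc_not_zero()
follows the same idea.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_obj {
	atomic_int refcount;	/* 0 means the object is being torn down */
};

/* Take a reference only if the object is still live. */
static bool demo_get_unless_zero(struct demo_obj *obj)
{
	int old = atomic_load(&obj->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&obj->refcount, &old, old + 1))
			return true;
	}
	return false;		/* lost the race with the final put */
}
```

A reader that gets false back should treat the lookup as a miss and,
if appropriate, retry it.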

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.
