* [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
From: Konstantin Khlebnikov @ 2016-08-27 14:14 UTC
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Kirill A. Shutemov, Ross Zwisler

This patch adds a function for filling and truncating ranges of slots:

radix_tree_node *radix_tree_fill_range(root, start, end, item, flags)

It fills the slots in range "start".."end" with "item" and returns a pointer
to the last filled node. Filling with NULL truncates the range.

This is intended for managing transparent huge pages in the page cache,
where all entries are aligned, but the function can handle arbitrary
unaligned ranges. It might also be useful for PAT or VMA-like extent trees.

By default, filling a range constructs a shallow tree: entries are assigned
directly to inner slots where possible. In the worst case any range requires
only 2 * RADIX_TREE_MAX_PATH nodes. If the length is a power of two and the
start index is aligned then all slots are always in a single node and at
most RADIX_TREE_MAX_PATH nodes are required. For example, with the default
RADIX_TREE_MAP_SHIFT = 6, an aligned 512-slot range (a 2MB THP with 4K
pages) occupies eight adjacent slots of one node.

Function accepts several flags:

RADIX_TREE_FILL_LEAVES  - build a deep tree, insert the entry into leaves.

RADIX_TREE_FILL_OVERWRITE - overwrite instead of failing with -EEXIST.

RADIX_TREE_FILL_ATOMIC - play well with concurrent RCU-protected lookups:
fill new nodes with RADIX_TREE_RETRY before inserting them into the tree.
On subsequent iterations these slots are filled with @item or sub-nodes.

RADIX_TREE_FILL_CLEAR_TAGS - also clears all tags.

radix_tree_fill_range() returns a pointer to the node which holds the last
slot in the range, NULL if that is the root slot, or an ERR_PTR in case of
error.

Thus, radix_tree_fill_range() can handle all operations required for THP:

* Insert
Fill the range with a pointer to the head page.

radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
		      RADIX_TREE_FILL_ATOMIC)

* Remove
Fill the range with NULL or a shadow entry; the returned value will be used
for linking nodes that contain only shadow entries into the slab shrinker.

radix_tree_fill_range(root, index, index + nr_pages - 1, NULL,
		      RADIX_TREE_FILL_OVERWRITE)

* Merge
Fill the range with overwrite to replace 0-order pages with a THP.

radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
		      RADIX_TREE_FILL_OVERWRITE | RADIX_TREE_FILL_ATOMIC)

* Split
Two passes: first fill the leaves with the head_page entry, then replace
each slot with a pointer to the individual tail page. This could be done in
a single pass but would make radix_tree_fill_range() much more complicated.

radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
		      RADIX_TREE_FILL_LEAVES | RADIX_TREE_FILL_OVERWRITE |
		      RADIX_TREE_FILL_ATOMIC);
radix_tree_for_each_slot(...)
	radix_tree_replace_slot(slot, head + iter.index - head->index);


Page lookup and the iterator will return a pointer to the head page for any
index within the range.


Code inside the iterator loop can detect a huge entry, handle all sub-pages,
and jump to the next index using the new helper radix_tree_iter_jump():

slot = radix_tree_iter_jump(&iter, page->index + hpage_nr_pages(page));

This helper has built-in protection against overflow: a jump to index = 0
stops the iterator. This uses existing logic in radix_tree_next_chunk():
if iter.next_index is zero then iter.index must be zero too.
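
A minimal sketch of such a loop (not part of this patch; it assumes the
existing helpers radix_tree_deref_slot(), radix_tree_deref_retry() and
radix_tree_iter_retry()):

	rcu_read_lock();
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
		struct page *page = radix_tree_deref_slot(slot);

		if (radix_tree_deref_retry(page)) {
			slot = radix_tree_iter_retry(&iter);
			continue;
		}
		/* ... handle the head page and all its sub-pages ... */
		slot = radix_tree_iter_jump(&iter,
				page->index + hpage_nr_pages(page));
	}
	rcu_read_unlock();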


Tags should be set only for the last index of a THP range: this way the
iterator will find them regardless of the starting index.
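
For example, to mark a THP dirty only the last slot of its range would be
tagged (a sketch, assuming the usual page-cache dirty tag):

	radix_tree_tag_set(&mapping->page_tree, index + nr_pages - 1,
			   PAGECACHE_TAG_DIRTY);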

radix_tree_preload_range() pre-allocates nodes for filling a range.
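
A minimal sketch of the intended call pattern for a THP insert (not part of
this patch; error handling trimmed, mapping->tree_lock taken as the tree
lock):

	err = radix_tree_preload_range(GFP_KERNEL, index,
				       index + nr_pages - 1, 0);
	if (err)
		return err;
	spin_lock_irq(&mapping->tree_lock);
	node = radix_tree_fill_range(&mapping->page_tree, index,
				     index + nr_pages - 1, head_page,
				     RADIX_TREE_FILL_ATOMIC);
	err = PTR_ERR_OR_ZERO(node);
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();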

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 include/linux/radix-tree.h |   46 ++++++++
 lib/radix-tree.c           |  245 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 291 insertions(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4613bf35c311..af33e8d93ec3 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -319,6 +319,35 @@ static inline void radix_tree_preload_end(void)
 	preempt_enable();
 }
 
+#define RADIX_TREE_FILL_LEAVES		1 /* build full depth tree */
+#define RADIX_TREE_FILL_OVERWRITE	2 /* overwrite non-empty slots */
+#define RADIX_TREE_FILL_CLEAR_TAGS	4 /* clear all tags */
+#define RADIX_TREE_FILL_ATOMIC		8 /* play well with rcu lookup */
+
+struct radix_tree_node *
+radix_tree_fill_range(struct radix_tree_root *root, unsigned long start,
+		      unsigned long end, void *item, unsigned int flags);
+
+int radix_tree_preload_range(gfp_t gfp_mask, unsigned long start,
+			     unsigned long end, unsigned int flags);
+
+/**
+ * radix_tree_truncate_range  - remove everything in range
+ * @root:	radix tree root
+ * @start:	first index
+ * @end:	last index
+ *
+ * This function removes all items and tags within the given range.
+ */
+static inline void
+radix_tree_truncate_range(struct radix_tree_root *root,
+			  unsigned long start, unsigned long end)
+{
+	radix_tree_fill_range(root, start, end, NULL,
+			      RADIX_TREE_FILL_OVERWRITE |
+			      RADIX_TREE_FILL_CLEAR_TAGS);
+}
+
 /**
  * struct radix_tree_iter - radix tree iterator state
  *
@@ -435,6 +464,23 @@ void **radix_tree_iter_next(struct radix_tree_iter *iter)
 }
 
 /**
+ * radix_tree_iter_jump - restart iterating from the given index if non-zero
+ * @iter:	iterator state
+ * @index:	next index
+ *
+ * If index is zero then the iterator will stop. This protects against an
+ * endless loop when the index overflows after visiting the last entry.
+ */
+static inline __must_check
+void **radix_tree_iter_jump(struct radix_tree_iter *iter, unsigned long index)
+{
+	iter->index = index - 1;
+	iter->next_index = index;
+	iter->tags = 0;
+	return NULL;
+}
+
+/**
  * radix_tree_chunk_size - get current chunk size
  *
  * @iter:	pointer to radix tree iterator
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1b7bf7314141..c46a60065a77 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -36,6 +36,7 @@
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
 #include <linux/preempt.h>		/* in_interrupt() */
+#include <linux/err.h>
 
 
 /* Number of nodes in fully populated tree of given height */
@@ -1014,6 +1015,250 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 EXPORT_SYMBOL(radix_tree_next_chunk);
 
 /**
+ * radix_tree_preload_range  - preload nodes for filling a range.
+ * @gfp_mask:	allocation mask
+ * @start:	first index
+ * @end:	last index
+ * @flags:	RADIX_TREE_FILL_*
+ */
+int radix_tree_preload_range(gfp_t gfp_mask, unsigned long start,
+			     unsigned long end, unsigned int flags)
+{
+	unsigned long length = end - start + 1;
+	int nr_nodes, shift;
+
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	if (!gfpflags_allow_blocking(gfp_mask)) {
+		preempt_disable();
+		return 0;
+	}
+
+	/*
+	 * For filling leaves the tree must cover all indexes in the range at
+	 * all levels, plus RADIX_TREE_MAX_PATH nodes for growing the tree
+	 * depth; only the root node is shared for sure.
+	 *
+	 * For an aligned range we need RADIX_TREE_MAX_PATH nodes for growing
+	 * the depth and RADIX_TREE_MAX_PATH for the path where all slots go.
+	 *
+	 * For an arbitrary range we again need RADIX_TREE_MAX_PATH for
+	 * growing the depth plus two RADIX_TREE_MAX_PATH chains for the arc
+	 * of slots from leaf to root and back. Only the root node is shared.
+	 */
+	if (flags & RADIX_TREE_FILL_LEAVES) {
+		if (start > end)
+			return -EINVAL;
+		shift = 0;
+		nr_nodes = RADIX_TREE_MAX_PATH - 1;
+		do {
+			shift += RADIX_TREE_MAP_SHIFT;
+			nr_nodes += (end >> shift) - (start >> shift) + 1;
+		} while (shift < RADIX_TREE_INDEX_BITS);
+	} else if (is_power_of_2(length) && IS_ALIGNED(start, length))
+		nr_nodes = RADIX_TREE_MAX_PATH * 2 - 1;
+	else
+		nr_nodes = RADIX_TREE_MAX_PATH * 3 - 2;
+	return __radix_tree_preload(gfp_mask, nr_nodes);
+}
+EXPORT_SYMBOL(radix_tree_preload_range);
+
+/**
+ * radix_tree_fill_range - fill range of slots
+ * @root:	radix tree root
+ * @start:	first index
+ * @end:	last index
+ * @item:	value for filling, NULL for removing
+ * @flags:	RADIX_TREE_FILL_* flags
+ * Returns:	pointer to the last node or NULL, ERR_PTR on error
+ *
+ * By default builds a shallow tree: the entry is assigned to inner slots
+ * where possible. In the worst case a range requires up to
+ * 2 * RADIX_TREE_MAX_PATH nodes plus RADIX_TREE_MAX_PATH for extending depth.
+ *
+ * If the length is 2^n and start is aligned to it, all slots are in one node.
+ *
+ * This function cannot fill or cut part of a bigger range if that requires
+ * splitting inner slots and inserting new nodes: it fails with -ERANGE.
+ *
+ * With flag RADIX_TREE_FILL_LEAVES builds a deep tree and inserts @item into
+ * leaf slots. This requires many more nodes.
+ *
+ * With flag RADIX_TREE_FILL_OVERWRITE removes everything in the range and
+ * cuts the sub-tree if @item is NULL. Without that flag the function undoes
+ * all changes and fails with -EEXIST if it finds any populated slot.
+ *
+ * With flag RADIX_TREE_FILL_ATOMIC the function plays well with rcu-protected
+ * lookups: it fills new nodes with RADIX_TREE_RETRY before inserting them
+ * into the tree, so a lookup sees either the old entry, @item or a retry
+ * entry. Later iterations fill these slots with @item or sub-nodes.
+ *
+ * With flag RADIX_TREE_FILL_CLEAR_TAGS also clears all tags.
+ *
+ * The function returns a pointer to the node which holds the last slot in
+ * the range, NULL if that was the root slot, or an ERR_PTR: -ENOMEM,
+ * -EEXIST or -ERANGE.
+ */
+struct radix_tree_node *
+radix_tree_fill_range(struct radix_tree_root *root, unsigned long start,
+		      unsigned long end, void *item, unsigned int flags)
+{
+	unsigned long index = start, maxindex;
+	struct radix_tree_node *node, *child;
+	int error, root_shift, shift, tag, offset;
+	void *entry;
+
+	/* Sanity check */
+	if (start > end)
+		return ERR_PTR(-EINVAL);
+
+	/* Make sure the tree is high enough.  */
+	root_shift = radix_tree_load_root(root, &node, &maxindex);
+	if (end > maxindex) {
+		error = radix_tree_extend(root, end, root_shift);
+		if (error < 0)
+			return ERR_PTR(error);
+		root_shift = error;
+	}
+
+	/* Special case: single slot tree */
+	if (!root_shift) {
+		if (node && (!(flags & RADIX_TREE_FILL_OVERWRITE)))
+			return ERR_PTR(-EEXIST);
+		if (flags & RADIX_TREE_FILL_CLEAR_TAGS)
+			root_tag_clear_all(root);
+		rcu_assign_pointer(root->rnode, item);
+		return NULL;
+	}
+
+next_node:
+	node = NULL;
+	offset = 0;
+	entry = rcu_dereference_raw(root->rnode);
+	shift = root_shift;
+
+	/* Descend to the index. Do at least one step. */
+	do {
+		child = entry_to_node(entry);
+		shift -= RADIX_TREE_MAP_SHIFT;
+		if (!child || !radix_tree_is_internal_node(entry)) {
+			/* Entry wider than range */
+			if (child) {
+				error = -ERANGE;
+				goto undo;
+			}
+			/* Hole wider than truncated range */
+			if (!item)
+				goto skip_node;
+			child = radix_tree_node_alloc(root);
+			if (!child) {
+				error = -ENOMEM;
+				goto undo;
+			}
+			child->shift = shift;
+			child->offset = offset;
+			child->parent = node;
+			/* Populate range with retry entries. */
+			if (flags & RADIX_TREE_FILL_ATOMIC) {
+				int idx = (index >> shift) &
+					   RADIX_TREE_MAP_MASK;
+				int last = RADIX_TREE_MAP_SIZE;
+
+				if (end < (index | shift_maxindex(shift)))
+					last = (end >> shift) &
+						RADIX_TREE_MAP_MASK;
+				for (; idx <= last; idx++)
+					child->slots[idx] = RADIX_TREE_RETRY;
+			}
+			entry = node_to_entry(child);
+			if (node) {
+				rcu_assign_pointer(node->slots[offset], entry);
+				node->count++;
+			} else
+				rcu_assign_pointer(root->rnode, entry);
+		}
+		node = child;
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		entry = rcu_dereference_raw(node->slots[offset]);
+
+		/* Stop when we find a leaf or a slot inside the range */
+	} while ((flags & RADIX_TREE_FILL_LEAVES) ? shift :
+			((index & ((1ul << shift) - 1)) ||
+			 (index | ((1ul << shift) - 1)) > end));
+
+next_slot:
+	/* NULL or retry entry */
+	if (entry <= RADIX_TREE_RETRY)
+		goto fill;
+
+	if (!(flags & RADIX_TREE_FILL_OVERWRITE)) {
+		error = -EEXIST;
+		goto undo;
+	}
+
+	/* Cut sub-tree */
+	if (unlikely(radix_tree_is_internal_node(entry))) {
+		rcu_assign_pointer(node->slots[offset], item);
+		child = entry_to_node(entry);
+		offset = 0;
+		do {
+			entry = rcu_dereference_raw(child->slots[offset]);
+			if (entry)
+				child->count--;
+			if (radix_tree_is_internal_node(entry)) {
+				child = entry_to_node(entry);
+				offset = 0;
+			} else if (++offset == RADIX_TREE_MAP_SIZE) {
+				offset = child->offset;
+				entry = child->parent;
+				WARN_ON_ONCE(child->count);
+				radix_tree_node_free(child);
+				child = entry;
+			}
+		} while (child != node);
+	}
+
+	if (flags & RADIX_TREE_FILL_CLEAR_TAGS) {
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			node_tag_clear(root, node, tag, offset);
+	}
+
+	/* Skip the rest if we cleared the last slot in the node */
+	if (!--node->count && !item && __radix_tree_delete_node(root, node))
+		goto skip_node;
+
+
+fill:
+	rcu_assign_pointer(node->slots[offset], item);
+	if (item)
+		node->count++;
+
+	index += 1ul << shift;
+	if (index - 1 == end)
+		return node;
+
+	/* Next slot in this node and still in range */
+	if (index + (1ul << shift) - 1 <= end &&
+			++offset < RADIX_TREE_MAP_SIZE) {
+		entry = rcu_dereference_raw(node->slots[offset]);
+		goto next_slot;
+	}
+
+	goto next_node;
+
+skip_node:
+	index |= shift_maxindex(shift);
+	if (index++ >= end)
+		return node;
+	goto next_node;
+
+undo:
+	if (index > start)
+		radix_tree_fill_range(root, start, index - 1, NULL,
+				      RADIX_TREE_FILL_OVERWRITE);
+	return ERR_PTR(error);
+}
+EXPORT_SYMBOL(radix_tree_fill_range);
+
+/**
  * radix_tree_range_tag_if_tagged - for each item in given range set given
  *				   tag if item has another tag set
  * @root:		radix tree root


* [PATCH RFC 2/4] lib/radix-tree: remove sibling entries
From: Konstantin Khlebnikov @ 2016-08-27 14:16 UTC
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Kirill A. Shutemov, Ross Zwisler

The current implementation stores a huge entry as a "canonical" head entry
with a tail of "sibling" entries which point backward to the head. The
iterator jumps back and forward when it sees them. This complication is not
required for THP in the page cache because struct page can tell us its
start index and size.

This patch removes support for sibling entries but keeps the part that
allows storing data pointers in non-leaf slots. Huge pages will be stored
as a range of slots which point to the head page.

This allows simplifying the fast path in radix_tree_next_chunk(): a huge
entry is reported as a single-slot chunk for any index within its range.
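
For example, after this change a lookup behaves like this (sketch, not code
from this patch):

	/* Any index inside the huge range resolves to the head page. */
	struct page *head = radix_tree_lookup(&mapping->page_tree, index);

	/* head->index <= index <= head->index + hpage_nr_pages(head) - 1 */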

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 include/linux/radix-tree.h |   75 ++++-----------------
 lib/radix-tree.c           |  158 +++++++++++++-------------------------------
 2 files changed, 60 insertions(+), 173 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index af33e8d93ec3..1721ddbf981d 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -37,8 +37,8 @@
  * 10 - exceptional entry
  * 11 - this bit combination is currently unused/reserved
  *
- * The internal entry may be a pointer to the next level in the tree, a
- * sibling entry, or an indicator that the entry in this slot has been moved
+ * The internal entry may be a pointer to the next level in the tree
+ * or an indicator that the entry in this slot has been moved
  * to another location in the tree and the lookup should be restarted.  While
  * NULL fits the 'data pointer' pattern, it means that there is no entry in
  * the tree for this index (no matter what level of the tree it is found at).
@@ -265,13 +265,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 			unsigned order, struct radix_tree_node **nodep,
 			void ***slotp);
-int __radix_tree_insert(struct radix_tree_root *, unsigned long index,
-			unsigned order, void *);
-static inline int radix_tree_insert(struct radix_tree_root *root,
-			unsigned long index, void *entry)
-{
-	return __radix_tree_insert(root, index, 0, entry);
-}
+int radix_tree_insert(struct radix_tree_root *, unsigned long index, void *);
 void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 			  struct radix_tree_node **nodep, void ***slotp);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
@@ -354,7 +348,6 @@ radix_tree_truncate_range(struct radix_tree_root *root,
  * @index:	index of current slot
  * @next_index:	one beyond the last index for this chunk
  * @tags:	bit-mask for tag-iterating
- * @shift:	shift for the node that holds our slots
  *
  * This radix tree iterator works in terms of "chunks" of slots.  A chunk is a
  * subinterval of slots contained within one radix tree leaf node.  It is
@@ -367,20 +360,8 @@ struct radix_tree_iter {
 	unsigned long	index;
 	unsigned long	next_index;
 	unsigned long	tags;
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-	unsigned int	shift;
-#endif
 };
 
-static inline unsigned int iter_shift(struct radix_tree_iter *iter)
-{
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-	return iter->shift;
-#else
-	return 0;
-#endif
-}
-
 #define RADIX_TREE_ITER_TAG_MASK	0x00FF	/* tag index in lower byte */
 #define RADIX_TREE_ITER_TAGGED		0x0100	/* lookup tagged slots */
 #define RADIX_TREE_ITER_CONTIG		0x0200	/* stop at first hole */
@@ -441,12 +422,6 @@ void **radix_tree_iter_retry(struct radix_tree_iter *iter)
 	return NULL;
 }
 
-static inline unsigned long
-__radix_tree_iter_add(struct radix_tree_iter *iter, unsigned long slots)
-{
-	return iter->index + (slots << iter_shift(iter));
-}
-
 /**
  * radix_tree_iter_next - resume iterating when the chunk may be invalid
  * @iter:	iterator state
@@ -458,7 +433,7 @@ __radix_tree_iter_add(struct radix_tree_iter *iter, unsigned long slots)
 static inline __must_check
 void **radix_tree_iter_next(struct radix_tree_iter *iter)
 {
-	iter->next_index = __radix_tree_iter_add(iter, 1);
+	iter->next_index = iter->index + 1;
 	iter->tags = 0;
 	return NULL;
 }
@@ -489,7 +464,7 @@ void **radix_tree_iter_jump(struct radix_tree_iter *iter, unsigned long index)
 static __always_inline long
 radix_tree_chunk_size(struct radix_tree_iter *iter)
 {
-	return (iter->next_index - iter->index) >> iter_shift(iter);
+	return iter->next_index - iter->index;
 }
 
 static inline struct radix_tree_node *entry_to_node(void *ptr)
@@ -508,6 +483,9 @@ static inline struct radix_tree_node *entry_to_node(void *ptr)
  * This function updates @iter->index in the case of a successful lookup.
  * For tagged lookup it also eats @iter->tags.
  *
+ * Please keep this fast-path as small as possible. Complicated logic is hidden
+ * inside radix_tree_next_chunk() which prepares chunks for this function.
+ *
  * There are several cases where 'slot' can be passed in as NULL to this
  * function.  These cases result from the use of radix_tree_iter_next() or
  * radix_tree_iter_retry().  In these cases we don't end up dereferencing
@@ -520,49 +498,24 @@ static __always_inline void **
 radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
 {
 	if (flags & RADIX_TREE_ITER_TAGGED) {
-		void *canon = slot;
-
 		iter->tags >>= 1;
-		if (unlikely(!iter->tags))
-			return NULL;
-		while (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) &&
-					radix_tree_is_internal_node(slot[1])) {
-			if (entry_to_node(slot[1]) == canon) {
-				iter->tags >>= 1;
-				iter->index = __radix_tree_iter_add(iter, 1);
-				slot++;
-				continue;
-			}
-			iter->next_index = __radix_tree_iter_add(iter, 1);
-			return NULL;
-		}
 		if (likely(iter->tags & 1ul)) {
-			iter->index = __radix_tree_iter_add(iter, 1);
+			iter->index++;
 			return slot + 1;
 		}
-		if (!(flags & RADIX_TREE_ITER_CONTIG)) {
+		if (!(flags & RADIX_TREE_ITER_CONTIG) && likely(iter->tags)) {
 			unsigned offset = __ffs(iter->tags);
 
 			iter->tags >>= offset;
-			iter->index = __radix_tree_iter_add(iter, offset + 1);
+			iter->index += offset + 1;
 			return slot + offset + 1;
 		}
 	} else {
-		long count = radix_tree_chunk_size(iter);
-		void *canon = slot;
+		long size = radix_tree_chunk_size(iter);
 
-		while (--count > 0) {
+		while (--size > 0) {
 			slot++;
-			iter->index = __radix_tree_iter_add(iter, 1);
-
-			if (IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) &&
-			    radix_tree_is_internal_node(*slot)) {
-				if (entry_to_node(*slot) == canon)
-					continue;
-				iter->next_index = iter->index;
-				break;
-			}
-
+			iter->index++;
 			if (likely(*slot))
 				return slot;
 			if (flags & RADIX_TREE_ITER_CONTIG) {
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c46a60065a77..234f1ddbd7a9 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -77,21 +77,6 @@ static inline void *node_to_entry(void *ptr)
 
 #define RADIX_TREE_RETRY	node_to_entry(NULL)
 
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-/* Sibling slots point directly to another slot in the same node */
-static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node)
-{
-	void **ptr = node;
-	return (parent->slots <= ptr) &&
-			(ptr < parent->slots + RADIX_TREE_MAP_SIZE);
-}
-#else
-static inline bool is_sibling_entry(struct radix_tree_node *parent, void *node)
-{
-	return false;
-}
-#endif
-
 static inline unsigned long get_slot_offset(struct radix_tree_node *parent,
 						 void **slot)
 {
@@ -104,16 +89,6 @@ static unsigned int radix_tree_descend(struct radix_tree_node *parent,
 	unsigned int offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK;
 	void **entry = rcu_dereference_raw(parent->slots[offset]);
 
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-	if (radix_tree_is_internal_node(entry)) {
-		unsigned long siboff = get_slot_offset(parent, entry);
-		if (siboff < RADIX_TREE_MAP_SIZE) {
-			offset = siboff;
-			entry = rcu_dereference_raw(parent->slots[offset]);
-		}
-	}
-#endif
-
 	*nodep = (void *)entry;
 	return offset;
 }
@@ -232,12 +207,7 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
 		void *entry = node->slots[i];
 		if (!entry)
 			continue;
-		if (is_sibling_entry(node, entry)) {
-			pr_debug("radix sblng %p offset %ld val %p indices %ld-%ld\n",
-					entry, i,
-					*(void **)entry_to_node(entry),
-					first, last);
-		} else if (!radix_tree_is_internal_node(entry)) {
+		if (!radix_tree_is_internal_node(entry)) {
 			pr_debug("radix entry %p offset %ld indices %ld-%ld\n",
 					entry, i, first, last);
 		} else {
@@ -596,25 +566,6 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 		slot = &node->slots[offset];
 	}
 
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-	/* Insert pointers to the canonical entry */
-	if (order > shift) {
-		unsigned i, n = 1 << (order - shift);
-		offset = offset & ~(n - 1);
-		slot = &node->slots[offset];
-		child = node_to_entry(slot);
-		for (i = 0; i < n; i++) {
-			if (slot[i])
-				return -EEXIST;
-		}
-
-		for (i = 1; i < n; i++) {
-			rcu_assign_pointer(slot[i], child);
-			node->count++;
-		}
-	}
-#endif
-
 	if (nodep)
 		*nodep = node;
 	if (slotp)
@@ -623,16 +574,15 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 }
 
 /**
- *	__radix_tree_insert    -    insert into a radix tree
+ *	radix_tree_insert    -    insert into a radix tree
  *	@root:		radix tree root
  *	@index:		index key
- *	@order:		key covers the 2^order indices around index
  *	@item:		item to insert
  *
  *	Insert an item into the radix tree at position @index.
  */
-int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
-			unsigned order, void *item)
+int radix_tree_insert(struct radix_tree_root *root, unsigned long index,
+		      void *item)
 {
 	struct radix_tree_node *node;
 	void **slot;
@@ -640,7 +590,7 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
 
 	BUG_ON(radix_tree_is_internal_node(item));
 
-	error = __radix_tree_create(root, index, order, &node, &slot);
+	error = __radix_tree_create(root, index, 0, &node, &slot);
 	if (error)
 		return error;
 	if (*slot != NULL)
@@ -659,7 +609,7 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
 
 	return 0;
 }
-EXPORT_SYMBOL(__radix_tree_insert);
+EXPORT_SYMBOL(radix_tree_insert);
 
 /**
  *	__radix_tree_lookup	-	lookup an item in a radix tree
@@ -895,14 +845,6 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 }
 EXPORT_SYMBOL(radix_tree_tag_get);
 
-static inline void __set_iter_shift(struct radix_tree_iter *iter,
-					unsigned int shift)
-{
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-	iter->shift = shift;
-#endif
-}
-
 /**
  * radix_tree_next_chunk - find next chunk of slots for iteration
  *
@@ -914,9 +856,10 @@ static inline void __set_iter_shift(struct radix_tree_iter *iter,
 void **radix_tree_next_chunk(struct radix_tree_root *root,
 			     struct radix_tree_iter *iter, unsigned flags)
 {
-	unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK;
-	struct radix_tree_node *node, *child;
+	unsigned int shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
 	unsigned long index, offset, maxindex;
+	struct radix_tree_node *node;
+	void *entry;
 
 	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
 		return NULL;
@@ -934,28 +877,27 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 	if (!index && iter->index)
 		return NULL;
 
- restart:
-	radix_tree_load_root(root, &child, &maxindex);
+restart:
+	shift = radix_tree_load_root(root, &node, &maxindex);
 	if (index > maxindex)
 		return NULL;
-	if (!child)
-		return NULL;
 
-	if (!radix_tree_is_internal_node(child)) {
+	if (!maxindex) {
 		/* Single-slot tree */
-		iter->index = index;
-		iter->next_index = maxindex + 1;
-		iter->tags = 1;
-		__set_iter_shift(iter, 0);
+		iter->index = 0;
+		iter->next_index = 1;
+		iter->tags = 0;
 		return (void **)&root->rnode;
 	}
 
-	do {
-		node = entry_to_node(child);
-		offset = radix_tree_descend(node, &child, index);
+	node = entry_to_node(node);
+	while (1) {
+		shift -= RADIX_TREE_MAP_SHIFT;
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		entry = rcu_dereference_raw(node->slots[offset]);
 
 		if ((flags & RADIX_TREE_ITER_TAGGED) ?
-				!tag_get(node, tag, offset) : !child) {
+				!tag_get(node, tag, offset) : !entry) {
 			/* Hole detected */
 			if (flags & RADIX_TREE_ITER_CONTIG)
 				return NULL;
@@ -967,30 +909,42 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 						offset + 1);
 			else
 				while (++offset	< RADIX_TREE_MAP_SIZE) {
-					void *slot = node->slots[offset];
-					if (is_sibling_entry(node, slot))
-						continue;
-					if (slot)
+					if (node->slots[offset])
 						break;
 				}
-			index &= ~node_maxindex(node);
-			index += offset << node->shift;
+
+			index &= ~shift_maxindex(shift);
+			index += offset << shift;
 			/* Overflow after ~0UL */
 			if (!index)
 				return NULL;
 			if (offset == RADIX_TREE_MAP_SIZE)
 				goto restart;
-			child = rcu_dereference_raw(node->slots[offset]);
+			entry = rcu_dereference_raw(node->slots[offset]);
 		}
 
-		if ((child == NULL) || (child == RADIX_TREE_RETRY))
+		/* This is a leaf node */
+		if (!shift)
+			break;
+
+		/* Non-leaf data entry */
+		if (!radix_tree_is_internal_node(entry)) {
+			/* Report as a single slot chunk */
+			iter->index = index;
+			iter->next_index = index + 1;
+			iter->tags = 0;
+			return node->slots + offset;
+		}
+
+		node = entry_to_node(entry);
+		/* RADIX_TREE_RETRY */
+		if (!node)
 			goto restart;
-	} while (radix_tree_is_internal_node(child));
+	}
 
 	/* Update the iterator state */
-	iter->index = (index &~ node_maxindex(node)) | (offset << node->shift);
-	iter->next_index = (index | node_maxindex(node)) + 1;
-	__set_iter_shift(iter, node->shift);
+	iter->index = index;
+	iter->next_index = (index | RADIX_TREE_MAP_MASK) + 1;
 
 	/* Construct iter->tags bit-mask from node->tags[tag] array */
 	if (flags & RADIX_TREE_ITER_TAGGED) {
@@ -1319,7 +1273,6 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
 			goto next;
 		if (!tag_get(node, iftag, offset))
 			goto next;
-		/* Sibling slots never have tags set on them */
 		if (radix_tree_is_internal_node(child)) {
 			node = entry_to_node(child);
 			continue;
@@ -1341,7 +1294,7 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
 				break;
 			tag_set(parent, settag, offset);
 		}
- next:
+next:
 		/* Go to next entry in node */
 		index = ((index >> node->shift) + 1) << node->shift;
 		/* Overflow can happen when last_index is ~0UL... */
@@ -1357,8 +1310,6 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
 			node = node->parent;
 			offset = (index >> node->shift) & RADIX_TREE_MAP_MASK;
 		}
-		if (is_sibling_entry(node, node->slots[offset]))
-			goto next;
 		if (tagged >= nr_to_tag)
 			break;
 	}
@@ -1574,8 +1525,6 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
 				continue;
 			}
 			node = entry_to_node(node);
-			if (is_sibling_entry(slot, node))
-				continue;
 			slot = node;
 			break;
 		}
@@ -1750,20 +1699,6 @@ bool __radix_tree_delete_node(struct radix_tree_root *root,
 	return deleted;
 }
 
-static inline void delete_sibling_entries(struct radix_tree_node *node,
-					void *ptr, unsigned offset)
-{
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-	int i;
-	for (i = 1; offset + i < RADIX_TREE_MAP_SIZE; i++) {
-		if (node->slots[offset + i] != ptr)
-			break;
-		node->slots[offset + i] = NULL;
-		node->count--;
-	}
-#endif
-}
-
 /**
  *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
@@ -1803,7 +1738,6 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
 		node_tag_clear(root, node, tag, offset);
 
-	delete_sibling_entries(node, node_to_entry(slot), offset);
 	node->slots[offset] = NULL;
 	node->count--;
 


* [PATCH RFC 3/4] testing/radix-tree: replace multi-order with range operations
From: Konstantin Khlebnikov @ 2016-08-27 14:16 UTC
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Kirill A. Shutemov, Ross Zwisler

This patch updates the tests for multi-order operations according to the
changes in the radix tree: a huge entry now must remember its range.
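
For example, with the updated helpers below (sketch):

	/* one item covering indexes 0 .. 511 */
	assert(!item_insert_order(&tree, 0, 9));
	assert(item_lookup(&tree, 300)->end == 511);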

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 tools/testing/radix-tree/linux/bug.h      |    2 -
 tools/testing/radix-tree/linux/err.h      |   31 ++++++++
 tools/testing/radix-tree/linux/kernel.h   |    3 +
 tools/testing/radix-tree/linux/rcupdate.h |    6 ++
 tools/testing/radix-tree/multiorder.c     |  113 ++++++++++++++++-------------
 tools/testing/radix-tree/test.c           |   49 ++++++++++---
 tools/testing/radix-tree/test.h           |   15 ++--
 7 files changed, 151 insertions(+), 68 deletions(-)
 create mode 100644 tools/testing/radix-tree/linux/err.h

diff --git a/tools/testing/radix-tree/linux/bug.h b/tools/testing/radix-tree/linux/bug.h
index ccbe444977df..7a77fa971e91 100644
--- a/tools/testing/radix-tree/linux/bug.h
+++ b/tools/testing/radix-tree/linux/bug.h
@@ -1 +1 @@
-#define WARN_ON_ONCE(x)		assert(x)
+#define WARN_ON_ONCE(x)		assert(!(x))
diff --git a/tools/testing/radix-tree/linux/err.h b/tools/testing/radix-tree/linux/err.h
new file mode 100644
index 000000000000..6fd3e608d4d7
--- /dev/null
+++ b/tools/testing/radix-tree/linux/err.h
@@ -0,0 +1,31 @@
+#ifndef _LINUX_ERR_H
+#define _LINUX_ERR_H
+
+#define MAX_ERRNO       4095
+
+#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
+
+static inline void *ERR_PTR(long error)
+{
+	return (void *) error;
+}
+
+static inline long PTR_ERR(const void *ptr)
+{
+	return (long) ptr;
+}
+
+static inline bool IS_ERR(const void *ptr)
+{
+	return IS_ERR_VALUE((unsigned long)ptr);
+}
+
+static inline int PTR_ERR_OR_ZERO(const void *ptr)
+{
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+	else
+		return 0;
+}
+
+#endif /* _LINUX_ERR_H */
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index be98a47b4e1b..52714e86991b 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -32,6 +32,9 @@
 
 #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
 
+#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)
+#define is_power_of_2(n)	((n) != 0 && (((n) & ((n) - 1)) == 0))
+
 #define container_of(ptr, type, member) ({                      \
 	const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
 	(type *)( (char *)__mptr - offsetof(type, member) );})
diff --git a/tools/testing/radix-tree/linux/rcupdate.h b/tools/testing/radix-tree/linux/rcupdate.h
index f7129ea2a899..8c4ae8173778 100644
--- a/tools/testing/radix-tree/linux/rcupdate.h
+++ b/tools/testing/radix-tree/linux/rcupdate.h
@@ -3,6 +3,12 @@
 
 #include <urcu.h>
 
+/* urcu.h includes errno.h which undefines ERANGE */
+#ifndef ERANGE
+#define ERANGE 34
+#endif
+
+#define RCU_INIT_POINTER(p, v) rcu_assign_pointer(p, v)
 #define rcu_dereference_raw(p) rcu_dereference(p)
 #define rcu_dereference_protected(p, cond) rcu_dereference(p)
 
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index 39d9b9568fe2..f73a3bfa83d8 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -29,7 +29,7 @@ static void __multiorder_tag_test(int index, int order)
 	unsigned long first = 0;
 
 	/* our canonical entry */
-	base = index & ~((1 << order) - 1);
+	base = index;
 
 	printf("Multiorder tag test with index %d, canonical entry %d\n",
 			index, base);
@@ -43,7 +43,7 @@ static void __multiorder_tag_test(int index, int order)
 	 * item_insert_order().
 	 */
 	for_each_index(i, base, order) {
-		err = __radix_tree_insert(&tree, i, order,
+		err = radix_tree_insert(&tree, i,
 				(void *)(0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY));
 		assert(err == -EEXIST);
 	}
@@ -55,18 +55,18 @@ static void __multiorder_tag_test(int index, int order)
 
 	assert(radix_tree_tag_set(&tree, index, 0));
 
-	for_each_index(i, base, order) {
-		assert(radix_tree_tag_get(&tree, i, 0));
+	assert(radix_tree_tag_get(&tree, index, 0));
+
+	for_each_index(i, base, order)
 		assert(!radix_tree_tag_get(&tree, i, 1));
-	}
 
 	assert(radix_tree_range_tag_if_tagged(&tree, &first, ~0UL, 10, 0, 1) == 1);
 	assert(radix_tree_tag_clear(&tree, index, 0));
 
-	for_each_index(i, base, order) {
+	assert(radix_tree_tag_get(&tree, index, 1));
+
+	for_each_index(i, base, order)
 		assert(!radix_tree_tag_get(&tree, i, 0));
-		assert(radix_tree_tag_get(&tree, i, 1));
-	}
 
 	assert(radix_tree_tag_clear(&tree, index, 1));
 
@@ -122,7 +122,7 @@ static void multiorder_tag_tests(void)
 static void multiorder_check(unsigned long index, int order)
 {
 	unsigned long i;
-	unsigned long min = index & ~((1UL << order) - 1);
+	unsigned long min = index;
 	unsigned long max = min + (1UL << order);
 	RADIX_TREE(tree, GFP_KERNEL);
 
@@ -145,7 +145,7 @@ static void multiorder_check(unsigned long index, int order)
 		assert(radix_tree_insert(&tree, i, entry) == -EEXIST);
 	}
 
-	assert(item_delete(&tree, index) != 0);
+	assert(item_delete_order(&tree, index, order) != 0);
 
 	for (i = 0; i < 2*max; i++)
 		item_check_absent(&tree, i);
@@ -178,7 +178,7 @@ static void multiorder_shrink(unsigned long index, int order)
 	for (i = max; i < 2*max; i++)
 		item_check_absent(&tree, i);
 
-	if (!item_delete(&tree, 0)) {
+	if (!item_delete_order(&tree, 0, order)) {
 		printf("failed to delete index %ld (order %d)\n", index, order);		abort();
 	}
 
@@ -221,14 +221,12 @@ void multiorder_iteration(void)
 				break;
 
 		radix_tree_for_each_slot(slot, &tree, &iter, j) {
-			int height = order[i] / RADIX_TREE_MAP_SHIFT;
-			int shift = height * RADIX_TREE_MAP_SHIFT;
 			int mask = (1 << order[i]) - 1;
 
 			assert(iter.index >= (index[i] &~ mask));
 			assert(iter.index <= (index[i] | mask));
-			assert(iter.shift == shift);
-			i++;
+			if (iter.index == (index[i] | mask))
+				i++;
 		}
 	}
 
@@ -248,36 +246,48 @@ void multiorder_tagged_iteration(void)
 #define MT_NUM_ENTRIES 9
 	int index[MT_NUM_ENTRIES] = {0, 2, 4, 16, 32, 40, 64, 72, 128};
 	int order[MT_NUM_ENTRIES] = {1, 0, 2, 4,  3,  1,  3,  0,   7};
-
-#define TAG_ENTRIES 7
-	int tag_index[TAG_ENTRIES] = {0, 4, 16, 40, 64, 72, 128};
+	int tag[MT_NUM_ENTRIES]   = {1, 0, 1, 1,  0,  1,  1,  1,   1};
 
 	for (i = 0; i < MT_NUM_ENTRIES; i++)
 		assert(!item_insert_order(&tree, index[i], order[i]));
 
 	assert(!radix_tree_tagged(&tree, 1));
 
-	for (i = 0; i < TAG_ENTRIES; i++)
-		assert(radix_tree_tag_set(&tree, tag_index[i], 1));
+	for (i = 0; i < MT_NUM_ENTRIES; i++) {
+		unsigned long end = item_order_end(index[i], order[i]);
+
+		assert(!radix_tree_tag_get(&tree, index[i], 1));
+		assert(!radix_tree_tag_get(&tree, end, 1));
+		if (tag[i])
+			assert(radix_tree_tag_set(&tree, end, 1));
+	}
+
+	for (i = 0; i < MT_NUM_ENTRIES; i++) {
+		unsigned long end = item_order_end(index[i], order[i]);
+
+		if (tag[i])
+			assert(radix_tree_tag_get(&tree, end, 1));
+		else
+			assert(!radix_tree_tag_get(&tree, end, 1));
+	}
 
 	for (j = 0; j < 256; j++) {
-		int mask, k;
+		int k;
 
-		for (i = 0; i < TAG_ENTRIES; i++) {
-			for (k = i; index[k] < tag_index[i]; k++)
-				;
-			if (j <= (index[k] | ((1 << order[k]) - 1)))
+		for (k = 0; k < MT_NUM_ENTRIES; k++)
+			if (tag[k] && j <= item_order_end(index[k], order[k]))
 				break;
-		}
 
 		radix_tree_for_each_tagged(slot, &tree, &iter, j, 1) {
-			for (k = i; index[k] < tag_index[i]; k++)
-				;
-			mask = (1 << order[k]) - 1;
-
-			assert(iter.index >= (tag_index[i] &~ mask));
-			assert(iter.index <= (tag_index[i] | mask));
-			i++;
+			unsigned long end = item_order_end(index[k], order[k]);
+
+			assert(k < MT_NUM_ENTRIES);
+			assert(radix_tree_tag_get(&tree, iter.index, 1));
+			assert(iter.index >= index[k]);
+			assert(iter.index <= end);
+			if (iter.index == end)
+				while (++k < MT_NUM_ENTRIES && !tag[k])
+					;
 		}
 	}
 
@@ -285,33 +295,36 @@ void multiorder_tagged_iteration(void)
 					MT_NUM_ENTRIES, 1, 2);
 
 	for (j = 0; j < 256; j++) {
-		int mask, k;
+		int k;
 
-		for (i = 0; i < TAG_ENTRIES; i++) {
-			for (k = i; index[k] < tag_index[i]; k++)
-				;
-			if (j <= (index[k] | ((1 << order[k]) - 1)))
+		for (k = 0; k < MT_NUM_ENTRIES; k++)
+			if (tag[k] && j <= item_order_end(index[k], order[k]))
 				break;
-		}
 
 		radix_tree_for_each_tagged(slot, &tree, &iter, j, 2) {
-			for (k = i; index[k] < tag_index[i]; k++)
-				;
-			mask = (1 << order[k]) - 1;
-
-			assert(iter.index >= (tag_index[i] &~ mask));
-			assert(iter.index <= (tag_index[i] | mask));
-			i++;
+			unsigned long end = item_order_end(index[k], order[k]);
+
+			assert(k < MT_NUM_ENTRIES);
+			assert(radix_tree_tag_get(&tree, iter.index, 2));
+			assert(iter.index >= index[k]);
+			assert(iter.index <= end);
+			if (iter.index == end)
+				while (++k < MT_NUM_ENTRIES && !tag[k])
+					;
 		}
 	}
 
-	first = 1;
+	assert(!radix_tree_tagged(&tree, 0));
+
+	first = index[1];
 	radix_tree_range_tag_if_tagged(&tree, &first, ~0UL,
 					MT_NUM_ENTRIES, 1, 0);
-	i = 0;
+
 	radix_tree_for_each_tagged(slot, &tree, &iter, 0, 0) {
-		assert(iter.index == tag_index[i]);
-		i++;
+		assert(radix_tree_tag_get(&tree, iter.index, 0));
+		assert(iter.index >= index[2]);
+		assert(iter.index <= item_order_end(index[2], order[2]));
+		break;
 	}
 
 	item_kill_tree(&tree);
diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c
index a6e8099eaf4f..d6761d1a157a 100644
--- a/tools/testing/radix-tree/test.c
+++ b/tools/testing/radix-tree/test.c
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/bitops.h>
+#include <linux/err.h>
 
 #include "test.h"
 
@@ -24,21 +25,23 @@ int item_tag_get(struct radix_tree_root *root, unsigned long index, int tag)
 	return radix_tree_tag_get(root, index, tag);
 }
 
-int __item_insert(struct radix_tree_root *root, struct item *item,
-			unsigned order)
+int item_insert(struct radix_tree_root *root, unsigned long index)
 {
-	return __radix_tree_insert(root, item->index, order, item);
+	return radix_tree_insert(root, index, item_create(index, index));
 }
 
-int item_insert(struct radix_tree_root *root, unsigned long index)
+unsigned long item_order_end(unsigned long index, unsigned int order)
 {
-	return __item_insert(root, item_create(index), 0);
+	return index + (1ul << order) - 1;
 }
 
 int item_insert_order(struct radix_tree_root *root, unsigned long index,
-			unsigned order)
+			unsigned int order)
 {
-	return __item_insert(root, item_create(index), order);
+	unsigned long end = item_order_end(index, order);
+
+	return PTR_ERR_OR_ZERO(radix_tree_fill_range(root, index, end,
+				item_create(index, end), 0));
 }
 
 int item_delete(struct radix_tree_root *root, unsigned long index)
@@ -47,17 +50,34 @@ int item_delete(struct radix_tree_root *root, unsigned long index)
 
 	if (item) {
 		assert(item->index == index);
+		assert(item->end == index);
 		free(item);
 		return 1;
 	}
 	return 0;
 }
 
-struct item *item_create(unsigned long index)
+int item_delete_order(struct radix_tree_root *root, unsigned long index,
+			unsigned int order)
+{
+	struct item *item = radix_tree_lookup(root, index);
+	unsigned long end = item_order_end(index, order);
+
+	if (item) {
+		assert(item->index == index);
+		assert(item->end == end);
+	}
+	radix_tree_truncate_range(root, index, end);
+	free(item);
+	return !!item;
+}
+
+struct item *item_create(unsigned long index, unsigned long end)
 {
 	struct item *ret = malloc(sizeof(*ret));
 
 	ret->index = index;
+	ret->end = end;
 	return ret;
 }
 
@@ -207,10 +227,17 @@ void item_kill_tree(struct radix_tree_root *root)
 		int i;
 
 		for (i = 0; i < nfound; i++) {
-			void *ret;
+			void *item;
+
+			if (items[i]->index != items[i]->end) {
+				radix_tree_truncate_range(root, items[i]->index,
+								items[i]->end);
+				free(items[i]);
+				break;
+			}
 
-			ret = radix_tree_delete(root, items[i]->index);
-			assert(ret == items[i]);
+			item = radix_tree_delete(root, items[i]->index);
+			assert(item == items[i]);
 			free(items[i]);
 		}
 	}
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 217fb2403f09..93a6ce5e5a59 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -4,16 +4,19 @@
 #include <linux/rcupdate.h>
 
 struct item {
-	unsigned long index;
+	unsigned long index, end;
 };
 
-struct item *item_create(unsigned long index);
-int __item_insert(struct radix_tree_root *root, struct item *item,
-			unsigned order);
+struct item *item_create(unsigned long index, unsigned long end);
 int item_insert(struct radix_tree_root *root, unsigned long index);
-int item_insert_order(struct radix_tree_root *root, unsigned long index,
-			unsigned order);
 int item_delete(struct radix_tree_root *root, unsigned long index);
+
+unsigned long item_order_end(unsigned long index, unsigned int order);
+int item_insert_order(struct radix_tree_root *root, unsigned long index,
+			unsigned int order);
+int item_delete_order(struct radix_tree_root *root, unsigned long index,
+			unsigned int order);
+
 struct item *item_lookup(struct radix_tree_root *root, unsigned long index);
 
 void item_check_present(struct radix_tree_root *root, unsigned long index);


* [PATCH RFC 4/4] testing/radix-tree: benchmark for iterator
From: Konstantin Khlebnikov @ 2016-08-27 14:16 UTC
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Kirill A. Shutemov, Ross Zwisler

This adds a simple benchmark for the iterator, similar to the one I used
for commit 78c1d78488a3 ("radix-tree: introduce bit-optimized iterator").

Building with "make BENCHMARK=1" sets the radix tree order to 6 and adds
-O2; this gives performance comparable to in-kernel performance.
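
For example (assuming the default "main" target from the Makefile):

	make BENCHMARK=1 && ./main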

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 tools/testing/radix-tree/Makefile       |    6 ++
 tools/testing/radix-tree/benchmark.c    |  101 +++++++++++++++++++++++++++++++
 tools/testing/radix-tree/linux/kernel.h |    4 +
 tools/testing/radix-tree/main.c         |    2 +
 tools/testing/radix-tree/test.h         |    1 
 5 files changed, 113 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/radix-tree/benchmark.c

diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 6079ec142685..1594335d1ed6 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -4,7 +4,11 @@ LDFLAGS += -lpthread -lurcu
 TARGETS = main
 OFILES = main.o radix-tree.o linux.o test.o tag_check.o find_next_bit.o \
 	 regression1.o regression2.o regression3.o multiorder.o \
-	 iteration_check.o
+	 iteration_check.o benchmark.o
+
+ifdef BENCHMARK
+	CFLAGS += -DBENCHMARK=1 -O2
+endif
 
 targets: $(TARGETS)
 
diff --git a/tools/testing/radix-tree/benchmark.c b/tools/testing/radix-tree/benchmark.c
new file mode 100644
index 000000000000..05d46071bf37
--- /dev/null
+++ b/tools/testing/radix-tree/benchmark.c
@@ -0,0 +1,101 @@
+/*
+ * benchmark.c:
+ * Author: Konstantin Khlebnikov <koct9i@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/radix-tree.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <time.h>
+#include "test.h"
+
+#define NSEC_PER_SEC	1000000000L
+
+static long long benchmark_iter(struct radix_tree_root *root, bool tagged)
+{
+	volatile unsigned long sink = 0;
+	struct radix_tree_iter iter;
+	struct timespec start, finish;
+	long long nsec;
+	int l, loops = 1;
+	void **slot;
+
+#ifdef BENCHMARK
+again:
+#endif
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	for (l = 0; l < loops; l++) {
+		if (tagged) {
+			radix_tree_for_each_tagged(slot, root, &iter, 0, 0)
+				sink ^= (unsigned long)slot;
+		} else {
+			radix_tree_for_each_slot(slot, root, &iter, 0)
+				sink ^= (unsigned long)slot;
+		}
+	}
+	clock_gettime(CLOCK_MONOTONIC, &finish);
+
+	nsec = (finish.tv_sec - start.tv_sec) * NSEC_PER_SEC +
+	       (finish.tv_nsec - start.tv_nsec);
+
+#ifdef BENCHMARK
+	if (loops == 1 && nsec * 5 < NSEC_PER_SEC) {
+		loops = NSEC_PER_SEC / nsec / 4 + 1;
+		goto again;
+	}
+#endif
+
+	nsec /= loops;
+	return nsec;
+}
+
+static void benchmark_size(unsigned long size, unsigned long step, int order)
+{
+	RADIX_TREE(tree, GFP_KERNEL);
+	long long normal, tagged;
+	unsigned long index;
+
+	for (index = 0 ; index < size ; index += step) {
+		item_insert_order(&tree, index, order);
+		radix_tree_tag_set(&tree, item_order_end(index, order), 0);
+	}
+
+	tagged = benchmark_iter(&tree, true);
+	normal = benchmark_iter(&tree, false);
+
+	printf("Size %ld, step %6ld, order %d tagged %10lld ns, normal %10lld ns\n",
+		size, step, order, tagged, normal);
+
+	item_kill_tree(&tree);
+}
+
+void benchmark(void)
+{
+	unsigned long size[] = {1 << 10, 1 << 20,
+#ifdef BENCHMARK
+		1 << 27,
+#endif
+		0};
+	unsigned long step[] = {1, 2, 7, 15, 63, 64, 65,
+				128, 256, 512, 12345, 0};
+	int c, s;
+
+	printf("starting benchmarks\n");
+	printf("RADIX_TREE_MAP_SHIFT = %d\n", RADIX_TREE_MAP_SHIFT);
+
+	for (c = 0; size[c]; c++)
+		for (s = 0; step[s]; s++)
+			benchmark_size(size[c], step[s], 0);
+
+	for (c = 0; size[c]; c++)
+		for (s = 0; step[s]; s++)
+			benchmark_size(size[c], step[s] << 9, 9);
+}
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index 52714e86991b..ddabc495423f 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -10,7 +10,11 @@
 #include "../../include/linux/compiler.h"
 #include "../../../include/linux/kconfig.h"
 
+#ifdef BENCHMARK
+#define RADIX_TREE_MAP_SHIFT	6
+#else
 #define RADIX_TREE_MAP_SHIFT	3
+#endif
 
 #ifndef NULL
 #define NULL	0
diff --git a/tools/testing/radix-tree/main.c b/tools/testing/radix-tree/main.c
index daa9010693e8..6e9a02cb7b97 100644
--- a/tools/testing/radix-tree/main.c
+++ b/tools/testing/radix-tree/main.c
@@ -335,6 +335,8 @@ int main(int argc, char **argv)
 	iteration_test();
 	single_thread_tests(long_run);
 
+	benchmark();
+
 	sleep(1);
 	printf("after sleep(1): %d allocated\n", nr_allocated);
 	rcu_unregister_thread();
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index 93a6ce5e5a59..2529a622f1f2 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -31,6 +31,7 @@ void item_kill_tree(struct radix_tree_root *root);
 void tag_check(void);
 void multiorder_checks(void);
 void iteration_test(void);
+void benchmark(void);
 
 struct item *
 item_tag_set(struct radix_tree_root *root, unsigned long index, int tag);


* RE: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
       [not found] ` <20160827180927.GA22919@linux.intel.com>
@ 2016-08-29 15:21   ` Matthew Wilcox
  2016-08-29 16:14     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2016-08-29 15:21 UTC
  To: Konstantin Khlebnikov; +Cc: Ross Zwisler, linux-kernel


Thanks, Ross.

Konstantin, I think there are problems with the concept behind this series.  You have multiple entries in the tree with the same value.  That works out fine when the entry is a pointer (eg to a struct page), but not so well when it's an exceptional entry (eg a swap cache entry or a DAX radix tree entry).  If you look at the recent DAX work, you'll see there's a lock bit, and having multiple lock bits is a recipe for disaster.

But I did notice that we have a missing test in the test-suite; one that checks whether replace_slot actually replaces all of the parts of a multiorder entry.  See attachment.

-----Original Message-----
From: Ross Zwisler [mailto:ross.zwisler@linux.intel.com] 
Sent: Saturday, August 27, 2016 2:09 PM
To: Matthew Wilcox <mawilcox@microsoft.com>
Subject: Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range

Hey Matthew,

Just wanted to make sure that you saw this series.

- Ross

On Sat, Aug 27, 2016 at 05:14:34PM +0300, Konstantin Khlebnikov wrote:
> This patch adds function for filling and truncating ranges of slots:
> 
> radix_tree_node *radix_tree_fill_range(root, start, end, item, flags)
> 
> It fills slots in range "begin".."end" with "item" and returns pointer 
> to the last filled node. Filling with NULL truncates range.
> 
> This is intended for managing transparent huge pages in page cache 
> where all entries are aligned but this function can handle arbitrary 
> unaligned ranges. Might be useful for PAT or VMA-like extent trees.
> 
> By default filling range constructs shallow tree: entries are assigned 
> directly inner slots if possible. In worst case any range requires 
> only
> 2 * RADIX_TREE_MAX_PATH nodes. If length is power of two and start 
> index is aligned then all slots are always in single node and requires 
> at most RADIX_TREE_MAX_PATH nodes.
> 
> Function accepts several flags:
> 
> RADIX_TREE_FILL_LEAVES  - build deep tree, insert entry into leaves.
> 
> RADIX_TREE_FILL_OVERWRITE - overwrite instead of failing with -EEXIST.
> 
> RADIX_TREE_FILL_ATOMIC - play well with concurrent RCU-protected lookup:
> fill new nodes with RADIX_TREE_RETRY before inserting them into the tree.
> At following iterations these slots are filled with @item or sub-nodes.
> 
> RADIX_TREE_FILL_CLEAR_TAGS - also clears all tags.
> 
> radix_tree_fill_range() returns pointer to the node which holds the 
> last slot in range, NULL if this is root slot, or ERR_PTR in case of error.
> 
> Thus, radix_tree_fill_range() can handle all operations required for THP:
> 
> * Insert
> Fill range with pointer to head page.
> 
> radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
> 		      RADIX_TREE_FILL_ATOMIC)
> 
> * Remove
> Fill range with NULL or shadow entry, returned value will be used for 
> linking completely shadow nodes into slab shrinker.
> 
> radix_tree_fill_range(root, index, index + nr_pages - 1, NULL,
> 		      RADIX_TREE_FILL_OVERWRITE)
> 
> * Merge
> Fill range with overwrite to replace 0-order pages with THP.
> 
> radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
> 		      RADIX_TREE_FILL_OVERWRITE | RADIX_TREE_FILL_ATOMIC)
> 
> * Split
> Two passes: first fill leaves with head_page entry and then replace 
> each slot with pointer to individual tail page. This could be done in 
> single pass but makes radix_tree_fill_range much more complicated.
> 
> radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
> 		      RADIX_TREE_FILL_LEAVES | RADIX_TREE_FILL_OVERWRITE |
> 		      RADIX_TREE_FILL_ATOMIC);
> radix_tree_for_each_slot(...)
> 	radix_tree_replace_slot(slot, head + iter.index - head->index);
> 
> 
> Page lookup and iterator will return pointer to head page for any index.
> 
> 
> Code inside iterator loop could detect huge entry, handle all 
> sub-pages and jump to next index using new helper function radix_tree_iter_jump():
> 
> slot = radix_tree_iter_jump(&iter, page->index + 
> hpage_nr_pages(page));
> 
> This helper has builtin protection against overflows: jump to index = 
> 0 stops iterator. This uses existing logic in radix_tree_next_chunk():
> if iter.next_index is zero then iter.index must be zero too.
> 
> 
> Tags should be set only for last index of THP range: this way iterator 
> will find them regardless of starting index.
> 
> radix_tree_preload_range() pre-allocates nodes for filling range.
> 
> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
> ---
>  include/linux/radix-tree.h |   46 ++++++++
>  lib/radix-tree.c           |  245 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 291 insertions(+)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h 
> index 4613bf35c311..af33e8d93ec3 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -319,6 +319,35 @@ static inline void radix_tree_preload_end(void)
>  	preempt_enable();
>  }
>  
> +#define RADIX_TREE_FILL_LEAVES		1 /* build full depth tree */
> +#define RADIX_TREE_FILL_OVERWRITE	2 /* overwrite non-empty slots */
> +#define RADIX_TREE_FILL_CLEAR_TAGS	4 /* clear all tags */
> +#define RADIX_TREE_FILL_ATOMIC		8 /* play well with rcu lookup */
> +
> +struct radix_tree_node *
> +radix_tree_fill_range(struct radix_tree_root *root, unsigned long start,
> +		      unsigned long end, void *item, unsigned int flags);
> +
> +int radix_tree_preload_range(gfp_t gfp_mask, unsigned long start,
> +			     unsigned long end, unsigned int flags);
> +
> +/**
> + * radix_tree_truncate_range  - remove everything in range
> + * @root:	radix tree root
> + * @start:	first index
> + * @end:	last index
> + *
> + * This function removes all items and tags within given range.
> + */
> +static inline void
> +radix_tree_truncate_range(struct radix_tree_root *root,
> +			  unsigned long start, unsigned long end)
> +{
> +	radix_tree_fill_range(root, start, end, NULL,
> +			      RADIX_TREE_FILL_OVERWRITE |
> +			      RADIX_TREE_FILL_CLEAR_TAGS);
> +}
> +
>  /**
>   * struct radix_tree_iter - radix tree iterator state
>   *
> @@ -435,6 +464,23 @@ void **radix_tree_iter_next(struct radix_tree_iter *iter)
>  }
>  
>  /**
> + * radix_tree_iter_jump - restart iteration from the given index if it is non-zero
> + * @iter:	iterator state
> + * @index:	next index
> + *
> + * If index is zero then the iterator will stop. This protects against an
> + * endless loop when the index overflows after visiting the last entry.
> + */
> +static inline __must_check
> +void **radix_tree_iter_jump(struct radix_tree_iter *iter, unsigned long index)
> +{
> +	iter->index = index - 1;
> +	iter->next_index = index;
> +	iter->tags = 0;
> +	return NULL;
> +}
> +
> +/**
>   * radix_tree_chunk_size - get current chunk size
>   *
>   * @iter:	pointer to radix tree iterator
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 1b7bf7314141..c46a60065a77 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -36,6 +36,7 @@
>  #include <linux/bitops.h>
>  #include <linux/rcupdate.h>
>  #include <linux/preempt.h>		/* in_interrupt() */
> +#include <linux/err.h>
>  
>  
>  /* Number of nodes in fully populated tree of given height */
> @@ -1014,6 +1015,250 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
>  EXPORT_SYMBOL(radix_tree_next_chunk);
>  
>  /**
> + * radix_tree_preload_range  - preload nodes for filling range.
> + * @gfp_mask:	allocation mask
> + * @start:	first index
> + * @end:	last index
> + * @flags:	RADIX_TREE_FILL_*
> + */
> +int radix_tree_preload_range(gfp_t gfp_mask, unsigned long start,
> +			     unsigned long end, unsigned int flags)
> +{
> +	unsigned long length = end - start + 1;
> +	int nr_nodes, shift;
> +
> +	/* Preloading doesn't help anything with this gfp mask, skip it */
> +	if (!gfpflags_allow_blocking(gfp_mask)) {
> +		preempt_disable();
> +		return 0;
> +	}
> +
> +	/*
> +	 * For filling leaves the tree must cover all indexes in the range
> +	 * at all levels, plus RADIX_TREE_MAX_PATH nodes are required for
> +	 * growing the tree depth, and only the root node is shared for sure.
> +	 *
> +	 * For an aligned range we need RADIX_TREE_MAX_PATH nodes for growing
> +	 * the depth and RADIX_TREE_MAX_PATH for the path where all slots
> +	 * will be.
> +	 *
> +	 * For an arbitrary range we again need RADIX_TREE_MAX_PATH for
> +	 * growing the depth and two RADIX_TREE_MAX_PATH chains for
> +	 * constructing an arc of slots from leaf to root and back. Only the
> +	 * root node is shared.
> +	 */
> +	if (flags & RADIX_TREE_FILL_LEAVES) {
> +		if (start > end)
> +			return -EINVAL;
> +		shift = 0;
> +		nr_nodes = RADIX_TREE_MAX_PATH - 1;
> +		do {
> +			shift += RADIX_TREE_MAP_SHIFT;
> +			nr_nodes += (end >> shift) - (start >> shift) + 1;
> +		} while (shift < RADIX_TREE_INDEX_BITS);
> +	} else if (is_power_of_2(length) && IS_ALIGNED(start, length))
> +		nr_nodes = RADIX_TREE_MAX_PATH * 2 - 1;
> +	else
> +		nr_nodes = RADIX_TREE_MAX_PATH * 3 - 2;
> +	return __radix_tree_preload(gfp_mask, nr_nodes);
> +}
> +EXPORT_SYMBOL(radix_tree_preload_range);
> +
> +/**
> + * radix_tree_fill_range - fill range of slots
> + * @root:	radix tree root
> + * @start:	first index
> + * @end:	last index
> + * @item:	value for filling, NULL for removing
> + * @flags:	RADIX_TREE_FILL_* flags
> + * Returns:	pointer to the last node or NULL, ERR_PTR on errors
> + *
> + * By default builds a shallow tree: assigns the entry to inner slots if
> + * possible. In the worst case a range requires up to
> + * 2 * RADIX_TREE_MAX_PATH nodes plus RADIX_TREE_MAX_PATH for extending
> + * the tree depth.
> + *
> + * If length is 2^n and start is aligned to it then all slots are in one node.
> + *
> + * This function cannot fill or cut part of a bigger range if that requires
> + * splitting inner slots and inserting new nodes: it fails with -ERANGE.
> + *
> + * With flag RADIX_TREE_FILL_LEAVES builds a deep tree and inserts @item into
> + * leaf slots. This requires many more nodes.
> + *
> + * With flag RADIX_TREE_FILL_OVERWRITE removes everything in the range and
> + * cuts the sub-tree if @item is NULL. Without that flag the function undoes
> + * all changes and fails with -EEXIST if it finds any populated slot.
> + *
> + * With flag RADIX_TREE_FILL_ATOMIC the function plays well with
> + * rcu-protected lookups: it fills new nodes with RADIX_TREE_RETRY before
> + * inserting them into the tree: a lookup will see either the old entry,
> + * @item or a retry entry. At following iterations these slots are filled
> + * with @item or sub-nodes.
> + *
> + * With flag RADIX_TREE_FILL_CLEAR_TAGS also clears all tags.
> + *
> + * Function returns a pointer to the node which holds the last slot in the
> + * range, NULL if that was the root slot, or ERR_PTR: -ENOMEM, -EEXIST,
> + * -ERANGE.
> + */
> +struct radix_tree_node *
> +radix_tree_fill_range(struct radix_tree_root *root, unsigned long start,
> +		      unsigned long end, void *item, unsigned int flags)
> +{
> +	unsigned long index = start, maxindex;
> +	struct radix_tree_node *node, *child;
> +	int error, root_shift, shift, tag, offset;
> +	void *entry;
> +
> +	/* Sanity check */
> +	if (start > end)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* Make sure the tree is high enough.  */
> +	root_shift = radix_tree_load_root(root, &node, &maxindex);
> +	if (end > maxindex) {
> +		error = radix_tree_extend(root, end, root_shift);
> +		if (error < 0)
> +			return ERR_PTR(error);
> +		root_shift = error;
> +	}
> +
> +	/* Special case: single slot tree */
> +	if (!root_shift) {
> +		if (node && (!(flags & RADIX_TREE_FILL_OVERWRITE)))
> +			return ERR_PTR(-EEXIST);
> +		if (flags & RADIX_TREE_FILL_CLEAR_TAGS)
> +			root_tag_clear_all(root);
> +		rcu_assign_pointer(root->rnode, item);
> +		return NULL;
> +	}
> +
> +next_node:
> +	node = NULL;
> +	offset = 0;
> +	entry = rcu_dereference_raw(root->rnode);
> +	shift = root_shift;
> +
> +	/* Descend to the index. Do at least one step. */
> +	do {
> +		child = entry_to_node(entry);
> +		shift -= RADIX_TREE_MAP_SHIFT;
> +		if (!child || !radix_tree_is_internal_node(entry)) {
> +			/* Entry wider than range */
> +			if (child) {
> +				error = -ERANGE;
> +				goto undo;
> +			}
> +			/* Hole wider than the truncated range */
> +			if (!item)
> +				goto skip_node;
> +			child = radix_tree_node_alloc(root);
> +			if (!child) {
> +				error = -ENOMEM;
> +				goto undo;
> +			}
> +			child->shift = shift;
> +			child->offset = offset;
> +			child->parent = node;
> +			/* Populate range with retry entries. */
> +			if (flags & RADIX_TREE_FILL_ATOMIC) {
> +				int idx = (index >> shift) &
> +					   RADIX_TREE_MAP_MASK;
> +				int last = RADIX_TREE_MAP_SIZE - 1;
> +
> +				if (end < (index | shift_maxindex(shift)))
> +					last = (end >> shift) &
> +						RADIX_TREE_MAP_MASK;
> +				for (; idx <= last; idx++)
> +					child->slots[idx] = RADIX_TREE_RETRY;
> +			}
> +			entry = node_to_entry(child);
> +			if (node) {
> +				rcu_assign_pointer(node->slots[offset], entry);
> +				node->count++;
> +			} else
> +				rcu_assign_pointer(root->rnode, entry);
> +		}
> +		node = child;
> +		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
> +		entry = rcu_dereference_raw(node->slots[offset]);
> +
> +		/* Stop if find leaf or slot inside range */
> +	} while ((flags & RADIX_TREE_FILL_LEAVES) ? shift :
> +			((index & ((1ul << shift) - 1)) ||
> +			 (index | ((1ul << shift) - 1)) > end));
> +
> +next_slot:
> +	/* NULL or retry entry */
> +	if (entry <= RADIX_TREE_RETRY)
> +		goto fill;
> +
> +	if (!(flags & RADIX_TREE_FILL_OVERWRITE)) {
> +		error = -EEXIST;
> +		goto undo;
> +	}
> +
> +	/* Cut sub-tree */
> +	if (unlikely(radix_tree_is_internal_node(entry))) {
> +		rcu_assign_pointer(node->slots[offset], item);
> +		child = entry_to_node(entry);
> +		offset = 0;
> +		do {
> +			entry = rcu_dereference_raw(child->slots[offset]);
> +			if (entry)
> +				child->count--;
> +			if (radix_tree_is_internal_node(entry)) {
> +				child = entry_to_node(entry);
> +				offset = 0;
> +			} else if (++offset == RADIX_TREE_MAP_SIZE) {
> +				offset = child->offset;
> +				entry = child->parent;
> +				WARN_ON_ONCE(child->count);
> +				radix_tree_node_free(child);
> +				child = entry;
> +			}
> +		} while (child != node);
> +	}
> +
> +	if (flags & RADIX_TREE_FILL_CLEAR_TAGS) {
> +		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
> +			node_tag_clear(root, node, tag, offset);
> +	}
> +
> +	/* Skip the rest of the node if we cleared its last slot and it was freed */
> +	if (!--node->count && !item && __radix_tree_delete_node(root, node))
> +		goto skip_node;
> +
> +fill:
> +	rcu_assign_pointer(node->slots[offset], item);
> +	if (item)
> +		node->count++;
> +
> +	index += 1ul << shift;
> +	if (index - 1 == end)
> +		return node;
> +
> +	/* Next slot in this node and still in range */
> +	if (index + (1ul << shift) - 1 <= end &&
> +			++offset < RADIX_TREE_MAP_SIZE) {
> +		entry = rcu_dereference_raw(node->slots[offset]);
> +		goto next_slot;
> +	}
> +
> +	goto next_node;
> +
> +skip_node:
> +	index |= shift_maxindex(shift);
> +	if (index++ >= end)
> +		return node;
> +	goto next_node;
> +
> +undo:
> +	if (index > start)
> +		radix_tree_fill_range(root, start, index - 1, NULL,
> +				      RADIX_TREE_FILL_OVERWRITE);
> +	return ERR_PTR(error);
> +}
> +EXPORT_SYMBOL(radix_tree_fill_range);
> +
> +/**
>   * radix_tree_range_tag_if_tagged - for each item in given range set given
>   *				   tag if item has another tag set
>   * @root:		radix tree root
> 

[-- Attachment #2: rtts.replace-slot.diff --]
[-- Type: application/octet-stream, Size: 1971 bytes --]

diff --git a/tools/testing/radix-tree/linux/types.h b/tools/testing/radix-tree/linux/types.h
index faa0b6f..8491d89 100644
--- a/tools/testing/radix-tree/linux/types.h
+++ b/tools/testing/radix-tree/linux/types.h
@@ -6,8 +6,6 @@
 #define __rcu
 #define __read_mostly
 
-#define BITS_PER_LONG (sizeof(long) * 8)
-
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 	list->next = list;
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index f14252d..2dea001 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -123,6 +123,9 @@ static void multiorder_check(unsigned long index, int order)
 	unsigned long i;
 	unsigned long min = index & ~((1UL << order) - 1);
 	unsigned long max = min + (1UL << order);
+	void **slot;
+	static void *entry = (void *)(0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY);
+	struct item *item;
 	RADIX_TREE(tree, GFP_KERNEL);
 
 	printf("Multiorder index %ld, order %d\n", index, order);
@@ -130,7 +133,7 @@ static void multiorder_check(unsigned long index, int order)
 	assert(item_insert_order(&tree, index, order) == 0);
 
 	for (i = min; i < max; i++) {
-		struct item *item = item_lookup(&tree, i);
+		item = item_lookup(&tree, i);
 		assert(item != 0);
 		assert(item->index == index);
 	}
@@ -138,11 +141,16 @@ static void multiorder_check(unsigned long index, int order)
 		item_check_absent(&tree, i);
 	for (i = max; i < 2*max; i++)
 		item_check_absent(&tree, i);
-	for (i = min; i < max; i++) {
-		static void *entry = (void *)
-					(0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY);
+	for (i = min; i < max; i++)
 		assert(radix_tree_insert(&tree, i, entry) == -EEXIST);
+
+	slot = radix_tree_lookup_slot(&tree, index);
+	radix_tree_replace_slot(slot, entry);
+	for (i = min; i < max; i++) {
+		struct item *item2 = item_lookup(&tree, i);
+		assert(item2 == entry);
 	}
+	radix_tree_replace_slot(slot, item);
 
 	assert(item_delete(&tree, index) != 0);
 


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-29 15:21   ` [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range Matthew Wilcox
@ 2016-08-29 16:14     ` Konstantin Khlebnikov
  2016-08-29 18:08       ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Konstantin Khlebnikov @ 2016-08-29 16:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ross Zwisler, linux-kernel

On Mon, Aug 29, 2016 at 6:21 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> Thanks, Ross.
>
> Konstantin, I think there are problems with the concept behind this series.  You have multiple entries in the tree with the same value.  That works out fine when the entry is a pointer (eg to a struct page), but not so well when it's an exceptional entry (eg a swap cache entry or a DAX radix tree entry).  If you look at the recent DAX work, you'll see there's a lock bit, and having multiple lock bits is a recipe for disaster.
>

I see no problem here. They could use the lock bit at the first or last entry.
Anyway, all changes should be protected by the lock at the mapping.


> But I did notice that we have a missing test in the test-suite; one that checks whether replace_slot actually replaces all of the parts of a multiorder entry.  See attachment.
>
> -----Original Message-----
> From: Ross Zwisler [mailto:ross.zwisler@linux.intel.com]
> Sent: Saturday, August 27, 2016 2:09 PM
> To: Matthew Wilcox <mawilcox@microsoft.com>
> Subject: Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
>
> Hey Matthew,
>
> Just wanted to make sure that you saw this series.
>
> - Ross
>
> On Sat, Aug 27, 2016 at 05:14:34PM +0300, Konstantin Khlebnikov wrote:
>> This patch adds function for filling and truncating ranges of slots:
>>
>> radix_tree_node *radix_tree_fill_range(root, start, end, item, flags)
>>
>> It fills slots in range "begin".."end" with "item" and returns pointer
>> to the last filled node. Filling with NULL truncates range.
>>
>> This is intended for managing transparent huge pages in page cache
>> where all entries are aligned but this function can handle arbitrary
>> unaligned ranges. Might be useful for PAT or VMA-like extent trees.
>>
>> By default filling range constructs shallow tree: entries are assigned
>> directly inner slots if possible. In worst case any range requires
>> only
>> 2 * RADIX_TREE_MAX_PATH nodes. If length is power of two and start
>> index is aligned then all slots are always in single node and requires
>> at most RADIX_TREE_MAX_PATH nodes.
>>
>> Function accepts several flags:
>>
>> RADIX_TREE_FILL_LEAVES  - build deep tree, insert entry into leaves.
>>
>> RADIX_TREE_FILL_OVERWRITE - overwrite instead of failing with -EEXIST.
>>
>> RADIX_TREE_FILL_ATOMIC - play well with concurrent RCU-protected lookup:
>> fill new nodes with RADIX_TREE_RETRY before inserting them into the tree.
>> At following iterations these slots are filled with @item or sub-nodes.
>>
>> RADIX_TREE_FILL_CLEAR_TAGS - also clears all tags.
>>
>> radix_tree_fill_range() returns pointer to the node which holds the
>> last slot in range, NULL if this is root slot, or ERR_PTR in case of error.
>>
>> Thus, radix_tree_fill_range() can handle all operations required for THP:
>>
>> * Insert
>> Fill range with pointer to head page.
>>
>> radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
>>                     RADIX_TREE_FILL_ATOMIC)
>>
>> * Remove
>> Fill range with NULL or shadow entry, returned value will be used for
>> linking completely shadow nodes into slab shrinker.
>>
>> radix_tree_fill_range(root, index, index + nr_pages - 1, NULL,
>>                     RADIX_TREE_FILL_OVERWRITE)
>>
>> * Merge
>> Fill range with overwrite to replace 0-order pages with THP.
>>
>> radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
>>                     RADIX_TREE_FILL_OVERWRITE | RADIX_TREE_FILL_ATOMIC)
>>
>> * Split
>> Two passes: first fill leaves with head_page entry and then replace
>> each slot with pointer to individual tail page. This could be done in
>> single pass but makes radix_tree_fill_range much more complicated.
>>
>> radix_tree_fill_range(root, index, index + nr_pages - 1, head_page,
>>                     RADIX_TREE_FILL_LEAVES | RADIX_TREE_FILL_OVERWRITE |
>>                     RADIX_TREE_FILL_ATOMIC);
>> radix_tree_for_each_slot(...)
>>       radix_tree_replace_slot(slot, head + iter.index - head->index);
>>
>>
>> Page lookup and iterator will return pointer to head page for any index.
>>
>>
>> Code inside iterator loop could detect huge entry, handle all
>> sub-pages and jump to next index using new helper function radix_tree_iter_jump():
>>
>> slot = radix_tree_iter_jump(&iter, page->index +
>> hpage_nr_pages(page));
>>
>> This helper has builtin protection against overflows: jump to index =
>> 0 stops iterator. This uses existing logic in radix_tree_next_chunk():
>> if iter.next_index is zero then iter.index must be zero too.
>>
>>
>> Tags should be set only for last index of THP range: this way iterator
>> will find them regardless of starting index.
>>
>> radix_tree_preload_range() pre-allocates nodes for filling range.
>>
>> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
>> ---
>>  include/linux/radix-tree.h |   46 ++++++++
>>  lib/radix-tree.c           |  245 ++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 291 insertions(+)
>>
>> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
>> index 4613bf35c311..af33e8d93ec3 100644
>> --- a/include/linux/radix-tree.h
>> +++ b/include/linux/radix-tree.h
>> @@ -319,6 +319,35 @@ static inline void radix_tree_preload_end(void)
>>       preempt_enable();
>>  }
>>
>> +#define RADIX_TREE_FILL_LEAVES               1 /* build full depth tree */
>> +#define RADIX_TREE_FILL_OVERWRITE    2 /* overwrite non-empty slots */
>> +#define RADIX_TREE_FILL_CLEAR_TAGS   4 /* clear all tags */
>> +#define RADIX_TREE_FILL_ATOMIC               8 /* play well with rcu lookup */
>> +
>> +struct radix_tree_node *
>> +radix_tree_fill_range(struct radix_tree_root *root, unsigned long start,
>> +                   unsigned long end, void *item, unsigned int flags);
>> +
>> +int radix_tree_preload_range(gfp_t gfp_mask, unsigned long start,
>> +                          unsigned long end, unsigned int flags);
>> +
>> +/**
>> + * radix_tree_truncate_range  - remove everything in range
>> + * @root:    radix tree root
>> + * @start:   first index
>> + * @end:     last index
>> + *
>> + * This function removes all items and tags within given range.
>> + */
>> +static inline void
>> +radix_tree_truncate_range(struct radix_tree_root *root,
>> +                       unsigned long start, unsigned long end) {
>> +     radix_tree_fill_range(root, start, end, NULL,
>> +                           RADIX_TREE_FILL_OVERWRITE |
>> +                           RADIX_TREE_FILL_CLEAR_TAGS); }
>> +
>>  /**
>>   * struct radix_tree_iter - radix tree iterator state
>>   *
>> @@ -435,6 +464,23 @@ void **radix_tree_iter_next(struct
>> radix_tree_iter *iter)  }
>>
>>  /**
>> + * radix_tree_iter_jump - restart iterating from given index if it non-zero
>> + * @iter:    iterator state
>> + * @index:   next index
>> + *
>> + * If index is zero when iterator will stop. This protects from
>> +endless loop
>> + * when index overflows after visiting last entry.
>> + */
>> +static inline __must_check
>> +void **radix_tree_iter_jump(struct radix_tree_iter *iter, unsigned
>> +long index) {
>> +     iter->index = index - 1;
>> +     iter->next_index = index;
>> +     iter->tags = 0;
>> +     return NULL;
>> +}
>> +
>> +/**
>>   * radix_tree_chunk_size - get current chunk size
>>   *
>>   * @iter:    pointer to radix tree iterator
>> diff --git a/lib/radix-tree.c b/lib/radix-tree.c index
>> 1b7bf7314141..c46a60065a77 100644
>> --- a/lib/radix-tree.c
>> +++ b/lib/radix-tree.c
>> @@ -36,6 +36,7 @@
>>  #include <linux/bitops.h>
>>  #include <linux/rcupdate.h>
>>  #include <linux/preempt.h>           /* in_interrupt() */
>> +#include <linux/err.h>
>>
>>
>>  /* Number of nodes in fully populated tree of given height */ @@
>> -1014,6 +1015,250 @@ void **radix_tree_next_chunk(struct
>> radix_tree_root *root,  EXPORT_SYMBOL(radix_tree_next_chunk);
>>
>>  /**
>> + * radix_tree_preload_range  - preload nodes for filling range.
>> + * @gfp_mask:
>> + * @start:   first index
>> + * @end:     last index
>> + * @flags:   RADIX_TREE_FILL_*
>> + */
>> +int radix_tree_preload_range(gfp_t gfp_mask, unsigned long start,
>> +                          unsigned long end, unsigned int flags) {
>> +     unsigned long length = end - start + 1;
>> +     int nr_nodes, shift;
>> +
>> +     /* Preloading doesn't help anything with this gfp mask, skip it */
>> +     if (!gfpflags_allow_blocking(gfp_mask)) {
>> +             preempt_disable();
>> +             return 0;
>> +     }
>> +
>> +     /*
>> +      * For filling leaves tree must cover all indexes in range at all
>> +      * levels plus RADIX_TREE_MAX_PATH required for growing tree depth
>> +      * and only root node is shared for sure.
>> +      *
>> +      * If for aligned range we need RADIX_TREE_MAX_PATH for growing depth
>> +      * and RADIX_TREE_MAX_PATH for path where all slots will be.
>> +      *
>> +      * For arbitrary range we need again RADIX_TREE_MAX_PATH for growing
>> +      * depth and two RADIX_TREE_MAX_PATH chains for constructing arc of
>> +      * slots from leaf to root and back. Only root node is shared.
>> +      */
>> +     if (flags & RADIX_TREE_FILL_LEAVES) {
>> +             if (start > end)
>> +                     return -EINVAL;
>> +             shift = 0;
>> +             nr_nodes = RADIX_TREE_MAX_PATH - 1;
>> +             do {
>> +                     shift += RADIX_TREE_MAP_SHIFT;
>> +                     nr_nodes += (end >> shift) - (start >> shift) + 1;
>> +             } while (shift < RADIX_TREE_INDEX_BITS);
>> +     } else if (is_power_of_2(length) && IS_ALIGNED(start, length))
>> +             nr_nodes = RADIX_TREE_MAX_PATH * 2 - 1;
>> +     else
>> +             nr_nodes = RADIX_TREE_MAX_PATH * 3 - 2;
>> +     return __radix_tree_preload(gfp_mask, nr_nodes); }
>> +EXPORT_SYMBOL(radix_tree_preload_range);
>> +
>> +/**
>> + * radix_tree_fill_range - fill range of slots
>> + * @root:    radix tree root
>> + * @start:   first index
>> + * @end:     last index
>> + * @item:    value for filling, NULL for removing
>> + * @flags:   RADIX_TREE_FILL_* flags
>> + * Returns:  pointer last node or NULL, ERR_PTR for errors
>> + *
>> + * By default builds shallow tree: assign entry to inner slots if possible.
>> + * In wost case range requires up to 2 * RADIX_TREE_MAX_PATH nodes
>> +plus
>> + * RADIX_TREE_MAX_PATH for extending tree depth.
>> + *
>> + * If length is 2^n and start aligned to it then all slots are in one node.
>> + *
>> + * This function cannot fill or cut part of bugger range if this
>> +require
>> + * spltting inner slots and insering new nodes: fails with -ERANGE.
>> + *
>> + * With flag RADIX_TREE_FILL_LEAVES builds deep tree and insert @item
>> +into
>> + * leaf slots. This requires much more nodes.
>> + *
>> + * With flag RADIX_TREE_FILL_OVERWRITE removes everything in range
>> +and cut
>> + * sub-tree if @item is NULL. Without that flag function undo all
>> +chandges
>> + * and fails with code -EEXIST if finds any populated slot.
>> + *
>> + * With flag RADIX_TREE_FILL_ATOMIC function plays well with
>> +rcu-protected
>> + * lookups: it fills new nodes with RADIX_TREE_RETRY before inserting
>> +them
>> + * into the tree: lookup will see either old entry, @item or retry entry.
>> + * At following iterations these slots are filled with @item or sub-nodes.
>> + *
>> + * With flag RADIX_TREE_FILL_CLEAR_TAGS also clears all tags.
>> + *
>> + * Function returns pointer to node which holds the last slot in
>> +range,
>> + * NULL if that was root slot, or ERR_PTR: -ENOMEM, -EEXIST, -ERANGE.
>> + */
>> +struct radix_tree_node *
>> +radix_tree_fill_range(struct radix_tree_root *root, unsigned long start,
>> +                   unsigned long end, void *item, unsigned int flags) {
>> +     unsigned long index = start, maxindex;
>> +     struct radix_tree_node *node, *child;
>> +     int error, root_shift, shift, tag, offset;
>> +     void *entry;
>> +
>> +     /* Sanity check */
>> +     if (start > end)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     /* Make sure the tree is high enough.  */
>> +     root_shift = radix_tree_load_root(root, &node, &maxindex);
>> +     if (end > maxindex) {
>> +             error = radix_tree_extend(root, end, root_shift);
>> +             if (error < 0)
>> +                     return ERR_PTR(error);
>> +             root_shift = error;
>> +     }
>> +
>> +     /* Special case: single slot tree */
>> +     if (!root_shift) {
>> +             if (node && (!(flags & RADIX_TREE_FILL_OVERWRITE)))
>> +                     return ERR_PTR(-EEXIST);
>> +             if (flags & RADIX_TREE_FILL_CLEAR_TAGS)
>> +                     root_tag_clear_all(root);
>> +             rcu_assign_pointer(root->rnode, item);
>> +             return NULL;
>> +     }
>> +
>> +next_node:
>> +     node = NULL;
>> +     offset = 0;
>> +     entry = rcu_dereference_raw(root->rnode);
>> +     shift = root_shift;
>> +
>> +     /* Descend to the index. Do at least one step. */
>> +     do {
>> +             child = entry_to_node(entry);
>> +             shift -= RADIX_TREE_MAP_SHIFT;
>> +             if (!child || !radix_tree_is_internal_node(entry)) {
>> +                     /* Entry wider than range */
>> +                     if (child) {
>> +                             error = -ERANGE;
>> +                             goto undo;
>> +                     }
>> +                     /* Hole wider tnan truncated range */
>> +                     if (!item)
>> +                             goto skip_node;
>> +                     child = radix_tree_node_alloc(root);
>> +                     if (!child) {
>> +                             error = -ENOMEM;
>> +                             goto undo;
>> +                     }
>> +                     child->shift = shift;
>> +                     child->offset = offset;
>> +                     child->parent = node;
>> +                     /* Populate range with retry entries. */
>> +                     if (flags & RADIX_TREE_FILL_ATOMIC) {
>> +                             int idx = (index >> shift) &
>> +                                        RADIX_TREE_MAP_MASK;
>> +                             int last = RADIX_TREE_MAP_SIZE;
>> +
>> +                             if (end < (index | shift_maxindex(shift)))
>> +                                     last = (end >> shift) &
>> +                                             RADIX_TREE_MAP_MASK;
>> +                             for (; idx <= last; idx++)
>> +                                     child->slots[idx] = RADIX_TREE_RETRY;
>> +                     }
>> +                     entry = node_to_entry(child);
>> +                     if (node) {
>> +                             rcu_assign_pointer(node->slots[offset], entry);
>> +                             node->count++;
>> +                     } else
>> +                             rcu_assign_pointer(root->rnode, entry);
>> +             }
>> +             node = child;
>> +             offset = (index >> shift) & RADIX_TREE_MAP_MASK;
>> +             entry = rcu_dereference_raw(node->slots[offset]);
>> +
>> +             /* Stop if find leaf or slot inside range */
>> +     } while ((flags & RADIX_TREE_FILL_LEAVES) ? shift :
>> +                     ((index & ((1ul << shift) - 1)) ||
>> +                      (index | ((1ul << shift) - 1)) > end));
>> +
>> +next_slot:
>> +     /* NULL or retry entry */
>> +     if (entry <= RADIX_TREE_RETRY)
>> +             goto fill;
>> +
>> +     if (!(flags & RADIX_TREE_FILL_OVERWRITE)) {
>> +             error = -EEXIST;
>> +             goto undo;
>> +     }
>> +
>> +     /* Cut sub-tree */
>> +     if (unlikely(radix_tree_is_internal_node(entry))) {
>> +             rcu_assign_pointer(node->slots[offset], item);
>> +             child = entry_to_node(entry);
>> +             offset = 0;
>> +             do {
>> +                     entry = rcu_dereference_raw(child->slots[offset]);
>> +                     if (entry)
>> +                             child->count--;
>> +                     if (radix_tree_is_internal_node(entry)) {
>> +                             child = entry_to_node(entry);
>> +                             offset = 0;
>> +                     } else if (++offset == RADIX_TREE_MAP_SIZE) {
>> +                             offset = child->offset;
>> +                             entry = child->parent;
>> +                             WARN_ON_ONCE(child->count);
>> +                             radix_tree_node_free(child);
>> +                             child = entry;
>> +                     }
>> +             } while (child != node);
>> +     }
>> +
>> +     if (flags & RADIX_TREE_FILL_CLEAR_TAGS) {
>> +             for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
>> +                     node_tag_clear(root, node, tag, offset);
>> +     }
>> +
>> +     /* Skip the rest if we're cleared class slot in node */
>> +     if (!--node->count && !item && __radix_tree_delete_node(root, node))
>> +             goto skip_node;
>> +
>> +
>> +fill:
>> +     rcu_assign_pointer(node->slots[offset], item);
>> +     if (item)
>> +             node->count++;
>> +
>> +     index += 1ul << shift;
>> +     if (index - 1 == end)
>> +             return node;
>> +
>> +     /* Next slot in this node and still in range */
>> +     if (index + (1ul << shift) - 1 <= end &&
>> +                     ++offset < RADIX_TREE_MAP_SIZE) {
>> +             entry = rcu_dereference_raw(node->slots[offset]);
>> +             goto next_slot;
>> +     }
>> +
>> +     goto next_node;
>> +
>> +skip_node:
>> +     index |= shift_maxindex(shift);
>> +     if (index++ >= end)
>> +             return node;
>> +     goto next_node;
>> +
>> +undo:
>> +     if (index > start)
>> +             radix_tree_fill_range(root, start, index - 1, NULL,
>> +                                   RADIX_TREE_FILL_OVERWRITE);
>> +     return ERR_PTR(error);
>> +}
>> +EXPORT_SYMBOL(radix_tree_fill_range);
>> +
>> +/**
>>   * radix_tree_range_tag_if_tagged - for each item in given range set given
>>   *                              tag if item has another tag set
>>   * @root:            radix tree root
>>


* RE: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-29 16:14     ` Konstantin Khlebnikov
@ 2016-08-29 18:08       ` Matthew Wilcox
  2016-08-29 18:15         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2016-08-29 18:08 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Ross Zwisler, linux-kernel

The DAX lock bit is analogous to the PageLock.  You can't serialise on the mapping lock; the contention will be too high.

And the point of having the radix tree support the same entry for many indices is that we don't have to go and probe the radix tree multiple times looking for the first or last entry.  We just look up the index, then use the entry we got back.
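
For instance, with the current representation the whole lookup is a single
probe (a sketch, assuming the usual mapping->page_tree):

/* A lookup at any index inside the huge range returns the one
 * canonical entry, with its single lock bit and its tags. */
void *entry = radix_tree_lookup(&mapping->page_tree, index);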

-----Original Message-----
From: Konstantin Khlebnikov [mailto:koct9i@gmail.com] 
Sent: Monday, August 29, 2016 12:14 PM
To: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>; linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range

On Mon, Aug 29, 2016 at 6:21 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> Thanks, Ross.
>
> Konstantin, I think there are problems with the concept behind this series.  You have multiple entries in the tree with the same value.  That works out fine when the entry is a pointer (eg to a struct page), but not so well when it's an exceptional entry (eg a swap cache entry or a DAX radix tree entry).  If you look at the recent DAX work, you'll see there's a lock bit, and having multiple lock bits is a recipe for disaster.
>

I see no problem here. They could use the lock bit at the first or last entry.
Anyway, all changes should be protected by the lock at the mapping.


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-29 18:08       ` Matthew Wilcox
@ 2016-08-29 18:15         ` Konstantin Khlebnikov
  2016-08-29 18:52           ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Konstantin Khlebnikov @ 2016-08-29 18:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ross Zwisler, linux-kernel

On Mon, Aug 29, 2016 at 9:08 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> The DAX lock bit is analogous to the PageLock.  You can't serialise on the mapping lock; the contention will be too high.

I mean the DAX lock bit is protected by the mapping lock, thus we can
set/clear it for all entries if needed.

>
> And the point of having the radix tree support the same entry for many indices is that we don't have to go and probe the radix tree multiple times looking for the first or last entry.  We just look up the index, then use the entry we got back.

Right, but the lookup function can also return a pointer to the node --
finding the first or last entry is trivial there.
We could simply scan backward/forward or, if the huge DAX order is known,
just align the slot pointer to the first or last entry.
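
Something like this, for illustration (an invented helper, assuming the
whole huge entry lives inside one node, i.e. order < RADIX_TREE_MAP_SHIFT):

static inline void **huge_first_slot(struct radix_tree_node *node,
				     void **slot, unsigned int order)
{
	unsigned long offset = slot - node->slots;

	/* align the slot offset down to the start of the huge entry */
	return node->slots + (offset & ~((1UL << order) - 1));
}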

>
> -----Original Message-----
> From: Konstantin Khlebnikov [mailto:koct9i@gmail.com]
> Sent: Monday, August 29, 2016 12:14 PM
> To: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
>
> On Mon, Aug 29, 2016 at 6:21 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
>> Thanks, Ross.
>>
>> Konstantin, I think there are problems with the concept behind this series.  You have multiple entries in the tree with the same value.  That works out fine when the entry is a pointer (eg to a struct page), but not so well when it's an exceptional entry (eg a swap cache entry or a DAX radix tree entry).  If you look at the recent DAX work, you'll see there's a lock bit, and having multiple lock bits is a recipe for disaster.
>>
>
> I see no problem here. They could use the lock bit at the first or last entry.
> Anyway, all changes should be protected by the lock at the mapping.
>
>


* RE: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-29 18:15         ` Konstantin Khlebnikov
@ 2016-08-29 18:52           ` Matthew Wilcox
  2016-08-30 21:56             ` Dan Williams
  2016-08-30 22:00             ` Ross Zwisler
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2016-08-29 18:52 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Ross Zwisler, linux-kernel

It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.

Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.

-----Original Message-----
From: Konstantin Khlebnikov [mailto:koct9i@gmail.com] 
Sent: Monday, August 29, 2016 2:16 PM
To: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>; linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range

On Mon, Aug 29, 2016 at 9:08 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> The DAX lock bit is analogous to the PageLock.  You can't serialise on the mapping lock; the contention will be too high.

I mean the DAX lock bit is protected by the mapping lock, thus we can set/clear it for all entries if needed.

>
> And the point of having the radix tree support the same entry for many indices is that we don't have to go and probe the radix tree multiple times looking for the first or last entry.  We just look up the index, then use the entry we got back.

Right, but the lookup function can also return a pointer to the node -- finding the first or last entry is trivial there.
We could simply scan backward/forward or, if the huge DAX order is known, just align the slot pointer to the first or last entry.

>
> -----Original Message-----
> From: Konstantin Khlebnikov [mailto:koct9i@gmail.com]
> Sent: Monday, August 29, 2016 12:14 PM
> To: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>; 
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH RFC 1/4] lib/radix: add universal 
> radix_tree_fill_range
>
> On Mon, Aug 29, 2016 at 6:21 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
>> Thanks, Ross.
>>
>> Konstantin, I think there are problems with the concept behind this series.  You have multiple entries in the tree with the same value.  That works out fine when the entry is a pointer (eg to a struct page), but not so well when it's an exceptional entry (eg a swap cache entry or a DAX radix tree entry).  If you look at the recent DAX work, you'll see there's a lock bit, and having multiple lock bits is a recipe for disaster.
>>
>
> I see no problem here. They could use the lock bit at the first or last entry.
> Anyway, all changes should be protected by the lock at the mapping.
>
>


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-29 18:52           ` Matthew Wilcox
@ 2016-08-30 21:56             ` Dan Williams
  2016-08-30 22:03               ` Ross Zwisler
  2016-08-30 22:00             ` Ross Zwisler
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-08-30 21:56 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Konstantin Khlebnikov, Ross Zwisler, linux-kernel

On Mon, Aug 29, 2016 at 11:52 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.
>
> Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.

I'd use a fill range api for the radix backing get_dev_pagemap() and
potentially another use in device-dax.  It centralizes the common
routine of breaking down a range into its constituent power-of-2
ranges.
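
For reference, a sketch of that decomposition loop (not an existing kernel
helper; index/len are the range being inserted):

/* split [index, index + len) into naturally aligned power-of-2 chunks */
while (len) {
	unsigned long order = __fls(len);
	unsigned long chunk;

	if (index)
		order = min(order, __ffs(index));
	chunk = 1UL << order;
	/* insert one order-'order' entry at 'index' here */
	index += chunk;
	len -= chunk;
}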


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-29 18:52           ` Matthew Wilcox
  2016-08-30 21:56             ` Dan Williams
@ 2016-08-30 22:00             ` Ross Zwisler
  1 sibling, 0 replies; 21+ messages in thread
From: Ross Zwisler @ 2016-08-30 22:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Konstantin Khlebnikov, Ross Zwisler, linux-kernel

On Mon, Aug 29, 2016 at 06:52:54PM +0000, Matthew Wilcox wrote:
> It may be protected by the mapping lock in the current code, but I would
> expect it to become an RCU lookup + lock eventually.  No mapping lock, just
> like the page cache.
> 
> Even if we can work around it, why do we want to?  What's the compelling
> reason to change from the current radix tree representation of order-N
> entries to an arbitrary range?  There are no in-kernel users right now; is
> there a performance reason to change?  We don't usually change an API in
> anticipation of future users appearing, particularly when the API makes it
> harder for the existing users to use it.

I do have a patch set out for review which uses the multi-order nature of the
radix tree:

https://lkml.org/lkml/2016/8/23/725

This code takes advantage of the fact that using the radix tree for an order-0
entry is the same as for a multi-order entry.  Both have a single lock bit,
and a single entry that I need to use for lookups, sets, locking and
unlocking.

My usage fits well with the current implementation of the radix tree, and I'd
like to keep it simple if I can.


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-30 21:56             ` Dan Williams
@ 2016-08-30 22:03               ` Ross Zwisler
  2016-08-30 22:21                 ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Ross Zwisler @ 2016-08-30 22:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, Konstantin Khlebnikov, Ross Zwisler, linux-kernel

On Tue, Aug 30, 2016 at 02:56:17PM -0700, Dan Williams wrote:
> On Mon, Aug 29, 2016 at 11:52 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> > It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.
> >
> > Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.
> 
> I'd use a fill range api for the radix backing get_dev_pagemap() and
> potentially another use in device-dax.  It centralizes the common
> routine of breaking down a range into its constituent power-of-2
> ranges.

Does your usage not work with the current sibling & canonical entry model?


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-30 22:03               ` Ross Zwisler
@ 2016-08-30 22:21                 ` Dan Williams
  2016-08-30 22:53                   ` Ross Zwisler
  2016-08-31 14:57                   ` Matthew Wilcox
  0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2016-08-30 22:21 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Matthew Wilcox, Konstantin Khlebnikov, linux-kernel

On Tue, Aug 30, 2016 at 3:03 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Aug 30, 2016 at 02:56:17PM -0700, Dan Williams wrote:
>> On Mon, Aug 29, 2016 at 11:52 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
>> > It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.
>> >
>> > Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.
>>
>> I'd use a fill range api for the radix backing get_dev_pagemap() and
>> potentially another use in device-dax.  It centralizes the common
>> routine of breaking down a range into its constituent power-of-2
>> ranges.
>
> Does your usage not work with the current sibling & canonical entry model?

It does, but I find myself writing code to walk a range and determine
the order of each entry as I insert them.  I can see other users
needing the same sort of insert helper and the aspect I like of
Konstantin's proposed change is that the functionality is part of the
core implementation rather than left to be duplicated in each user.


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-27 14:14 [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range Konstantin Khlebnikov
                   ` (3 preceding siblings ...)
       [not found] ` <20160827180927.GA22919@linux.intel.com>
@ 2016-08-30 22:39 ` Ross Zwisler
  4 siblings, 0 replies; 21+ messages in thread
From: Ross Zwisler @ 2016-08-30 22:39 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Kirill A. Shutemov, Ross Zwisler

On Sat, Aug 27, 2016 at 05:14:34PM +0300, Konstantin Khlebnikov wrote:
> Tags should be set only for last index of THP range: this way iterator
> will find them regardless of starting index.

I don't think this works well for DAX.  We really want to have the tags be
consistent for all indices within a multi-order range.  Meaning, if I take
an order-9 fault and then get a PTE write fault anywhere within that
range, I want to be able to do a lookup, find the one canonical entry that has
my dirty tags, flush, and eventually I want to be able to clear that one tag.

I agree that it's *possible* to do all of this with your code, but it puts a
lot of onus on the user.  I now have to have two paths: one for order-0
entries, and one for multi-order entries where I know to use a specific entry
as my canonical entry that I can count on for the lock bit, tags, etc.

This was actually the way it was done with the old PMD code.  We used
the first aligned index for the PMD to be the one source of truth.  On every
fault I would first check to see if there was a PMD aligned entry, and then if
not I would treat it like a normal 4k fault.  The multi-order radix tree with
sibling entries was a huge step forward.

I guess my question is the same as Matthew's: what is the problem you need to
solve with this code, and why can't the current code be made to solve it?


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-30 22:21                 ` Dan Williams
@ 2016-08-30 22:53                   ` Ross Zwisler
  2016-09-01  6:12                     ` Konstantin Khlebnikov
  2016-08-31 14:57                   ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Ross Zwisler @ 2016-08-30 22:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Matthew Wilcox, Konstantin Khlebnikov, linux-kernel

On Tue, Aug 30, 2016 at 03:21:24PM -0700, Dan Williams wrote:
> On Tue, Aug 30, 2016 at 3:03 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Aug 30, 2016 at 02:56:17PM -0700, Dan Williams wrote:
> >> On Mon, Aug 29, 2016 at 11:52 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> >> > It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.
> >> >
> >> > Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.
> >>
> >> I'd use a fill range api for the radix backing get_dev_pagemap() and
> >> potentially another use in device-dax.  It centralizes the common
> >> routine of breaking down a range into its constituent power-of-2
> >> ranges.
> >
> > Does your usage not work with the current sibling & canonical entry model?
> 
> It does, but I find myself writing code to walk a range and determine
> the order of each entry as I insert them.  I can see other users
> needing the same sort of insert helper and the aspect I like of
> Konstantin's proposed change is that the functionality is part of the
> core implementation rather than left to be duplicated in each user.

Perhaps the answer is to have them both?  Matthew's multi-order radix
functionality with siblings for those of us that really *want* a single
canonical entry that we can look up, use tags on, etc.   And Konstantin's
method where we insert a bunch of duplicate entries that don't have sibling
pointers?  Is there a reason why they can't coexist?


* RE: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-30 22:21                 ` Dan Williams
  2016-08-30 22:53                   ` Ross Zwisler
@ 2016-08-31 14:57                   ` Matthew Wilcox
  2016-08-31 16:36                     ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2016-08-31 14:57 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler; +Cc: Konstantin Khlebnikov, linux-kernel

I'm not at all against the idea of having a tree which supports ranges, except that we already have one: the interval tree.  Did you investigate using the interval tree for your use case?

-----Original Message-----
From: Dan Williams [mailto:dan.j.williams@intel.com] 
Sent: Tuesday, August 30, 2016 6:21 PM
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>; Konstantin Khlebnikov <koct9i@gmail.com>; linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range

On Tue, Aug 30, 2016 at 3:03 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote:
> On Tue, Aug 30, 2016 at 02:56:17PM -0700, Dan Williams wrote:
>> On Mon, Aug 29, 2016 at 11:52 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
>> > It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.
>> >
>> > Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.
>>
>> I'd use a fill range api for the radix backing get_dev_pagemap() and 
>> potentially another use in device-dax.  It centralizes the common 
>> routine of breaking down a range into its constituent power-of-2 
>> ranges.
>
> Does your usage not work with the current sibling & canonical entry model?

It does, but I find myself writing code to walk a range and determine the order of each entry as I insert them.  I can see other users needing the same sort of insert helper and the aspect I like of Konstantin's proposed change is that the functionality is part of the core implementation rather than left to be duplicated in each user.


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-31 14:57                   ` Matthew Wilcox
@ 2016-08-31 16:36                     ` Dan Williams
  2016-09-01  6:16                       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2016-08-31 16:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ross Zwisler, Konstantin Khlebnikov, linux-kernel

On Wed, Aug 31, 2016 at 7:57 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> I'm not at all against the idea of having a tree which supports ranges, except that we already have one: the interval tree.  Did you investigate using the interval tree for your use case?

I am continuing to investigate, but that is orthogonal to whether
Konstantin's changes are an improvement for the radix implementation.
Hmm, would we have ended up with two data-structures if a range-based
radix was available?

The benefits I see are that it simplifies insertion, as the caller no
longer needs to explicitly manage the order of the entries, and, iiuc,
lets the user skip the sibling-to-head conversion when it is not needed,
which simplifies lookups.


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-30 22:53                   ` Ross Zwisler
@ 2016-09-01  6:12                     ` Konstantin Khlebnikov
  2016-09-02 17:59                       ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Konstantin Khlebnikov @ 2016-09-01  6:12 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: Dan Williams, Matthew Wilcox, linux-kernel

On Wed, Aug 31, 2016 at 1:53 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Aug 30, 2016 at 03:21:24PM -0700, Dan Williams wrote:
>> On Tue, Aug 30, 2016 at 3:03 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > On Tue, Aug 30, 2016 at 02:56:17PM -0700, Dan Williams wrote:
>> >> On Mon, Aug 29, 2016 at 11:52 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
>> >> > It may be protected by the mapping lock in the current code, but I would expect it to become an RCU lookup + lock eventually.  No mapping lock, just like the page cache.
>> >> >
>> >> > Even if we can work around it, why do we want to?  What's the compelling reason to change from the current radix tree representation of order-N entries to an arbitrary range?  There are no in-kernel users right now; is there a performance reason to change?  We don't usually change an API in anticipation of future users appearing, particularly when the API makes it harder for the existing users to use it.
>> >>
>> >> I'd use a fill range api for the radix backing get_dev_pagemap() and
>> >> potentially another use in device-dax.  It centralizes the common
>> >> routine of breaking down a range into its constituent power-of-2
>> >> ranges.
>> >
>> > Does your usage not work with the current sibling & canonical entry model?
>>
>> It does, but I find myself writing code to walk a range and determine
>> the order of each entry as I insert them.  I can see other users
>> needing the same sort of insert helper and the aspect I like of
>> Konstantin's proposed change is that the functionality is part of the
>> core implementation rather than left to be duplicated in each user.
>
> Perhaps the answer is to have them both?  Matthew's multi-order radix
> functionality with siblings for those of us that really *want* a single
> canonical entry that we can look up, use tags on, etc.   And Konstantin's
> method where we insert a bunch of duplicate entries that don't have sibling
> pointers?  Is there a reason why they can't coexist?

I'm not at all against "sibling" entries, I just don't want to mix them
into the iterator and common lookup routines. This is redundant.

Actually it's very easy to integrate similar "sibling" entries into my
filling function.

That will be yet another flag which tells it to assign the given entry
only to the first slot and fill the following tail with references to
that first slot: just a pointer with both lower bits set, to distinguish
it from exceptional and internal pointers.

I think it's better to call them "indirect" entries because this will
also work for arbitrary ranges where they are not siblings at all and
may be located in several nodes.
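
A possible encoding, purely as a sketch (all names invented here, not part
of this series or the current radix-tree code):

#define RADIX_TREE_INDIRECT_MARK	3UL	/* both low bits set */

static inline bool entry_is_indirect(void *entry)
{
	return ((unsigned long)entry & 3UL) == RADIX_TREE_INDIRECT_MARK;
}

static inline void *make_indirect_entry(void **first_slot)
{
	return (void *)((unsigned long)first_slot | RADIX_TREE_INDIRECT_MARK);
}

static inline void **indirect_to_first_slot(void *entry)
{
	return (void **)((unsigned long)entry & ~RADIX_TREE_INDIRECT_MARK);
}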


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-08-31 16:36                     ` Dan Williams
@ 2016-09-01  6:16                       ` Konstantin Khlebnikov
  0 siblings, 0 replies; 21+ messages in thread
From: Konstantin Khlebnikov @ 2016-09-01  6:16 UTC (permalink / raw)
  To: Dan Williams; +Cc: Matthew Wilcox, Ross Zwisler, linux-kernel

On Wed, Aug 31, 2016 at 7:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Aug 31, 2016 at 7:57 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
>> I'm not at all against the idea of having a tree which supports ranges, except that we already have one; the interval tree.  Did you investigate using the interval tree for your use case?
>
> I am continuing to investigate, but that is orthogonal to whether
> Konstantin's changes are an improvement for the radix implementation.
> Hmm, would we have ended up with two data-structures if a range-based
> radix was available?

The interval tree is an augmented rb-tree. AFAIK it doesn't support RCU
lookup without special dances with sequence counters - some branches
disappear from RCU readers during rebalancing.
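
For reference, the usual dance looks something like this (a sketch,
assuming a seqcount_t tree_seq that writers bump around every rebalance):

struct interval_tree_node *node;
unsigned int seq;

rcu_read_lock();
do {
	seq = read_seqcount_begin(&tree_seq);
	node = interval_tree_iter_first(&root, start, last);
} while (read_seqcount_retry(&tree_seq, seq));
rcu_read_unlock();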

>
> The benefits I see is that it simplifies insertion as it no longer
> needs to explicitly manage the order of the entries, and, iiuc, let's
> the user skip the sibling-to-head conversion when it is not needed
> which simplifies lookups.


* RE: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-09-01  6:12                     ` Konstantin Khlebnikov
@ 2016-09-02 17:59                       ` Matthew Wilcox
  2016-09-03  4:50                         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2016-09-02 17:59 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Ross Zwisler; +Cc: Dan Williams, linux-kernel

I have a rewrite of the iterators; would you like to take a look?

http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-09-02

There are five distinct sets of changes in that tree:

1. Test suite enhancements (first 8 patches)
2. Split/Join (patches 9-11)
3. Misc cleanups (patches 12-16)
4. Iterator rewrite (patches 17-19)
5. IDR rewrite (patches 20-25)

I could rebase the cleanups & iterator rewrite on top of Linus' tree if we don't want to get the split/join functionality into 4.9.


* Re: [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range
  2016-09-02 17:59                       ` Matthew Wilcox
@ 2016-09-03  4:50                         ` Konstantin Khlebnikov
  0 siblings, 0 replies; 21+ messages in thread
From: Konstantin Khlebnikov @ 2016-09-03  4:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ross Zwisler, Dan Williams, linux-kernel

On Fri, Sep 2, 2016 at 8:59 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote:
> I have a rewrite of the iterators; would you like to take a look?
>
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-09-02
>
> There are five distinct sets of changes in that tree:
>
> 1. Test suite enhancements (first 8 patches)
> 2. Split/Join (patches 9-11)
> 3. Misc cleanups (patches 12-16)
> 4. Iterator rewrite (patches 17-19)

Have you compared performance?
There is a simple benchmark in my patchset.

> 5. IDR rewrite (patches 20-25)

Why? I don't see a reason for that.


end of thread

Thread overview: 21+ messages
2016-08-27 14:14 [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range Konstantin Khlebnikov
2016-08-27 14:16 ` [PATCH RFC 2/4] lib/radix-tree: remove sibling entries Konstantin Khlebnikov
2016-08-27 14:16 ` [PATCH RFC 3/4] testing/radix-tree: replace multi-order with range operations Konstantin Khlebnikov
2016-08-27 14:16 ` [PATCH RFC 4/4] testing/radix-tree: benchmark for iterator Konstantin Khlebnikov
     [not found] ` <20160827180927.GA22919@linux.intel.com>
2016-08-29 15:21   ` [PATCH RFC 1/4] lib/radix: add universal radix_tree_fill_range Matthew Wilcox
2016-08-29 16:14     ` Konstantin Khlebnikov
2016-08-29 18:08       ` Matthew Wilcox
2016-08-29 18:15         ` Konstantin Khlebnikov
2016-08-29 18:52           ` Matthew Wilcox
2016-08-30 21:56             ` Dan Williams
2016-08-30 22:03               ` Ross Zwisler
2016-08-30 22:21                 ` Dan Williams
2016-08-30 22:53                   ` Ross Zwisler
2016-09-01  6:12                     ` Konstantin Khlebnikov
2016-09-02 17:59                       ` Matthew Wilcox
2016-09-03  4:50                         ` Konstantin Khlebnikov
2016-08-31 14:57                   ` Matthew Wilcox
2016-08-31 16:36                     ` Dan Williams
2016-09-01  6:16                       ` Konstantin Khlebnikov
2016-08-30 22:00             ` Ross Zwisler
2016-08-30 22:39 ` Ross Zwisler
