* [PATCH v6 0/9] latched RB-trees and __module_address()
@ 2015-05-06 13:51 Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 1/9] rbtree: Make lockless searches non-fatal Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

This series is aimed at making __module_address() go fast(er).

The reason for doing so is that most stack unwinders use kernel_text_address()
to validate each frame. Perf and ftrace (can) end up doing a lot of stack
traces from performance-sensitive code.

On the way there it:
 - annotates and sanitizes module locking
 - introduces the latched RB-tree
 - employs it to make __module_address() go fast.

I've build- and boot-tested this on x86_64 with modules and lockdep
enabled.  Performance numbers (below) were taken with lockdep disabled.

As previously mentioned, the reason for writing the latched RB-tree as generic
code is mostly clarity/documentation; there are a number of separate and
non-trivial bits to the complete solution.

As measured on my ivb-ep system with 84 modules loaded, the test module reports
(cache hot, performance cpufreq):

          avg +- stdev
Before:   611 +- 10 [ns] per __module_address() call
After:     17 +-  5 [ns] per __module_address() call

PMI measurements for a cpu running loops in a module (also [ns]):

Before:	Mean: 2719 +- 1, Stdev: 214, Samples: 40036
After:  Mean:  947 +- 0, Stdev: 132, Samples: 40037

Note: I have also tested things like 'perf record -a -g modprobe
mod_test', to make 'sure' to hit some of the more interesting paths.
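
For reference, a minimal sketch of the kind of timing loop such a test module
might use (hypothetical: this is not the actual test module the numbers above
came from, and all names in it are made up):

#include <linux/module.h>
#include <linux/ktime.h>

static int __init mod_addr_bench_init(void)
{
	unsigned long addr = (unsigned long)mod_addr_bench_init;
	ktime_t t0, t1;
	int i;

	preempt_disable();	/* __module_address() wants preemption off */
	t0 = ktime_get();
	for (i = 0; i < 10000; i++)
		(void)__module_address(addr);
	t1 = ktime_get();
	preempt_enable();

	pr_info("__module_address: %lld ns/call\n",
		ktime_to_ns(ktime_sub(t1, t0)) / 10000);
	return 0;
}
module_init(mod_addr_bench_init);

static void __exit mod_addr_bench_exit(void) { }
module_exit(mod_addr_bench_exit);

MODULE_LICENSE("GPL");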

Changes since last time:

 - rebased against Rusty's tree
 - raw_read_seqcount_latch() -- (mingo)

Based on rusty/linux.git/pending-rebases; please consider for 4.2

Thanks!



* [PATCH v6 1/9] rbtree: Make lockless searches non-fatal
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 2/9] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

[-- Attachment #1: peterz-rbtree-lockless.patch --]
[-- Type: text/plain, Size: 10818 bytes --]

Change the insert and erase code such that lockless searches are
non-fatal.

An rbtree cannot, in and of itself, be correctly searched while it is
being modified; we can however provide weaker guarantees that will
allow the rbtree to be used in conjunction with other techniques, such
as latches; see 9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").

For this to work we need the following guarantees from the rbtree
code:

 1) a lockless reader must not see partial stores; this would allow it
    to observe nodes that are invalid memory.

 2) there must not be (temporary) loops in the tree structure in the
    modifier's program order; this would cause a lookup which
    interrupts the modifier to get stuck indefinitely.

For 1) we must use WRITE_ONCE() for all updates to the tree structure;
in particular this patch only does rb_{left,right} as those are the
only elements required for simple searches.

This generates slightly worse code, probably because of the volatile
accesses. But in pointer-chasing heavy code a few extra instructions
should not matter.

For 2) I have carefully audited the code, drawn every intermediate
link state, and found no loops.
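
For illustration, a minimal sketch (not part of this patch) of a lockless
search that relies only on the two guarantees above; rb_find_lockless() and
the node_key() callback are made-up names:

/*
 * May miss a node that is present (rotations are not atomic), so it has to
 * be combined with something like a seqcount latch; but it will never
 * dereference invalid memory nor loop forever.
 */
static struct rb_node *rb_find_lockless(struct rb_root *root, unsigned long key,
					unsigned long (*node_key)(struct rb_node *))
{
	struct rb_node *node = READ_ONCE(root->rb_node);

	while (node) {
		unsigned long nkey = node_key(node);

		if (key < nkey)
			node = READ_ONCE(node->rb_left);
		else if (key > nkey)
			node = READ_ONCE(node->rb_right);
		else
			break;
	}

	return node;
}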

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree.h           |   16 ++++++--
 include/linux/rbtree_augmented.h |   21 +++++++---
 lib/rbtree.c                     |   76 +++++++++++++++++++++++++++------------
 3 files changed, 81 insertions(+), 32 deletions(-)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -31,6 +31,7 @@
 
 #include <linux/kernel.h>
 #include <linux/stddef.h>
+#include <linux/rcupdate.h>
 
 struct rb_node {
 	unsigned long  __rb_parent_color;
@@ -73,11 +74,11 @@ extern struct rb_node *rb_first_postorde
 extern struct rb_node *rb_next_postorder(const struct rb_node *);
 
 /* Fast replacement of a single node without remove/rebalance/add/rebalance */
-extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, 
+extern void rb_replace_node(struct rb_node *victim, struct rb_node *new,
 			    struct rb_root *root);
 
-static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
-				struct rb_node ** rb_link)
+static inline void rb_link_node(struct rb_node *node, struct rb_node *parent,
+				struct rb_node **rb_link)
 {
 	node->__rb_parent_color = (unsigned long)parent;
 	node->rb_left = node->rb_right = NULL;
@@ -85,6 +86,15 @@ static inline void rb_link_node(struct r
 	*rb_link = node;
 }
 
+static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node *parent,
+				    struct rb_node **rb_link)
+{
+	node->__rb_parent_color = (unsigned long)parent;
+	node->rb_left = node->rb_right = NULL;
+
+	rcu_assign_pointer(*rb_link, node);
+}
+
 #define rb_entry_safe(ptr, type, member) \
 	({ typeof(ptr) ____ptr = (ptr); \
 	   ____ptr ? rb_entry(____ptr, type, member) : NULL; \
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -123,11 +123,11 @@ __rb_change_child(struct rb_node *old, s
 {
 	if (parent) {
 		if (parent->rb_left == old)
-			parent->rb_left = new;
+			WRITE_ONCE(parent->rb_left, new);
 		else
-			parent->rb_right = new;
+			WRITE_ONCE(parent->rb_right, new);
 	} else
-		root->rb_node = new;
+		WRITE_ONCE(root->rb_node, new);
 }
 
 extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root,
@@ -137,7 +137,8 @@ static __always_inline struct rb_node *
 __rb_erase_augmented(struct rb_node *node, struct rb_root *root,
 		     const struct rb_augment_callbacks *augment)
 {
-	struct rb_node *child = node->rb_right, *tmp = node->rb_left;
+	struct rb_node *child = node->rb_right;
+	struct rb_node *tmp = node->rb_left;
 	struct rb_node *parent, *rebalance;
 	unsigned long pc;
 
@@ -167,6 +168,7 @@ __rb_erase_augmented(struct rb_node *nod
 		tmp = parent;
 	} else {
 		struct rb_node *successor = child, *child2;
+
 		tmp = child->rb_left;
 		if (!tmp) {
 			/*
@@ -180,6 +182,7 @@ __rb_erase_augmented(struct rb_node *nod
 			 */
 			parent = successor;
 			child2 = successor->rb_right;
+
 			augment->copy(node, successor);
 		} else {
 			/*
@@ -201,19 +204,23 @@ __rb_erase_augmented(struct rb_node *nod
 				successor = tmp;
 				tmp = tmp->rb_left;
 			} while (tmp);
-			parent->rb_left = child2 = successor->rb_right;
-			successor->rb_right = child;
+			child2 = successor->rb_right;
+			WRITE_ONCE(parent->rb_left, child2);
+			WRITE_ONCE(successor->rb_right, child);
 			rb_set_parent(child, successor);
+
 			augment->copy(node, successor);
 			augment->propagate(parent, successor);
 		}
 
-		successor->rb_left = tmp = node->rb_left;
+		tmp = node->rb_left;
+		WRITE_ONCE(successor->rb_left, tmp);
 		rb_set_parent(tmp, successor);
 
 		pc = node->__rb_parent_color;
 		tmp = __rb_parent(pc);
 		__rb_change_child(node, successor, tmp, root);
+
 		if (child2) {
 			successor->__rb_parent_color = pc;
 			rb_set_parent_color(child2, parent, RB_BLACK);
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -44,6 +44,30 @@
  *  parentheses and have some accompanying text comment.
  */
 
+/*
+ * Notes on lockless lookups:
+ *
+ * All stores to the tree structure (rb_left and rb_right) must be done using
+ * WRITE_ONCE(). And we must not inadvertently cause (temporary) loops in the
+ * tree structure as seen in program order.
+ *
+ * These two requirements will allow lockless iteration of the tree -- not
+ * correct iteration mind you, tree rotations are not atomic so a lookup might
+ * miss entire subtrees.
+ *
+ * But they do guarantee that any such traversal will only see valid elements
+ * and that it will indeed complete -- does not get stuck in a loop.
+ *
+ * It also guarantees that if the lookup returns an element it is the 'correct'
+ * one. But not returning an element does _NOT_ mean it's not present.
+ *
+ * NOTE:
+ *
+ * Stores to __rb_parent_color are not important for simple lookups so those
+ * are left undone as of now. Nor did I check for loops involving parent
+ * pointers.
+ */
+
 static inline void rb_set_black(struct rb_node *rb)
 {
 	rb->__rb_parent_color |= RB_BLACK;
@@ -129,8 +153,9 @@ __rb_insert(struct rb_node *node, struct
 				 * This still leaves us in violation of 4), the
 				 * continuation into Case 3 will fix that.
 				 */
-				parent->rb_right = tmp = node->rb_left;
-				node->rb_left = parent;
+				tmp = node->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp);
+				WRITE_ONCE(node->rb_left, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -149,8 +174,8 @@ __rb_insert(struct rb_node *node, struct
 			 *     /                 \
 			 *    n                   U
 			 */
-			gparent->rb_left = tmp;  /* == parent->rb_right */
-			parent->rb_right = gparent;
+			WRITE_ONCE(gparent->rb_left, tmp); /* == parent->rb_right */
+			WRITE_ONCE(parent->rb_right, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -171,8 +196,9 @@ __rb_insert(struct rb_node *node, struct
 			tmp = parent->rb_left;
 			if (node == tmp) {
 				/* Case 2 - right rotate at parent */
-				parent->rb_left = tmp = node->rb_right;
-				node->rb_right = parent;
+				tmp = node->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp);
+				WRITE_ONCE(node->rb_right, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -183,8 +209,8 @@ __rb_insert(struct rb_node *node, struct
 			}
 
 			/* Case 3 - left rotate at gparent */
-			gparent->rb_right = tmp;  /* == parent->rb_left */
-			parent->rb_left = gparent;
+			WRITE_ONCE(gparent->rb_right, tmp); /* == parent->rb_left */
+			WRITE_ONCE(parent->rb_left, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -224,8 +250,9 @@ ____rb_erase_color(struct rb_node *paren
 				 *      / \         / \
 				 *     Sl  Sr      N   Sl
 				 */
-				parent->rb_right = tmp1 = sibling->rb_left;
-				sibling->rb_left = parent;
+				tmp1 = sibling->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp1);
+				WRITE_ONCE(sibling->rb_left, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -275,9 +302,10 @@ ____rb_erase_color(struct rb_node *paren
 				 *                       \
 				 *                        Sr
 				 */
-				sibling->rb_left = tmp1 = tmp2->rb_right;
-				tmp2->rb_right = sibling;
-				parent->rb_right = tmp2;
+				tmp1 = tmp2->rb_right;
+				WRITE_ONCE(sibling->rb_left, tmp1);
+				WRITE_ONCE(tmp2->rb_right, sibling);
+				WRITE_ONCE(parent->rb_right, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -297,8 +325,9 @@ ____rb_erase_color(struct rb_node *paren
 			 *        / \         / \
 			 *      (sl) sr      N  (sl)
 			 */
-			parent->rb_right = tmp2 = sibling->rb_left;
-			sibling->rb_left = parent;
+			tmp2 = sibling->rb_left;
+			WRITE_ONCE(parent->rb_right, tmp2);
+			WRITE_ONCE(sibling->rb_left, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);
@@ -310,8 +339,9 @@ ____rb_erase_color(struct rb_node *paren
 			sibling = parent->rb_left;
 			if (rb_is_red(sibling)) {
 				/* Case 1 - right rotate at parent */
-				parent->rb_left = tmp1 = sibling->rb_right;
-				sibling->rb_right = parent;
+				tmp1 = sibling->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp1);
+				WRITE_ONCE(sibling->rb_right, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -336,9 +366,10 @@ ____rb_erase_color(struct rb_node *paren
 					break;
 				}
 				/* Case 3 - right rotate at sibling */
-				sibling->rb_right = tmp1 = tmp2->rb_left;
-				tmp2->rb_left = sibling;
-				parent->rb_left = tmp2;
+				tmp1 = tmp2->rb_left;
+				WRITE_ONCE(sibling->rb_right, tmp1);
+				WRITE_ONCE(tmp2->rb_left, sibling);
+				WRITE_ONCE(parent->rb_left, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -347,8 +378,9 @@ ____rb_erase_color(struct rb_node *paren
 				sibling = tmp2;
 			}
 			/* Case 4 - left rotate at parent + color flips */
-			parent->rb_left = tmp2 = sibling->rb_right;
-			sibling->rb_right = parent;
+			tmp2 = sibling->rb_right;
+			WRITE_ONCE(parent->rb_left, tmp2);
+			WRITE_ONCE(sibling->rb_right, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);




* [PATCH v6 2/9] seqlock: Better document raw_write_seqcount_latch()
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 1/9] rbtree: Make lockless searches non-fatal Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 3/9] rcu: Move lockless_dereference() out of rcupdate.h Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

[-- Attachment #1: peterz-latch-comment.patch --]
[-- Type: text/plain, Size: 4960 bytes --]

Improve the documentation of the latch technique as used in the
current timekeeping code, such that it can be readily employed
elsewhere.

Borrow from the comments in timekeeping and replace those with a
reference to this more generic comment.
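
As a concrete (hypothetical) instantiation of the pattern documented below,
latching a single u64 so it can be read from any context, including NMI,
might look like this (latch_u64 and friends are made-up names):

struct latch_u64 {
	seqcount_t	seq;
	u64		data[2];
};

/* updates are assumed to be externally serialized */
static void latch_u64_set(struct latch_u64 *l, u64 val)
{
	raw_write_seqcount_latch(&l->seq);
	WRITE_ONCE(l->data[0], val);
	raw_write_seqcount_latch(&l->seq);
	WRITE_ONCE(l->data[1], val);
}

/* safe from any context, even if it interrupts the writer */
static u64 latch_u64_get(struct latch_u64 *l)
{
	unsigned int seq;
	u64 val;

	do {
		seq = raw_read_seqcount(&l->seq);
		val = READ_ONCE(l->data[seq & 1]);
	} while (read_seqcount_retry(&l->seq, seq));

	return val;
}

For a single word an atomic store would of course do; the latch only pays off
for larger structures, but the shape is the same.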

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/seqlock.h   |   76 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/time/timekeeping.c |   27 ----------------
 2 files changed, 76 insertions(+), 27 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -233,9 +233,83 @@ static inline void raw_write_seqcount_en
 	s->sequence++;
 }
 
-/*
+/**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
+ *
+ * The latch technique is a multiversion concurrency control method that allows
+ * queries during non-atomic modifications. If you can guarantee queries never
+ * interrupt the modification -- e.g. the concurrency is strictly between CPUs
+ * -- you most likely do not need this.
+ *
+ * Where the traditional RCU/lockless data structures rely on atomic
+ * modifications to ensure queries observe either the old or the new state the
+ * latch allows the same for non-atomic updates. The trade-off is doubling the
+ * cost of storage; we have to maintain two copies of the entire data
+ * structure.
+ *
+ * Very simply put: we first modify one copy and then the other. This ensures
+ * there is always one copy in a stable state, ready to give us an answer.
+ *
+ * The basic form is a data structure like:
+ *
+ * struct latch_struct {
+ *	seqcount_t		seq;
+ *	struct data_struct	data[2];
+ * };
+ *
+ * Where a modification, which is assumed to be externally serialized, does the
+ * following:
+ *
+ * void latch_modify(struct latch_struct *latch, ...)
+ * {
+ *	smp_wmb();	<- Ensure that the last data[1] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[0], ...);
+ *
+ *	smp_wmb();	<- Ensure that the data[0] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[1], ...);
+ * }
+ *
+ * The query will have a form like:
+ *
+ * struct entry *latch_query(struct latch_struct *latch, ...)
+ * {
+ *	struct entry *entry;
+ *	unsigned seq, idx;
+ *
+ *	do {
+ *		seq = latch->seq;
+ *		smp_rmb();
+ *
+ *		idx = seq & 0x01;
+ *		entry = data_query(latch->data[idx], ...);
+ *
+ *		smp_rmb();
+ *	} while (seq != latch->seq);
+ *
+ *	return entry;
+ * }
+ *
+ * So during the modification, queries are first redirected to data[1]. Then we
+ * modify data[0]. When that is complete, we redirect queries back to data[0]
+ * and we can modify data[1].
+ *
+ * NOTE: The non-requirement for atomic modifications does _NOT_ include
+ *       the publishing of new entries in the case where data is a dynamic
+ *       data structure.
+ *
+ *       An iteration might start in data[0] and get suspended long enough
+ *       to miss an entire modification sequence, once it resumes it might
+ *       observe the new entry.
+ *
+ * NOTE: When data is a dynamic data structure; one should use regular RCU
+ *       patterns to manage the lifetimes of the objects within.
  */
 static inline void raw_write_seqcount_latch(seqcount_t *s)
 {
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -330,32 +330,7 @@ static inline s64 timekeeping_get_ns(str
  * We want to use this from any context including NMI and tracing /
  * instrumenting the timekeeping code itself.
  *
- * So we handle this differently than the other timekeeping accessor
- * functions which retry when the sequence count has changed. The
- * update side does:
- *
- * smp_wmb();	<- Ensure that the last base[1] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[0], tkr);
- * smp_wmb();	<- Ensure that the base[0] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[1], tkr);
- *
- * The reader side does:
- *
- * do {
- *	seq = tkf->seq;
- *	smp_rmb();
- *	idx = seq & 0x01;
- *	now = now(tkf->base[idx]);
- *	smp_rmb();
- * } while (seq != tkf->seq)
- *
- * As long as we update base[0] readers are forced off to
- * base[1]. Once base[0] is updated readers are redirected to base[0]
- * and the base[1] update takes place.
+ * Employ the latch technique; see @raw_write_seqcount_latch.
  *
  * So if a NMI hits the update of base[0] then it will use base[1]
  * which is still consistent. In the worst case this can result is a




* [PATCH v6 3/9] rcu: Move lockless_dereference() out of rcupdate.h
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 1/9] rbtree: Make lockless searches non-fatal Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 2/9] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 4/9] seqlock: Introduce raw_read_seqcount_latch() Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-move-lockless_dereference.patch --]
[-- Type: text/plain, Size: 2475 bytes --]

I want to use lockless_dereference() from seqlock.h, which would mean
including rcupdate.h from it; however, rcupdate.h already includes
seqlock.h.

Avoid this by moving lockless_dereference() into compiler.h. This is
somewhat tricky since it uses smp_read_barrier_depends() which isn't
available there, but it's a CPP macro so we can get away with it.

The alternative would be moving it into asm/barrier.h, but that would
mean updating each arch (which I can do if people feel that is more
appropriate).
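
For illustration, a (hypothetical) use outside of RCU, where the pointed-to
objects are never freed; all names below are made up:

struct config {
	int	threshold;
	/* ... */
};

/* written rarely; old objects are kept around forever ("immortal") */
static struct config *active_config;

static int get_threshold(void)
{
	struct config *cfg = lockless_dereference(active_config);

	return cfg ? cfg->threshold : 0;
}

The writer side would still publish new objects with something like
rcu_assign_pointer(); only the reclaim half of RCU is not needed here.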

Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h |   15 +++++++++++++++
 include/linux/rcupdate.h |   15 ---------------
 2 files changed, 15 insertions(+), 15 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -453,6 +453,21 @@ static __always_inline void __write_once
 	(volatile typeof(x) *)&(x); })
 #define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
 
+/**
+ * lockless_dereference() - safely load a pointer for later dereference
+ * @p: The pointer to load
+ *
+ * Similar to rcu_dereference(), but for situations where the pointed-to
+ * object's lifetime is managed by something other than RCU.  That
+ * "something other" might be reference counting or simple immortality.
+ */
+#define lockless_dereference(p) \
+({ \
+	typeof(p) _________p1 = ACCESS_ONCE(p); \
+	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
+	(_________p1); \
+})
+
 /* Ignore/forbid kprobes attach on very low level functions marked by this attribute: */
 #ifdef CONFIG_KPROBES
 # define __kprobes	__attribute__((__section__(".kprobes.text")))
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -650,21 +650,6 @@ static inline void rcu_preempt_sleep_che
 #define RCU_INITIALIZER(v) (typeof(*(v)) __force __rcu *)(v)
 
 /**
- * lockless_dereference() - safely load a pointer for later dereference
- * @p: The pointer to load
- *
- * Similar to rcu_dereference(), but for situations where the pointed-to
- * object's lifetime is managed by something other than RCU.  That
- * "something other" might be reference counting or simple immortality.
- */
-#define lockless_dereference(p) \
-({ \
-	typeof(p) _________p1 = ACCESS_ONCE(p); \
-	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
-	(_________p1); \
-})
-
-/**
  * rcu_assign_pointer() - assign to RCU-protected pointer
  * @p: pointer to assign to
  * @v: value to assign (publish)




* [PATCH v6 4/9] seqlock: Introduce raw_read_seqcount_latch()
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 3/9] rcu: Move lockless_dereference() out of rcupdate.h Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 5/9] rbtree: Implement generic latch_tree Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-latch-read.patch --]
[-- Type: text/plain, Size: 1522 bytes --]

Because with latches there is a strict data dependency on the seq load,
we can avoid the rmb in favour of a read_barrier_depends().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/seqlock.h   |    9 +++++++--
 kernel/time/timekeeping.c |    2 +-
 2 files changed, 8 insertions(+), 3 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -35,6 +35,7 @@
 #include <linux/spinlock.h>
 #include <linux/preempt.h>
 #include <linux/lockdep.h>
+#include <linux/compiler.h>
 #include <asm/processor.h>
 
 /*
@@ -233,6 +234,11 @@ static inline void raw_write_seqcount_en
 	s->sequence++;
 }
 
+static inline int raw_read_seqcount_latch(seqcount_t *s)
+{
+	return lockless_dereference(s->sequence);
+}
+
 /**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
@@ -284,8 +290,7 @@ static inline void raw_write_seqcount_en
  *	unsigned seq, idx;
  *
  *	do {
- *		seq = latch->seq;
- *		smp_rmb();
+ *		seq = lockless_dereference(latch->seq);
  *
  *		idx = seq & 0x01;
  *		entry = data_query(latch->data[idx], ...);
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -393,7 +393,7 @@ static __always_inline u64 __ktime_get_f
 	u64 now;
 
 	do {
-		seq = raw_read_seqcount(&tkf->seq);
+		seq = raw_read_seqcount_latch(&tkf->seq);
 		tkr = tkf->base + (seq & 0x01);
 		now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);
 	} while (read_seqcount_retry(&tkf->seq, seq));




* [PATCH v6 5/9] rbtree: Implement generic latch_tree
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 4/9] seqlock: Introduce raw_read_seqcount_latch() Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 6/9] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz,
	Michel Lespinasse, Andrea Arcangeli, David Woodhouse,
	Rik van Riel

[-- Attachment #1: peterz-rbtree-latch.patch --]
[-- Type: text/plain, Size: 7590 bytes --]

Implement a latched RB-tree in order to get unconditional RCU/lockless
lookups.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_latch.h |  212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)

--- /dev/null
+++ b/include/linux/rbtree_latch.h
@@ -0,0 +1,212 @@
+/*
+ * Latched RB-trees
+ *
+ * Copyright (C) 2015 Intel Corp., Peter Zijlstra <peterz@infradead.org>
+ *
+ * Since RB-trees have non-atomic modifications they're not immediately suited
+ * for RCU/lockless queries. Even though we made RB-tree lookups non-fatal for
+ * lockless lookups, we cannot guarantee they return a correct result.
+ *
+ * The simplest solution is a seqlock + RB-tree, this will allow lockless
+ * lookups; but has the constraint (inherent to the seqlock) that read sides
+ * cannot nest in write sides.
+ *
+ * If we need to allow unconditional lookups (say as required for NMI context
+ * usage) we need a more complex setup; this data structure provides this by
+ * employing the latch technique -- see @raw_write_seqcount_latch -- to
+ * implement a latched RB-tree which does allow for unconditional lookups by
+ * virtue of always having (at least) one stable copy of the tree.
+ *
+ * However, while we have the guarantee that there is at all times one stable
+ * copy, this does not guarantee an iteration will not observe modifications.
+ * What might have been a stable copy at the start of the iteration, need not
+ * remain so for the duration of the iteration.
+ *
+ * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
+ * see the comment in lib/rbtree.c. Note however that we only require the first
+ * condition -- not seeing partial stores -- because the latch thing isolates
+ * us from loops. If we were to interrupt a modification the lookup would be
+ * pointed at the stable tree and complete while the modification was halted.
+ */
+
+#ifndef RB_TREE_LATCH_H
+#define RB_TREE_LATCH_H
+
+#include <linux/rbtree.h>
+#include <linux/seqlock.h>
+
+struct latch_tree_node {
+	struct rb_node node[2];
+};
+
+struct latch_tree_root {
+	seqcount_t	seq;
+	struct rb_root	tree[2];
+};
+
+/**
+ * latch_tree_ops - operators to define the tree order
+ * @less: used for insertion; provides the (partial) order between two elements.
+ * @comp: used for lookups; provides the order between the search key and an element.
+ *
+ * The operators are related like:
+ *
+ *	comp(a->key,b) < 0  := less(a,b)
+ *	comp(a->key,b) > 0  := less(b,a)
+ *	comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
+ *
+ * If these operators define a partial order on the elements we make no
+ * guarantee on which of the elements matching the key is found. See
+ * latch_tree_find().
+ */
+struct latch_tree_ops {
+	bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b);
+	int  (*comp)(void *key,                 struct latch_tree_node *b);
+};
+
+static __always_inline struct latch_tree_node *
+__lt_from_rb(struct rb_node *node, int idx)
+{
+	return container_of(node, struct latch_tree_node, node[idx]);
+}
+
+static __always_inline void
+__lt_insert(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx,
+	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
+{
+	struct rb_root *root = &ltr->tree[idx];
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *node = &ltn->node[idx];
+	struct rb_node *parent = NULL;
+	struct latch_tree_node *ltp;
+
+	while (*link) {
+		parent = *link;
+		ltp = __lt_from_rb(parent, idx);
+
+		if (less(ltn, ltp))
+			link = &parent->rb_left;
+		else
+			link = &parent->rb_right;
+	}
+
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, root);
+}
+
+static __always_inline void
+__lt_erase(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx)
+{
+	rb_erase(&ltn->node[idx], &ltr->tree[idx]);
+}
+
+static __always_inline struct latch_tree_node *
+__lt_find(void *key, struct latch_tree_root *ltr, int idx,
+	  int (*comp)(void *key, struct latch_tree_node *node))
+{
+	struct rb_node *node = rcu_dereference_raw(ltr->tree[idx].rb_node);
+	struct latch_tree_node *ltn;
+	int c;
+
+	while (node) {
+		ltn = __lt_from_rb(node, idx);
+		c = comp(key, ltn);
+
+		if (c < 0)
+			node = rcu_dereference_raw(node->rb_left);
+		else if (c > 0)
+			node = rcu_dereference_raw(node->rb_right);
+		else
+			return ltn;
+	}
+
+	return NULL;
+}
+
+/**
+ * latch_tree_insert() - insert @node into the trees @root
+ * @node: node to insert
+ * @root: trees to insert @node into
+ * @ops: operators defining the node order
+ *
+ * It inserts @node into @root in an ordered fashion such that we can always
+ * observe one complete tree. See the comment for raw_write_seqcount_latch().
+ *
+ * The inserts use rcu_assign_pointer() to publish the element such that the
+ * tree structure is stored before we can observe the new @node.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_insert(struct latch_tree_node *node,
+		  struct latch_tree_root *root,
+		  const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(node, root, 0, ops->less);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(node, root, 1, ops->less);
+}
+
+/**
+ * latch_tree_erase() - removes @node from the trees @root
+ * @node: node to remove
+ * @root: trees to remove @node from
+ * @ops: operators defining the node order
+ *
+ * Removes @node from the trees @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * It is assumed that @node will observe one RCU quiescent state before being
+ * reused or freed.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_erase(struct latch_tree_node *node,
+		 struct latch_tree_root *root,
+		 const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(node, root, 0);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(node, root, 1);
+}
+
+/**
+ * latch_tree_find() - find the node matching @key in the trees @root
+ * @key: search key
+ * @root: trees to search for @key
+ * @ops: operators defining the node order
+ *
+ * Does a lockless lookup in the trees @root for the node matching @key.
+ *
+ * It is assumed that this is called while holding the appropriate RCU read
+ * side lock.
+ *
+ * If the operators define a partial order on the elements (there are multiple
+ * elements which have the same key value) it is undefined which of these
+ * elements will be found. Nor is it possible to iterate the tree to find
+ * further elements with the same key value.
+ *
+ * Returns: a pointer to the node matching @key or NULL.
+ */
+static __always_inline struct latch_tree_node *
+latch_tree_find(void *key, struct latch_tree_root *root,
+		const struct latch_tree_ops *ops)
+{
+	struct latch_tree_node *node;
+	unsigned int seq;
+
+	do {
+		seq = raw_read_seqcount_latch(&root->seq);
+		node = __lt_find(key, root, seq & 1, ops->comp);
+	} while (read_seqcount_retry(&root->seq, seq));
+
+	return node;
+}
+
+#endif /* RB_TREE_LATCH_H */
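
For illustration, a minimal (hypothetical) user of this API, keyed on a single
unsigned long; struct foo, foo_tree and the two operators are made-up names:

struct foo {
	unsigned long		key;
	struct latch_tree_node	node;
};

static __always_inline bool foo_less(struct latch_tree_node *a,
				     struct latch_tree_node *b)
{
	return container_of(a, struct foo, node)->key <
	       container_of(b, struct foo, node)->key;
}

static __always_inline int foo_comp(void *key, struct latch_tree_node *n)
{
	unsigned long k = (unsigned long)key;
	unsigned long nk = container_of(n, struct foo, node)->key;

	if (k < nk)
		return -1;
	if (k > nk)
		return 1;
	return 0;
}

static const struct latch_tree_ops foo_tree_ops = {
	.less = foo_less,
	.comp = foo_comp,
};

static struct latch_tree_root foo_tree;

/* insert/erase must be serialized by the caller; find may run under RCU */
static void foo_insert(struct foo *f)
{
	latch_tree_insert(&f->node, &foo_tree, &foo_tree_ops);
}

static struct foo *foo_find(unsigned long key)
{
	struct latch_tree_node *ltn;

	ltn = latch_tree_find((void *)key, &foo_tree, &foo_tree_ops);
	return ltn ? container_of(ltn, struct foo, node) : NULL;
}

Removal would go through latch_tree_erase(), after which the object must
observe an RCU grace period before being reused or freed.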




* [PATCH v6 6/9] module: Optimize __module_address() using a latched RB-tree
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 5/9] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 7/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-latch-tree.patch --]
[-- Type: text/plain, Size: 6890 bytes --]

Currently __module_address() uses a linear search through all
modules in order to find the module corresponding to the provided
address. With a lot of modules this can take a lot of time.

One of the users of this is kernel_text_address(), which is employed
by many stack unwinders, which in turn are used by perf-callchain and
ftrace (possibly from NMI context).

So by optimizing __module_address() we optimize many stack unwinders
used by both perf and tracing in performance-sensitive code.
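
For illustration, the shape of a (hypothetical) caller on such a path; this
mirrors what is_module_address() already does and is not new code added here:

static bool addr_in_module(unsigned long addr)
{
	bool ret;

	preempt_disable();	/* the lookup is protected by RCU-sched */
	ret = __module_address(addr) != NULL;
	preempt_enable();

	return ret;
}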

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   29 +++++++++++-
 kernel/module.c        |  115 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 136 insertions(+), 8 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include <linux/moduleparam.h>
 #include <linux/jump_label.h>
 #include <linux/export.h>
+#include <linux/rbtree_latch.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -210,6 +211,13 @@ enum module_state {
 	MODULE_STATE_UNFORMED,	/* Still setting it up. */
 };
 
+struct module;
+
+struct mod_tree_node {
+	struct module *mod;
+	struct latch_tree_node node;
+};
+
 struct module {
 	enum module_state state;
 
@@ -269,8 +277,15 @@ struct module {
 	/* Startup function. */
 	int (*init)(void);
 
-	/* If this is non-NULL, vfree after init() returns */
-	void *module_init;
+	/*
+	 * If this is non-NULL, vfree() after init() returns.
+	 *
+	 * Cacheline align here, such that:
+	 *   module_init, module_core, init_size, core_size,
+	 *   init_text_size, core_text_size and ltn_core.node[0]
+	 * are on the same cacheline.
+	 */
+	void *module_init	____cacheline_aligned;
 
 	/* Here is the actual code + data, vfree'd on unload. */
 	void *module_core;
@@ -281,6 +296,14 @@ struct module {
 	/* The size of the executable code in each section.  */
 	unsigned int init_text_size, core_text_size;
 
+	/*
+	 * We want mtn_core::{mod,node[0]} to be in the same cacheline as the
+	 * above entries such that a regular lookup will only touch one
+	 * cacheline.
+	 */
+	struct mod_tree_node	mtn_core;
+	struct mod_tree_node	mtn_init;
+
 	/* Size of RO sections of the module (text+rodata) */
 	unsigned int init_ro_size, core_ro_size;
 
@@ -367,7 +390,7 @@ struct module {
 	ctor_fn_t *ctors;
 	unsigned int num_ctors;
 #endif
-};
+} ____cacheline_aligned;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
 #endif
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -101,6 +101,108 @@
 DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
+
+/*
+ * Use a latched RB-tree for __module_address(); this allows us to use
+ * RCU-sched lookups of the address from any context.
+ *
+ * Because modules have two address ranges: init and core, we need two
+ * latch_tree_nodes entries. Therefore we need the back-pointer from
+ * mod_tree_node.
+ *
+ * Because init ranges are short lived we mark them unlikely and have placed
+ * them outside the critical cacheline in struct module.
+ */
+
+static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
+{
+	struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+	struct module *mod = mtn->mod;
+
+	if (unlikely(mtn == &mod->mtn_init))
+		return (unsigned long)mod->module_init;
+
+	return (unsigned long)mod->module_core;
+}
+
+static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
+{
+	struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+	struct module *mod = mtn->mod;
+
+	if (unlikely(mtn == &mod->mtn_init))
+		return (unsigned long)mod->init_size;
+
+	return (unsigned long)mod->core_size;
+}
+
+static __always_inline bool
+mod_tree_less(struct latch_tree_node *a, struct latch_tree_node *b)
+{
+	return __mod_tree_val(a) < __mod_tree_val(b);
+}
+
+static __always_inline int
+mod_tree_comp(void *key, struct latch_tree_node *n)
+{
+	unsigned long val = (unsigned long)key;
+	unsigned long start, end;
+
+	start = __mod_tree_val(n);
+	if (val < start)
+		return -1;
+
+	end = start + __mod_tree_size(n);
+	if (val >= end)
+		return 1;
+
+	return 0;
+}
+
+static const struct latch_tree_ops mod_tree_ops = {
+	.less = mod_tree_less,
+	.comp = mod_tree_comp,
+};
+
+static struct latch_tree_root mod_tree __cacheline_aligned;
+
+/*
+ * These modifications: insert, remove_init and remove; are serialized by the
+ * module_mutex.
+ */
+static void mod_tree_insert(struct module *mod)
+{
+	mod->mtn_core.mod = mod;
+	mod->mtn_init.mod = mod;
+
+	latch_tree_insert(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	if (mod->init_size)
+		latch_tree_insert(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+}
+
+static void mod_tree_remove_init(struct module *mod)
+{
+	if (mod->init_size)
+		latch_tree_erase(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+}
+
+static void mod_tree_remove(struct module *mod)
+{
+	latch_tree_erase(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	mod_tree_remove_init(mod);
+}
+
+static struct module *mod_tree_find(unsigned long addr)
+{
+	struct latch_tree_node *ltn;
+
+	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+	if (!ltn)
+		return NULL;
+
+	return container_of(ltn, struct mod_tree_node, node)->mod;
+}
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -1852,6 +1954,7 @@ static void free_module(struct module *m
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
 	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
@@ -3119,6 +3222,7 @@ static noinline int do_init_module(struc
 	mod->symtab = mod->core_symtab;
 	mod->strtab = mod->core_strtab;
 #endif
+	mod_tree_remove_init(mod);
 	unset_module_init_ro_nx(mod);
 	module_arch_freeing_init(mod);
 	mod->module_init = NULL;
@@ -3189,6 +3293,7 @@ static int add_unformed_module(struct mo
 		goto out;
 	}
 	list_add_rcu(&mod->list, &modules);
+	mod_tree_insert(mod);
 	err = 0;
 
 out:
@@ -3831,13 +3936,13 @@ struct module *__module_address(unsigned
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
-	list_for_each_entry_rcu(mod, &modules, list) {
+	mod = mod_tree_find(addr);
+	if (mod) {
+		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod))
-			return mod;
+			mod = NULL;
 	}
-	return NULL;
+	return mod;
 }
 EXPORT_SYMBOL_GPL(__module_address);
 




* [PATCH v6 7/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 6/9] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 8/9] module: Use __module_address() for module_address_lookup() Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz, Andrew Morton

[-- Attachment #1: peterz-modtree-optional.patch --]
[-- Type: text/plain, Size: 3661 bytes --]

Andrew worried about the overhead on small systems; only use the fancy
code when either perf or tracing is enabled.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |    4 +++-
 init/Kconfig           |    4 ++++
 kernel/module.c        |   30 ++++++++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 3 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -282,7 +282,7 @@ struct module {
 	 *
 	 * Cacheline align here, such that:
 	 *   module_init, module_core, init_size, core_size,
-	 *   init_text_size, core_text_size and ltn_core.node[0]
+	 *   init_text_size, core_text_size and mtn_core::{mod,node[0]}
 	 * are on the same cacheline.
 	 */
 	void *module_init	____cacheline_aligned;
@@ -296,6 +296,7 @@ struct module {
 	/* The size of the executable code in each section.  */
 	unsigned int init_text_size, core_text_size;
 
+#ifdef CONFIG_MODULES_TREE_LOOKUP
 	/*
 	 * We want mtn_core::{mod,node[0]} to be in the same cacheline as the
 	 * above entries such that a regular lookup will only touch one
@@ -303,6 +304,7 @@ struct module {
 	 */
 	struct mod_tree_node	mtn_core;
 	struct mod_tree_node	mtn_init;
+#endif
 
 	/* Size of RO sections of the module (text+rodata) */
 	unsigned int init_ro_size, core_ro_size;
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1978,6 +1978,10 @@ endchoice
 
 endif # MODULES
 
+config MODULES_TREE_LOOKUP
+	def_bool y
+	depends on PERF_EVENTS || TRACING
+
 config INIT_ALL_POSSIBLE
 	bool
 	help
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -102,6 +102,8 @@ DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
+#ifdef CONFIG_MODULES_TREE_LOOKUP
+
 /*
  * Use a latched RB-tree for __module_address(); this allows us to use
  * RCU-sched lookups of the address from any context.
@@ -112,6 +114,10 @@ static LIST_HEAD(modules);
  *
  * Because init ranges are short lived we mark them unlikely and have placed
  * them outside the critical cacheline in struct module.
+ *
+ * This is conditional on PERF_EVENTS || TRACING because those can really hit
+ * __module_address() hard by doing a lot of stack unwinding; potentially from
+ * NMI context.
  */
 
 static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
@@ -192,7 +198,7 @@ static void mod_tree_remove(struct modul
 	mod_tree_remove_init(mod);
 }
 
-static struct module *mod_tree_find(unsigned long addr)
+static struct module *mod_find(unsigned long addr)
 {
 	struct latch_tree_node *ltn;
 
@@ -203,6 +209,26 @@ static struct module *mod_tree_find(unsi
 	return container_of(ltn, struct mod_tree_node, node)->mod;
 }
 
+#else /* MODULES_TREE_LOOKUP */
+
+static void mod_tree_insert(struct module *mod) { }
+static void mod_tree_remove_init(struct module *mod) { }
+static void mod_tree_remove(struct module *mod) { }
+
+static struct module *mod_find(unsigned long addr)
+{
+	struct module *mod;
+
+	list_for_each_entry_rcu(mod, &modules, list) {
+		if (within_module(addr, mod))
+			return mod;
+	}
+
+	return NULL;
+}
+
+#endif /* MODULES_TREE_LOOKUP */
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -3963,7 +3989,7 @@ struct module *__module_address(unsigned
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
-	mod = mod_tree_find(addr);
+	mod = mod_find(addr);
 	if (mod) {
 		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)




* [PATCH v6 8/9] module: Use __module_address() for module_address_lookup()
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (6 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 7/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-06 13:51 ` [PATCH v6 9/9] module: Rework module_addr_{min,max} Peter Zijlstra
  2015-05-07  1:20 ` [PATCH v6 0/9] latched RB-trees and __module_address() Rusty Russell
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-use-__module_address.patch --]
[-- Type: text/plain, Size: 1127 bytes --]

Use the generic __module_address() addr-to-struct-module lookup
instead of open-coding it once more.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3515,19 +3515,15 @@ const char *module_address_lookup(unsign
 			    char **modname,
 			    char *namebuf)
 {
-	struct module *mod;
 	const char *ret = NULL;
+	struct module *mod;
 
 	preempt_disable();
-	list_for_each_entry_rcu(mod, &modules, list) {
-		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod)) {
-			if (modname)
-				*modname = mod->name;
-			ret = get_ksymbol(mod, addr, size, offset);
-			break;
-		}
+	mod = __module_address(addr);
+	if (mod) {
+		if (modname)
+			*modname = mod->name;
+		ret = get_ksymbol(mod, addr, size, offset);
 	}
 	/* Make a copy in here where it's safe */
 	if (ret) {
@@ -3535,6 +3531,7 @@ const char *module_address_lookup(unsign
 		ret = namebuf;
 	}
 	preempt_enable();
+
 	return ret;
 }
 




* [PATCH v6 9/9] module: Rework module_addr_{min,max}
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (7 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 8/9] module: Use __module_address() for module_address_lookup() Peter Zijlstra
@ 2015-05-06 13:51 ` Peter Zijlstra
  2015-05-07  1:20 ` [PATCH v6 0/9] latched RB-trees and __module_address() Rusty Russell
  9 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-06 13:51 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-mod_tree-line.patch --]
[-- Type: text/plain, Size: 5784 bytes --]

__module_address() does an initial bound check before doing the
{list/tree} iteration to find the actual module. The bound variables
are nowhere near the mod_tree cacheline; in fact, they're nowhere near
one another.

module_addr_min lives in .data while module_addr_max lives in .bss
(smarty-pants GCC sees the explicit 0 assignment as plain
zero-initialization and places the variable in .bss anyway).

Rectify this by moving the two variables into a structure together
with the latch_tree_root to guarantee they all share the same
cacheline and avoid hitting two extra cachelines for the lookup.

While reworking the bounds code, move the bound update from allocation
to insertion time; this avoids updating the bounds for a few error
paths.
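
For illustration, the section placement described above is a plain C detail
(sketch, not part of the patch):

/* non-zero initializer: the variable ends up in .data */
static unsigned long module_addr_min = -1UL;

/* zero initializer: GCC treats it as zero-initialized and uses .bss */
static unsigned long module_addr_max = 0;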

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   84 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 54 insertions(+), 30 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -170,7 +170,26 @@ static const struct latch_tree_ops mod_t
 	.comp = mod_tree_comp,
 };
 
-static struct latch_tree_root mod_tree __cacheline_aligned;
+static struct mod_tree_root {
+	struct latch_tree_root root;
+	unsigned long addr_min;
+	unsigned long addr_max;
+} mod_tree __cacheline_aligned = {
+	.addr_min = -1UL,
+};
+
+#define module_addr_min mod_tree.addr_min
+#define module_addr_max mod_tree.addr_max
+
+static noinline void __mod_tree_insert(struct mod_tree_node *node)
+{
+	latch_tree_insert(&node->node, &mod_tree.root, &mod_tree_ops);
+}
+
+static void __mod_tree_remove(struct mod_tree_node *node)
+{
+	latch_tree_erase(&node->node, &mod_tree.root, &mod_tree_ops);
+}
 
 /*
  * These modifications: insert, remove_init and remove; are serialized by the
@@ -181,20 +200,20 @@ static void mod_tree_insert(struct modul
 	mod->mtn_core.mod = mod;
 	mod->mtn_init.mod = mod;
 
-	latch_tree_insert(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	__mod_tree_insert(&mod->mtn_core);
 	if (mod->init_size)
-		latch_tree_insert(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+		__mod_tree_insert(&mod->mtn_init);
 }
 
 static void mod_tree_remove_init(struct module *mod)
 {
 	if (mod->init_size)
-		latch_tree_erase(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+		__mod_tree_remove(&mod->mtn_init);
 }
 
 static void mod_tree_remove(struct module *mod)
 {
-	latch_tree_erase(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	__mod_tree_remove(&mod->mtn_core);
 	mod_tree_remove_init(mod);
 }
 
@@ -202,7 +221,7 @@ static struct module *mod_find(unsigned
 {
 	struct latch_tree_node *ltn;
 
-	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+	ltn = latch_tree_find((void *)addr, &mod_tree.root, &mod_tree_ops);
 	if (!ltn)
 		return NULL;
 
@@ -211,6 +230,8 @@ static struct module *mod_find(unsigned
 
 #else /* MODULES_TREE_LOOKUP */
 
+static unsigned long module_addr_min = -1UL, module_addr_max = 0;
+
 static void mod_tree_insert(struct module *mod) { }
 static void mod_tree_remove_init(struct module *mod) { }
 static void mod_tree_remove(struct module *mod) { }
@@ -229,6 +250,28 @@ static struct module *mod_find(unsigned
 
 #endif /* MODULES_TREE_LOOKUP */
 
+/*
+ * Bounds of module text, for speeding up __module_address.
+ * Protected by module_mutex.
+ */
+static void __mod_update_bounds(void *base, unsigned int size)
+{
+	unsigned long min = (unsigned long)base;
+	unsigned long max = min + size;
+
+	if (min < module_addr_min)
+		module_addr_min = min;
+	if (max > module_addr_max)
+		module_addr_max = max;
+}
+
+static void mod_update_bounds(struct module *mod)
+{
+	__mod_update_bounds(mod->module_core, mod->core_size);
+	if (mod->init_size)
+		__mod_update_bounds(mod->module_init, mod->init_size);
+}
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -297,10 +340,6 @@ static DECLARE_WAIT_QUEUE_HEAD(module_wq
 
 static BLOCKING_NOTIFIER_HEAD(module_notify_list);
 
-/* Bounds of module allocation, for speeding __module_address.
- * Protected by module_mutex. */
-static unsigned long module_addr_min = -1UL, module_addr_max = 0;
-
 int register_module_notifier(struct notifier_block *nb)
 {
 	return blocking_notifier_chain_register(&module_notify_list, nb);
@@ -2539,22 +2578,6 @@ void * __weak module_alloc(unsigned long
 	return vmalloc_exec(size);
 }
 
-static void *module_alloc_update_bounds(unsigned long size)
-{
-	void *ret = module_alloc(size);
-
-	if (ret) {
-		mutex_lock(&module_mutex);
-		/* Update module bounds. */
-		if ((unsigned long)ret < module_addr_min)
-			module_addr_min = (unsigned long)ret;
-		if ((unsigned long)ret + size > module_addr_max)
-			module_addr_max = (unsigned long)ret + size;
-		mutex_unlock(&module_mutex);
-	}
-	return ret;
-}
-
 #ifdef CONFIG_DEBUG_KMEMLEAK
 static void kmemleak_load_module(const struct module *mod,
 				 const struct load_info *info)
@@ -2956,7 +2979,7 @@ static int move_module(struct module *mo
 	void *ptr;
 
 	/* Do the allocs. */
-	ptr = module_alloc_update_bounds(mod->core_size);
+	ptr = module_alloc(mod->core_size);
 	/*
 	 * The pointer to this block is stored in the module structure
 	 * which is inside the block. Just mark it as not being a
@@ -2970,7 +2993,7 @@ static int move_module(struct module *mo
 	mod->module_core = ptr;
 
 	if (mod->init_size) {
-		ptr = module_alloc_update_bounds(mod->init_size);
+		ptr = module_alloc(mod->init_size);
 		/*
 		 * The pointer to this block is stored in the module structure
 		 * which is inside the block. This block doesn't need to be
@@ -3340,6 +3363,7 @@ static int add_unformed_module(struct mo
 		err = -EEXIST;
 		goto out;
 	}
+	mod_update_bounds(mod);
 	list_add_rcu(&mod->list, &modules);
 	mod_tree_insert(mod);
 	err = 0;




* Re: [PATCH v6 0/9] latched RB-trees and __module_address()
  2015-05-06 13:51 [PATCH v6 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (8 preceding siblings ...)
  2015-05-06 13:51 ` [PATCH v6 9/9] module: Rework module_addr_{min,max} Peter Zijlstra
@ 2015-05-07  1:20 ` Rusty Russell
  2015-05-07 19:28   ` Ingo Molnar
  9 siblings, 1 reply; 14+ messages in thread
From: Rusty Russell @ 2015-05-07  1:20 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

Peter Zijlstra <peterz@infradead.org> writes:
> This series is aimed at making __module_address() go fast(er).

Acked-by: Rusty Russell <rusty@rustcorp.com.au> (module parts)

Since all the interesting stuff is not module-specific, assume
this is via Ingo?  Otherwise, I'll take it...

Thanks,
Rusty.

>
> The reason for doing so is that most stack unwinders use kernel_text_address()
> to validate each frame. Perf and ftrace (can) end up doing a lot of stack
> traces from performance sensitive code.
>
> On the way there it:
>  - annotates and sanitizes module locking
>  - introduces the latched RB-tree
>  - employs it to make __module_address() go fast.
>
> I've build and boot tested this on x86_64 with modules and lockdep
> enabled.  Performance numbers (below) are done with lockdep disabled.
>
> As previously mentioned; the reason for writing the latched RB-tree as generic
> code is mostly for clarity/documentation purposes; as there are a number of
> separate and non trivial bits to the complete solution.
>
> As measured on my ivb-ep system with 84 modules loaded; the test module reports
> (cache hot, performance cpufreq):
>
>           avg +- stdev
> Before:   611 +- 10 [ns] per __module_address() call
> After:     17 +-  5 [ns] per __module_address() call
>
> PMI measurements for a cpu running loops in a module (also [ns]):
>
> Before:	Mean: 2719 +- 1, Stdev: 214, Samples: 40036
> After:  Mean:  947 +- 0, Stdev: 132, Samples: 40037
>
> Note; I have also tested things like: perf record -a -g modprobe
> mod_test, to make 'sure' to hit some of the more interesting paths.
>
> Changes since last time:
>
>  - rebased against Rusty's tree
>  - raw_read_seqcount_latch() -- (mingo)
>
> Based on rusty/linux.git/pending-rebases; please consider for 4.2
>
> Thanks!


* Re: [PATCH v6 0/9] latched RB-trees and __module_address()
  2015-05-07  1:20 ` [PATCH v6 0/9] latched RB-trees and __module_address() Rusty Russell
@ 2015-05-07 19:28   ` Ingo Molnar
  2015-05-08 17:42     ` Rusty Russell
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2015-05-07 19:28 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Peter Zijlstra, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, laijs, linux


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> Peter Zijlstra <peterz@infradead.org> writes:
> > This series is aimed at making __module_address() go fast(er).
> 
> Acked-by: Rusty Russell <rusty@rustcorp.com.au> (module parts)
> 
> Since all the interesting stuff is not module-specific, assume this 
> is via Ingo?  Otherwise, I'll take it...

I can certainly take them, but since I think that the _breakages_ are 
going to be in module land foremost, it should be rather under your 
watchful eyes? :-)

Thanks,

	Ingo


* Re: [PATCH v6 0/9] latched RB-trees and __module_address()
  2015-05-07 19:28   ` Ingo Molnar
@ 2015-05-08 17:42     ` Rusty Russell
  2015-05-12 11:52       ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Rusty Russell @ 2015-05-08 17:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, laijs, linux

Ingo Molnar <mingo@kernel.org> writes:
> * Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>> Peter Zijlstra <peterz@infradead.org> writes:
>> > This series is aimed at making __module_address() go fast(er).
>> 
>> Acked-by: Rusty Russell <rusty@rustcorp.com.au> (module parts)
>> 
>> Since all the interesting stuff is not module-specific, assume this 
>> is via Ingo?  Otherwise, I'll take it...
>
> I can certainly take them, but since I think that the _breakages_ are 
> going to be in module land foremost, it should be rather under your 
> watchful eyes? :-)

Ingo, I feel like you just gave me a free puppy...

Applied,
Rusty.


* Re: [PATCH v6 0/9] latched RB-trees and __module_address()
  2015-05-08 17:42     ` Rusty Russell
@ 2015-05-12 11:52       ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2015-05-12 11:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, laijs, linux

On Sat, May 09, 2015 at 03:12:32AM +0930, Rusty Russell wrote:
> Ingo Molnar <mingo@kernel.org> writes:
> > * Rusty Russell <rusty@rustcorp.com.au> wrote:
> >
> >> Peter Zijlstra <peterz@infradead.org> writes:
> >> > This series is aimed at making __module_address() go fast(er).
> >> 
> >> Acked-by: Rusty Russell <rusty@rustcorp.com.au> (module parts)
> >> 
> >> Since all the interesting stuff is not module-specific, assume this 
> >> is via Ingo?  Otherwise, I'll take it...
> >
> > I can certainly take them, but since I think that the _breakages_ are 
> > going to be in module land foremost, it should be rather under your 
> > watchful eyes? :-)
> 
> Ingo, I feel like you just gave me a free puppy...

Hehe, thanks!

