linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/9] latched RB-trees and __module_address()
@ 2015-04-08 16:48 Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 1/9] module: Sanitize RCU usage and locking Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

This series is aimed at making __module_address() go fast(er).

The reason for doing so is that most stack unwinders use kernel_text_address()
to validate each frame. Perf and ftrace (can) end up doing a lot of stack
traces from performance-sensitive code.

On the way there it:
 - annotates and sanitizes module locking
 - introduces the latched RB-tree
 - employs it to make __module_address() go fast.

I've built and boot-tested this on x86_64 with modules and lockdep
enabled.  Performance numbers (below) were taken with lockdep disabled.

As previously mentioned, the reason for writing the latched RB-tree as generic
code is mostly for clarity/documentation purposes, as there are a number of
separate and non-trivial bits to the complete solution.

As measured on my ivb-ep system with 84 modules loaded, the test module
(below) reports:

          avg +- stdev
Before:  1689 +- 287 [ns] per __module_address() call
After:    137 +-  38 [ns] per __module_address() call

Note: I have also tested things like 'perf record -a -g modprobe mod_test'
to make sure to hit some of the more interesting paths.
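
The test module itself is not included in this excerpt. Purely for
illustration (this is not the module that produced the numbers above), a
minimal measurement module could look roughly like:

	#include <linux/module.h>
	#include <linux/ktime.h>
	#include <linux/preempt.h>

	static int __init mod_test_init(void)
	{
		/* an address known to be inside this module */
		unsigned long addr = (unsigned long)&mod_test_init;
		const int iters = 100000;
		ktime_t t0, t1;
		int i;

		t0 = ktime_get();
		for (i = 0; i < iters; i++) {
			/* __module_address() wants preemption disabled (RCU-sched) */
			preempt_disable();
			__module_address(addr);
			preempt_enable();
		}
		t1 = ktime_get();

		pr_info("%d calls to __module_address() took %lld ns\n",
			iters, ktime_to_ns(ktime_sub(t1, t0)));
		return 0;
	}
	module_init(mod_test_init);
	MODULE_LICENSE("GPL");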

Changes since last time:

 - depend on CONFIG_PERF_EVENTS || CONFIG_TRACING -- akpm
 - use within_module() for jump_labels -- rusty
 - minor comment changes -- Compudj, mingo

Rusty, please consider merging this.



* [PATCH v4 1/9] module: Sanitize RCU usage and locking
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-09  4:21   ` Lai Jiangshan
  2015-04-08 16:48 ` [PATCH v4 2/9] module: Annotate module version magic Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-sanitize-rcu.patch --]
[-- Type: text/plain, Size: 5648 bytes --]

Currently the RCU usage in the module code is an inconsistent mess of RCU and
RCU-sched; this is broken for CONFIG_PREEMPT, where synchronize_rcu()
does not imply synchronize_sched().

Most usage sites use preempt_{dis,en}able(), which is RCU-sched, but
(most of) the modification sites use synchronize_rcu(). The exception is
the module bug list, which actually uses RCU.

Convert everything over to RCU-sched.

Furthermore, add lockdep asserts to all sites, because it's not at all
clear to me that the required locking is observed, especially in exported
functions.
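
As an illustration of the pairing this converges on (a sketch, not part of
the patch): readers that rely on preempt_disable() are RCU-sched read
sides, so writers must wait with synchronize_sched():

	/* reader side, e.g. from a stack unwinder */
	static bool addr_in_module(unsigned long addr)
	{
		bool ret;

		preempt_disable();	/* RCU-sched read-side critical section */
		ret = __module_address(addr) != NULL;
		preempt_enable();

		return ret;
	}

	/* writer side, under module_mutex */
	list_del_rcu(&mod->list);
	synchronize_sched();		/* waits for all preempt-disabled readers */
	/* only now may mod->list / mod be freed */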

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   12 ++++++++++--
 kernel/module.c        |   42 ++++++++++++++++++++++++++++++++++--------
 lib/bug.c              |    7 +++++--
 3 files changed, 49 insertions(+), 12 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -415,14 +415,22 @@ struct symsearch {
 	bool unused;
 };
 
-/* Search for an exported symbol by name. */
+/*
+ * Search for an exported symbol by name.
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 const struct kernel_symbol *find_symbol(const char *name,
 					struct module **owner,
 					const unsigned long **crc,
 					bool gplok,
 					bool warn);
 
-/* Walk the exported symbol table */
+/*
+ * Walk the exported symbol table
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 				    struct module *owner,
 				    void *data), void *data);
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -106,6 +106,24 @@ static LIST_HEAD(modules);
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
 
+static void module_assert_mutex(void)
+{
+	lockdep_assert_held(&module_mutex);
+}
+
+static void module_assert_mutex_or_preempt(void)
+{
+#ifdef CONFIG_LOCKDEP
+	int rcu_held = rcu_read_lock_sched_held();
+	int mutex_held = 1;
+
+	if (debug_locks)
+		mutex_held = lockdep_is_held(&module_mutex);
+
+	WARN_ON(!rcu_held && !mutex_held);
+#endif
+}
+
 #ifdef CONFIG_MODULE_SIG
 #ifdef CONFIG_MODULE_SIG_FORCE
 static bool sig_enforce = true;
@@ -319,6 +337,8 @@ bool each_symbol_section(bool (*fn)(cons
 #endif
 	};
 
+	module_assert_mutex_or_preempt();
+
 	if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
 		return true;
 
@@ -458,6 +478,8 @@ static struct module *find_module_all(co
 {
 	struct module *mod;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -1856,8 +1878,8 @@ static void free_module(struct module *m
 	list_del_rcu(&mod->list);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
-	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
 
 	/* This may be NULL, but that's OK */
@@ -3106,11 +3128,11 @@ static noinline int do_init_module(struc
 	mod->init_text_size = 0;
 	/*
 	 * We want to free module_init, but be aware that kallsyms may be
-	 * walking this with preempt disabled.  In all the failure paths,
-	 * we call synchronize_rcu/synchronize_sched, but we don't want
-	 * to slow down the success path, so use actual RCU here.
+	 * walking this with preempt disabled.  In all the failure paths, we
+	 * call synchronize_sched, but we don't want to slow down the success
+	 * path, so use actual RCU here.
 	 */
-	call_rcu(&freeinit->rcu, do_free_init);
+	call_rcu_sched(&freeinit->rcu, do_free_init);
 	mutex_unlock(&module_mutex);
 	wake_up_all(&module_wq);
 
@@ -3368,8 +3390,8 @@ static int load_module(struct load_info
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
 	wake_up_all(&module_wq);
-	/* Wait for RCU synchronizing before releasing mod->list. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
  free_module:
 	/* Free lock-classes; relies on the preceding sync_rcu() */
@@ -3636,6 +3658,8 @@ int module_kallsyms_on_each_symbol(int (
 	unsigned int i;
 	int ret;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -3810,6 +3834,8 @@ struct module *__module_address(unsigned
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
+	module_assert_mutex_or_preempt();
+
 	list_for_each_entry_rcu(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -66,7 +66,7 @@ static const struct bug_entry *module_fi
 	struct module *mod;
 	const struct bug_entry *bug = NULL;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	list_for_each_entry_rcu(mod, &module_bug_list, bug_list) {
 		unsigned i;
 
@@ -77,7 +77,7 @@ static const struct bug_entry *module_fi
 	}
 	bug = NULL;
 out:
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 
 	return bug;
 }
@@ -88,6 +88,8 @@ void module_bug_finalize(const Elf_Ehdr
 	char *secstrings;
 	unsigned int i;
 
+	lockdep_assert_held(&module_mutex);
+
 	mod->bug_table = NULL;
 	mod->num_bugs = 0;
 
@@ -113,6 +115,7 @@ void module_bug_finalize(const Elf_Ehdr
 
 void module_bug_cleanup(struct module *mod)
 {
+	lockdep_assert_held(&module_mutex);
 	list_del_rcu(&mod->bug_list);
 }
 




* [PATCH v4 2/9] module: Annotate module version magic
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 1/9] module: Sanitize RCU usage and locking Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 3/9] module, jump_label: Fix module locking Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-fix-load_module.patch --]
[-- Type: text/plain, Size: 2807 bytes --]

Due to the new lockdep checks we now trigger the following warning:

[    9.759380] ------------[ cut here ]------------
[    9.759389] WARNING: CPU: 31 PID: 597 at ../kernel/module.c:216 each_symbol_section+0x121/0x130()
[    9.759391] Modules linked in:
[    9.759393] CPU: 31 PID: 597 Comm: modprobe Not tainted 4.0.0-rc1+ #65
[    9.759393] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[    9.759396]  ffffffff817d8676 ffff880424567ca8 ffffffff8157e98b 0000000000000001
[    9.759398]  0000000000000000 ffff880424567ce8 ffffffff8105fbc7 ffff880424567cd8
[    9.759400]  0000000000000000 ffffffff810ec160 ffff880424567d40 0000000000000000
[    9.759400] Call Trace:
[    9.759407]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[    9.759410]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[    9.759412]  [<ffffffff810ec160>] ? section_objs+0x60/0x60
[    9.759414]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[    9.759415]  [<ffffffff810ed9c1>] each_symbol_section+0x121/0x130
[    9.759417]  [<ffffffff810eda01>] find_symbol+0x31/0x70
[    9.759420]  [<ffffffff810ef5bf>] load_module+0x20f/0x2660
[    9.759422]  [<ffffffff8104ef10>] ? __do_page_fault+0x190/0x4e0
[    9.759426]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759427]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759433]  [<ffffffff810ae73d>] ? trace_hardirqs_on_caller+0x11d/0x1e0
[    9.759437]  [<ffffffff812fcc0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[    9.759439]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759441]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[    9.759443]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[    9.759445] ---[ end trace 9294429076a9c644 ]---

As per the comment, this site should be fine, but let's wrap it in
preempt_disable() anyhow to placate lockdep.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1191,11 +1191,17 @@ static inline int check_modstruct_versio
 {
 	const unsigned long *crc;
 
-	/* Since this should be found in kernel (which can't be removed),
-	 * no locking is necessary. */
+	/*
+	 * Since this should be found in kernel (which can't be removed), no
+	 * locking is necessary -- use preempt_disable() to placate lockdep.
+	 */
+	preempt_disable();
 	if (!find_symbol(VMLINUX_SYMBOL_STR(module_layout), NULL,
-			 &crc, true, false))
+			 &crc, true, false)) {
+		preempt_enable();
 		BUG();
+	}
+	preempt_enable();
 	return check_version(sechdrs, versindex,
 			     VMLINUX_SYMBOL_STR(module_layout), mod, crc,
 			     NULL);




* [PATCH v4 3/9] module, jump_label: Fix module locking
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 1/9] module: Sanitize RCU usage and locking Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 2/9] module: Annotate module version magic Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 4/9] rbtree: Make lockless searches non-fatal Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Jason Baron

[-- Attachment #1: peterz-module-fix-jump_label.patch --]
[-- Type: text/plain, Size: 2730 bytes --]

As per the module core lockdep annotations:

[   18.034047] ---[ end trace 9294429076a9c673 ]---
[   18.047760] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   18.059228]  ffffffff817d8676 ffff880036683c38 ffffffff8157e98b 0000000000000001
[   18.067541]  0000000000000000 ffff880036683c78 ffffffff8105fbc7 ffff880036683c68
[   18.075851]  ffffffffa0046b08 0000000000000000 ffffffffa0046d00 ffffffffa0046cc8
[   18.084173] Call Trace:
[   18.086906]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[   18.092649]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[   18.099361]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[   18.105880]  [<ffffffff810ee502>] __module_address+0x1d2/0x1e0
[   18.112400]  [<ffffffff81161153>] jump_label_module_notify+0x143/0x1e0
[   18.119710]  [<ffffffff810814bf>] notifier_call_chain+0x4f/0x70
[   18.126326]  [<ffffffff8108160e>] __blocking_notifier_call_chain+0x5e/0x90
[   18.134009]  [<ffffffff81081656>] blocking_notifier_call_chain+0x16/0x20
[   18.141490]  [<ffffffff810f0f00>] load_module+0x1b50/0x2660
[   18.147720]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[   18.154045]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[   18.160748] ---[ end trace 9294429076a9c674 ]---

The jump label code is not doing it right; fix this.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/jump_label.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -302,7 +302,7 @@ static int jump_label_add_module(struct
 			continue;
 
 		key = iterk;
-		if (__module_address(iter->key) == mod) {
+		if (within_module(iter->key, mod)) {
 			/*
 			 * Set key->entries to iter, but preserve JUMP_LABEL_TRUE_BRANCH.
 			 */
@@ -339,7 +339,7 @@ static void jump_label_del_module(struct
 
 		key = (struct static_key *)(unsigned long)iter->key;
 
-		if (__module_address(iter->key) == mod)
+		if (within_module(iter->key, mod))
 			continue;
 
 		prev = &key->next;
@@ -443,14 +443,16 @@ static void jump_label_update(struct sta
 {
 	struct jump_entry *stop = __stop___jump_table;
 	struct jump_entry *entry = jump_label_get_entries(key);
-
 #ifdef CONFIG_MODULES
-	struct module *mod = __module_address((unsigned long)key);
+	struct module *mod;
 
 	__jump_label_mod_update(key, enable);
 
+	preempt_disable();
+	mod = __module_address((unsigned long)key);
 	if (mod)
 		stop = mod->jump_entries + mod->num_jump_entries;
+	preempt_enable();
 #endif
 	/* if there are no users, entry can be NULL */
 	if (entry)




* [PATCH v4 4/9] rbtree: Make lockless searches non-fatal
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 3/9] module, jump_label: Fix module locking Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 5/9] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, David Woodhouse,
	Rik van Riel, Andrea Arcangeli, Michel Lespinasse

[-- Attachment #1: peterz-rbtree-lockless.patch --]
[-- Type: text/plain, Size: 10107 bytes --]

Change the insert and erase code such that lockless searches are
non-fatal.

In and of itself an rbtree cannot be correctly searched while it is being
modified; we can however provide weaker guarantees that will allow the
rbtree to be used in conjunction with other techniques, such as latches;
see 9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").

For this to work we need the following guarantees from the rbtree
code:

 1) a lockless reader must not see partial stores; this would allow it
    to observe nodes that are invalid memory.

 2) there must not be (temporary) loops in the tree structure in the
    modifier's program order; this would cause a lookup which
    interrupts the modifier to get stuck indefinitely.

For 1) we must use WRITE_ONCE() for all updates to the tree structure;
in particular this patch only does rb_{left,right}, as those are the only
elements required for simple searches.

This generates slightly worse code, probably because of the volatile
accesses. But in pointer-chasing heavy code a few extra instructions
should not matter.

For 2) I have carefully audited the code and drawn every intermediate
link state and not found a loop.
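
For illustration (not part of this patch), a lockless search relying only
on the two guarantees above could look roughly like the sketch below; the
reader only follows rb_left/rb_right, which is why those are the only
stores converted to WRITE_ONCE(). "struct thing" and its key are made up;
a real user could equally use rcu_dereference_raw() for the loads, as a
later patch in this series does.

	struct thing {
		unsigned long	key;
		struct rb_node	node;
	};

	/*
	 * May miss nodes or return stale ones, but never follows invalid
	 * memory and never loops.
	 */
	static struct thing *lockless_lookup(struct rb_root *root, unsigned long key)
	{
		struct rb_node *n = READ_ONCE(root->rb_node);

		while (n) {
			struct thing *t = rb_entry(n, struct thing, node);

			if (key < t->key)
				n = READ_ONCE(n->rb_left);
			else if (key > t->key)
				n = READ_ONCE(n->rb_right);
			else
				return t;
		}
		return NULL;
	}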

Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree.h           |   10 +++++
 include/linux/rbtree_augmented.h |   21 +++++++---
 lib/rbtree.c                     |   76 +++++++++++++++++++++++++++------------
 3 files changed, 78 insertions(+), 29 deletions(-)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -31,6 +31,7 @@
 
 #include <linux/kernel.h>
 #include <linux/stddef.h>
+#include <linux/rcupdate.h>
 
 struct rb_node {
 	unsigned long  __rb_parent_color;
@@ -85,6 +86,15 @@ static inline void rb_link_node(struct r
 	*rb_link = node;
 }
 
+static inline void rb_link_node_rcu(struct rb_node * node, struct rb_node * parent,
+				    struct rb_node ** rb_link)
+{
+	node->__rb_parent_color = (unsigned long)parent;
+	node->rb_left = node->rb_right = NULL;
+
+	rcu_assign_pointer(*rb_link, node);
+}
+
 #define rb_entry_safe(ptr, type, member) \
 	({ typeof(ptr) ____ptr = (ptr); \
 	   ____ptr ? rb_entry(____ptr, type, member) : NULL; \
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -123,11 +123,11 @@ __rb_change_child(struct rb_node *old, s
 {
 	if (parent) {
 		if (parent->rb_left == old)
-			parent->rb_left = new;
+			WRITE_ONCE(parent->rb_left, new);
 		else
-			parent->rb_right = new;
+			WRITE_ONCE(parent->rb_right, new);
 	} else
-		root->rb_node = new;
+		WRITE_ONCE(root->rb_node, new);
 }
 
 extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root,
@@ -137,7 +137,8 @@ static __always_inline struct rb_node *
 __rb_erase_augmented(struct rb_node *node, struct rb_root *root,
 		     const struct rb_augment_callbacks *augment)
 {
-	struct rb_node *child = node->rb_right, *tmp = node->rb_left;
+	struct rb_node *child = node->rb_right;
+	struct rb_node *tmp = node->rb_left;
 	struct rb_node *parent, *rebalance;
 	unsigned long pc;
 
@@ -167,6 +168,7 @@ __rb_erase_augmented(struct rb_node *nod
 		tmp = parent;
 	} else {
 		struct rb_node *successor = child, *child2;
+
 		tmp = child->rb_left;
 		if (!tmp) {
 			/*
@@ -180,6 +182,7 @@ __rb_erase_augmented(struct rb_node *nod
 			 */
 			parent = successor;
 			child2 = successor->rb_right;
+
 			augment->copy(node, successor);
 		} else {
 			/*
@@ -201,19 +204,23 @@ __rb_erase_augmented(struct rb_node *nod
 				successor = tmp;
 				tmp = tmp->rb_left;
 			} while (tmp);
-			parent->rb_left = child2 = successor->rb_right;
-			successor->rb_right = child;
+			child2 = successor->rb_right;
+			WRITE_ONCE(parent->rb_left, child2);
+			WRITE_ONCE(successor->rb_right, child);
 			rb_set_parent(child, successor);
+
 			augment->copy(node, successor);
 			augment->propagate(parent, successor);
 		}
 
-		successor->rb_left = tmp = node->rb_left;
+		tmp = node->rb_left;
+		WRITE_ONCE(successor->rb_left, tmp);
 		rb_set_parent(tmp, successor);
 
 		pc = node->__rb_parent_color;
 		tmp = __rb_parent(pc);
 		__rb_change_child(node, successor, tmp, root);
+
 		if (child2) {
 			successor->__rb_parent_color = pc;
 			rb_set_parent_color(child2, parent, RB_BLACK);
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -44,6 +44,30 @@
  *  parentheses and have some accompanying text comment.
  */
 
+/*
+ * Notes on lockless lookups:
+ *
+ * All stores to the tree structure (rb_left and rb_right) must be done using
+ * WRITE_ONCE(). And we must not inadvertently cause (temporary) loops in the
+ * tree structure as seen in program order.
+ *
+ * These two requirements will allow lockless iteration of the tree -- not
+ * correct iteration mind you, tree rotations are not atomic so a lookup might
+ * miss entire subtrees.
+ *
+ * But they do guarantee that any such traversal will only see valid elements
+ * and that it will indeed complete -- does not get stuck in a loop.
+ *
+ * It also guarantees that if the lookup returns an element it is the 'correct'
+ * one. But not returning an element does _NOT_ mean its not present.
+ *
+ * NOTE:
+ *
+ * Stores to __rb_parent_color are not important for simple lookups so those
+ * are left undone as of now. Nor did I check for loops involving parent
+ * pointers.
+ */
+
 static inline void rb_set_black(struct rb_node *rb)
 {
 	rb->__rb_parent_color |= RB_BLACK;
@@ -129,8 +153,9 @@ __rb_insert(struct rb_node *node, struct
 				 * This still leaves us in violation of 4), the
 				 * continuation into Case 3 will fix that.
 				 */
-				parent->rb_right = tmp = node->rb_left;
-				node->rb_left = parent;
+				tmp = node->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp);
+				WRITE_ONCE(node->rb_left, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -149,8 +174,8 @@ __rb_insert(struct rb_node *node, struct
 			 *     /                 \
 			 *    n                   U
 			 */
-			gparent->rb_left = tmp;  /* == parent->rb_right */
-			parent->rb_right = gparent;
+			WRITE_ONCE(gparent->rb_left, tmp); /* == parent->rb_right */
+			WRITE_ONCE(parent->rb_right, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -171,8 +196,9 @@ __rb_insert(struct rb_node *node, struct
 			tmp = parent->rb_left;
 			if (node == tmp) {
 				/* Case 2 - right rotate at parent */
-				parent->rb_left = tmp = node->rb_right;
-				node->rb_right = parent;
+				tmp = node->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp);
+				WRITE_ONCE(node->rb_right, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -183,8 +209,8 @@ __rb_insert(struct rb_node *node, struct
 			}
 
 			/* Case 3 - left rotate at gparent */
-			gparent->rb_right = tmp;  /* == parent->rb_left */
-			parent->rb_left = gparent;
+			WRITE_ONCE(gparent->rb_right, tmp); /* == parent->rb_left */
+			WRITE_ONCE(parent->rb_left, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -224,8 +250,9 @@ ____rb_erase_color(struct rb_node *paren
 				 *      / \         / \
 				 *     Sl  Sr      N   Sl
 				 */
-				parent->rb_right = tmp1 = sibling->rb_left;
-				sibling->rb_left = parent;
+				tmp1 = sibling->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp1);
+				WRITE_ONCE(sibling->rb_left, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -275,9 +302,10 @@ ____rb_erase_color(struct rb_node *paren
 				 *                       \
 				 *                        Sr
 				 */
-				sibling->rb_left = tmp1 = tmp2->rb_right;
-				tmp2->rb_right = sibling;
-				parent->rb_right = tmp2;
+				tmp1 = tmp2->rb_right;
+				WRITE_ONCE(sibling->rb_left, tmp1);
+				WRITE_ONCE(tmp2->rb_right, sibling);
+				WRITE_ONCE(parent->rb_right, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -297,8 +325,9 @@ ____rb_erase_color(struct rb_node *paren
 			 *        / \         / \
 			 *      (sl) sr      N  (sl)
 			 */
-			parent->rb_right = tmp2 = sibling->rb_left;
-			sibling->rb_left = parent;
+			tmp2 = sibling->rb_left;
+			WRITE_ONCE(parent->rb_right, tmp2);
+			WRITE_ONCE(sibling->rb_left, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);
@@ -310,8 +339,9 @@ ____rb_erase_color(struct rb_node *paren
 			sibling = parent->rb_left;
 			if (rb_is_red(sibling)) {
 				/* Case 1 - right rotate at parent */
-				parent->rb_left = tmp1 = sibling->rb_right;
-				sibling->rb_right = parent;
+				tmp1 = sibling->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp1);
+				WRITE_ONCE(sibling->rb_right, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -336,9 +366,10 @@ ____rb_erase_color(struct rb_node *paren
 					break;
 				}
 				/* Case 3 - right rotate at sibling */
-				sibling->rb_right = tmp1 = tmp2->rb_left;
-				tmp2->rb_left = sibling;
-				parent->rb_left = tmp2;
+				tmp1 = tmp2->rb_left;
+				WRITE_ONCE(sibling->rb_right, tmp1);
+				WRITE_ONCE(tmp2->rb_left, sibling);
+				WRITE_ONCE(parent->rb_left, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -347,8 +378,9 @@ ____rb_erase_color(struct rb_node *paren
 				sibling = tmp2;
 			}
 			/* Case 4 - left rotate at parent + color flips */
-			parent->rb_left = tmp2 = sibling->rb_right;
-			sibling->rb_right = parent;
+			tmp2 = sibling->rb_right;
+			WRITE_ONCE(parent->rb_left, tmp2);
+			WRITE_ONCE(sibling->rb_right, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);




* [PATCH v4 5/9] seqlock: Better document raw_write_seqcount_latch()
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 4/9] rbtree: Make lockless searches non-fatal Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 6/9] rbtree: Implement generic latch_tree Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse

[-- Attachment #1: peterz-latch-comment.patch --]
[-- Type: text/plain, Size: 4960 bytes --]

Improve the documentation of the latch technique as used in the
current timekeeping code, such that it can be readily employed
elsewhere.

Borrow from the comments in timekeeping and replace those with a
reference to this more generic comment.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/seqlock.h   |   76 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/time/timekeeping.c |   27 ----------------
 2 files changed, 76 insertions(+), 27 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -233,9 +233,83 @@ static inline void raw_write_seqcount_en
 	s->sequence++;
 }
 
-/*
+/**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
+ *
+ * The latch technique is a multiversion concurrency control method that allows
+ * queries during non atomic modifications. If you can guarantee queries never
+ * interrupt the modification -- e.g. the concurrency is strictly between CPUs
+ * -- you most likely do not need this.
+ *
+ * Where the traditional RCU/lockless data structures rely on atomic
+ * modifications to ensure queries observe either the old or the new state the
+ * latch allows the same for non atomic updates. The trade-off is doubling the
+ * cost of storage; we have to maintain two copies of the entire data
+ * structure.
+ *
+ * Very simply put: we first modify one copy and then the other. This ensures
+ * there is always one copy in a stable state, ready to give us an answer.
+ *
+ * The basic form is a data structure like:
+ *
+ * struct latch_struct {
+ *	seqcount_t		seq;
+ *	struct data_struct	data[2];
+ * };
+ *
+ * Where a modification, which is assumed to be externally serialized, does the
+ * following:
+ *
+ * void latch_modify(struct latch_struct *latch, ...)
+ * {
+ *	smp_wmb();	<- Ensure that the last data[1] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[0], ...);
+ *
+ *	smp_wmb();	<- Ensure that the data[0] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[1], ...);
+ * }
+ *
+ * The query will have a form like:
+ *
+ * struct entry *latch_query(struct latch_struct *latch, ...)
+ * {
+ *	struct entry *entry;
+ *	unsigned seq, idx;
+ *
+ *	do {
+ *		seq = latch->seq;
+ *		smp_rmb();
+ *
+ *		idx = seq & 0x01;
+ *		entry = data_query(latch->data[idx], ...);
+ *
+ *		smp_rmb();
+ *	} while (seq != latch->seq);
+ *
+ *	return entry;
+ * }
+ *
+ * So during the modification, queries are first redirected to data[1]. Then we
+ * modify data[0]. When that is complete, we redirect queries back to data[0]
+ * and we can modify data[1].
+ *
+ * NOTE: The non-requirement for atomic modifications does _NOT_ include
+ *       the publishing of new entries in the case where data is a dynamic
+ *       data structure.
+ *
+ *       An iteration might start in data[0] and get suspended long enough
+ *       to miss an entire modification sequence, once it resumes it might
+ *       observe the new entry.
+ *
+ * NOTE: When data is a dynamic data structure; one should use regular RCU
+ *       patterns to manage the lifetimes of the objects within.
  */
 static inline void raw_write_seqcount_latch(seqcount_t *s)
 {
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -339,32 +339,7 @@ static inline s64 timekeeping_get_ns_raw
  * We want to use this from any context including NMI and tracing /
  * instrumenting the timekeeping code itself.
  *
- * So we handle this differently than the other timekeeping accessor
- * functions which retry when the sequence count has changed. The
- * update side does:
- *
- * smp_wmb();	<- Ensure that the last base[1] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[0], tkr);
- * smp_wmb();	<- Ensure that the base[0] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[1], tkr);
- *
- * The reader side does:
- *
- * do {
- *	seq = tkf->seq;
- *	smp_rmb();
- *	idx = seq & 0x01;
- *	now = now(tkf->base[idx]);
- *	smp_rmb();
- * } while (seq != tkf->seq)
- *
- * As long as we update base[0] readers are forced off to
- * base[1]. Once base[0] is updated readers are redirected to base[0]
- * and the base[1] update takes place.
+ * Employ the latch technique; see @raw_write_seqcount_latch.
  *
  * So if a NMI hits the update of base[0] then it will use base[1]
  * which is still consistent. In the worst case this can result is a




* [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 5/9] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-09  8:09   ` Lai Jiangshan
  2015-04-08 16:48 ` [PATCH v4 7/9] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

[-- Attachment #1: peterz-rbtree-latch.patch --]
[-- Type: text/plain, Size: 8020 bytes --]

Implement a latched RB-tree in order to get unconditional RCU/lockless
lookups.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_latch.h |  223 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)

--- /dev/null
+++ b/include/linux/rbtree_latch.h
@@ -0,0 +1,223 @@
+/*
+ * Latched RB-trees
+ *
+ * Copyright (C) 2015 Intel Corp., Peter Zijlstra <peterz@infradead.org>
+ *
+ * Since RB-trees have non atomic modifications they're not immediately suited
+ * for RCU/lockless queries. Even though we made RB tree lookups non-fatal for
+ * lockless lookups; we cannot guarantee they return a correct result.
+ *
+ * The simplest solution is a seqlock + rb-tree, this will allow lockless
+ * lookups; but has the constraint (inherent to the seqlock) that read sides
+ * cannot nest in write sides.
+ *
+ * If we need to allow unconditional lookups (say as required for NMI context
+ * usage) we need a more complex setup; this data structure provides this by
+ * employing the latch technique -- see @raw_write_seqcount_latch -- to
+ * implement a latched RB-tree which does allow for unconditional lookups by
+ * virtue of always having (at least) one stable copy of the tree.
+ *
+ * However, while we have the guarantee that there is at all times one stable
+ * copy, this does not guarantee an iteration will not observe modifications.
+ * What might have been a stable copy at the start of the iteration, need not
+ * remain so for the duration of the iteration.
+ *
+ * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
+ * see the comment in lib/rbtree.c. Note however that we only require the first
+ * condition -- not seeing partial stores -- because the latch thing isolates
+ * us from loops. If we were to interrupt a modification the lookup would be
+ * pointed at the stable tree and complete while the modification was halted.
+ */
+
+#ifndef RB_TREE_LATCH_H
+#define RB_TREE_LATCH_H
+
+#include <linux/rbtree.h>
+#include <linux/seqlock.h>
+
+struct latch_tree_node {
+	/*
+	 * Because we have an array of two entries in struct latch_tree_nodes
+	 * it's not possible to use container_of() to get back to the
+	 * encapsulating structure; therefore we have to put in a back pointer.
+	 */
+	void		*priv;
+	struct rb_node	node;
+};
+
+struct latch_tree_nodes {
+	struct latch_tree_node node[2];
+};
+
+struct latch_tree_root {
+	seqcount_t	seq;
+	struct rb_root	tree[2];
+};
+
+/**
+ * latch_tree_ops - operators to define the tree order
+ * @less: used for insertion; provides the (partial) order between two elements.
+ * @comp: used for lookups; provides the order between the search key and an element.
+ *
+ * The operators are related like:
+ *
+ *	comp(a->key,b) < 0  := less(a,b)
+ *	comp(a->key,b) > 0  := less(b,a)
+ *	comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
+ *
+ * If these operators define a partial order on the elements we make no
+ * guarantee on which of the elements matching the key is found. See
+ * latch_tree_find().
+ */
+struct latch_tree_ops {
+	bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b);
+	int  (*comp)(void *key,                 struct latch_tree_node *b);
+};
+
+static __always_inline void
+__lt_insert(struct latch_tree_node *ltn, struct rb_root *root,
+	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
+{
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct latch_tree_node *ltp;
+
+	while (*link) {
+		parent = *link;
+		ltp = container_of(parent, struct latch_tree_node, node);
+
+		if (less(ltn, ltp))
+			link = &parent->rb_left;
+		else
+			link = &parent->rb_right;
+	}
+
+	rb_link_node_rcu(&ltn->node, parent, link);
+	rb_insert_color(&ltn->node, root);
+}
+
+static __always_inline void
+__lt_erase(struct latch_tree_node *ltn, struct rb_root *root)
+{
+	rb_erase(&ltn->node, root);
+}
+
+static __always_inline struct latch_tree_node *
+__lt_find(void *key, struct rb_root *root,
+	  int (*comp)(void *key, struct latch_tree_node *ltn))
+{
+	struct rb_node *n = rcu_dereference_raw(root->rb_node);
+	struct latch_tree_node *ltn;
+	int c;
+
+	while (n) {
+		ltn = container_of(n, struct latch_tree_node, node);
+		c = comp(key, ltn);
+
+		if (c < 0)
+			n = rcu_dereference_raw(n->rb_left);
+		else if (c > 0)
+			n = rcu_dereference_raw(n->rb_right);
+		else
+			return ltn;
+	}
+
+	return NULL;
+}
+
+/**
+ * latch_tree_insert() - insert @nodes into the trees @root
+ * @nodes: nodes to insert
+ * @root: trees to insert @nodes into
+ * @priv: pointer to node private data
+ * @ops: operators defining the node order
+ *
+ * Initializes @nodes private pointer with @priv; which typically is a back
+ * pointer to the containing structure, used by @ops to find the search key.
+ *
+ * Then it inserts @nodes into @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * The inserts use rcu_assign_pointer() to publish the element such that
+ * both the @priv pointer values in @nodes as the tree structure is stored
+ * before we can observe the new @nodes.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_insert(struct latch_tree_nodes *nodes,
+		  struct latch_tree_root *root,
+		  void *priv,
+		  const struct latch_tree_ops *ops)
+{
+	nodes->node[0].priv = nodes->node[1].priv = priv;
+
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(&nodes->node[0], &root->tree[0], ops->less);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(&nodes->node[1], &root->tree[1], ops->less);
+}
+
+/**
+ * latch_tree_erase() - removes @nodes from the trees @root
+ * @nodes: nodes to remote
+ * @root: trees to remove @nodes from
+ * @ops: operators defining the node order
+ *
+ * Removes @nodes from the trees @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * It is assumed that @nodes will observe one RCU quiescent state before being
+ * reused of freed.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_erase(struct latch_tree_nodes *nodes,
+		 struct latch_tree_root *root,
+		 const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(&nodes->node[0], &root->tree[0]);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(&nodes->node[1], &root->tree[1]);
+}
+
+/**
+ * latch_tree_find() - find the node matching @key in the trees @root
+ * @key: search key
+ * @root: trees to search for @key
+ * @ops: operators defining the node order
+ *
+ * Does a lockless lookup in the trees @root for the node matching @key.
+ *
+ * It is assumed that this is called while holding the appropriate RCU read
+ * side lock.
+ *
+ * If the operators define a partial order on the elements (there are multiple
+ * elements which have the same key value) it is undefined which of these
+ * elements will be found. Nor is it possible to iterate the tree to find
+ * further elements with the same key value.
+ *
+ * Returns: a pointer to the node matching @key or NULL.
+ */
+static __always_inline struct latch_tree_node *
+latch_tree_find(void *key, struct latch_tree_root *root,
+		const struct latch_tree_ops *ops)
+{
+	struct latch_tree_node *node;
+	unsigned int seq;
+
+	do {
+		seq = raw_read_seqcount(&root->seq);
+		node = __lt_find(key, &root->tree[seq & 1], ops->comp);
+	} while (read_seqcount_retry(&root->seq, seq));
+
+	return node;
+}
+
+#endif /* RB_TREE_LATCH_H */
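
Purely for illustration (not part of the patch; "struct foo" and its key
field are made up -- the real in-tree user is kernel/module.c in a later
patch of this series), a user of this API would look roughly like:

	struct foo {
		unsigned long		key;
		struct latch_tree_nodes	nodes;
	};

	static bool foo_less(struct latch_tree_node *a, struct latch_tree_node *b)
	{
		return ((struct foo *)a->priv)->key < ((struct foo *)b->priv)->key;
	}

	static int foo_comp(void *key, struct latch_tree_node *n)
	{
		unsigned long k = (unsigned long)key;
		struct foo *f = n->priv;

		if (k < f->key)
			return -1;
		if (k > f->key)
			return 1;
		return 0;
	}

	static const struct latch_tree_ops foo_ops = {
		.less = foo_less,
		.comp = foo_comp,
	};

	static struct latch_tree_root foo_root;

	/* insertions and removals must be externally serialized */
	static void foo_insert(struct foo *f)
	{
		latch_tree_insert(&f->nodes, &foo_root, f, &foo_ops);
	}

	/* lookups may run from any context, including NMI */
	static struct foo *foo_find(unsigned long key)
	{
		struct latch_tree_node *ltn;

		ltn = latch_tree_find((void *)key, &foo_root, &foo_ops);
		return ltn ? ltn->priv : NULL;
	}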




* [PATCH v4 7/9] module: Optimize __module_address() using a latched RB-tree
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 6/9] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 8/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-latch-tree.patch --]
[-- Type: text/plain, Size: 6634 bytes --]

Currently __module_address() uses a linear search through all modules in
order to find the module corresponding to the provided address. With many
modules loaded this can take a significant amount of time.

One of the users of this is kernel_text_address(), which is employed by
many stack unwinders, which in turn are used by perf-callchain and
ftrace (possibly from NMI context).

So by optimizing __module_address() we optimize many stack unwinders
which are used by both perf and tracing in performance-sensitive code.
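
Roughly, the hot path being targeted is the following call chain
(paraphrased for illustration, not verbatim kernel source):

	stack unwinder (perf callchain, ftrace)
	  -> kernel_text_address(addr)
	       -> core_kernel_text(addr)          /* cheap range checks */
	       -> is_module_text_address(addr)
	            -> __module_address(addr)     /* was: linear list walk;
	                                             now: latched RB-tree lookup */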

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   22 ++++++++--
 kernel/module.c        |  107 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 121 insertions(+), 8 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include <linux/moduleparam.h>
 #include <linux/jump_label.h>
 #include <linux/export.h>
+#include <linux/rbtree_latch.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -269,8 +270,15 @@ struct module {
 	/* Startup function. */
 	int (*init)(void);
 
-	/* If this is non-NULL, vfree after init() returns */
-	void *module_init;
+	/*
+	 * If this is non-NULL, vfree after init() returns.
+	 *
+	 * Cacheline align here, such that:
+	 *   module_init, module_core, init_size, core_size,
+	 *   init_text_size, core_text_size and ltn_core.node[0]
+	 * are on the same cacheline.
+	 */
+	void *module_init	____cacheline_aligned;
 
 	/* Here is the actual code + data, vfree'd on unload. */
 	void *module_core;
@@ -281,6 +289,14 @@ struct module {
 	/* The size of the executable code in each section.  */
 	unsigned int init_text_size, core_text_size;
 
+	/*
+	 * We rely on the order of these two entries; not only do we want
+	 * core.node[0] to be in the same cacheline as the above entries,
+	 * we also assume ltn_init has a higher address than ltn_core.
+	 */
+	struct latch_tree_nodes	ltn_core;
+	struct latch_tree_nodes	ltn_init;
+
 	/* Size of RO sections of the module (text+rodata) */
 	unsigned int init_ro_size, core_ro_size;
 
@@ -361,7 +377,7 @@ struct module {
 	ctor_fn_t *ctors;
 	unsigned int num_ctors;
 #endif
-};
+} ____cacheline_aligned;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
 #endif
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -102,6 +102,99 @@
 DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
+
+/*
+ * Use a latched RB-tree for __module_address(); this allows us to use
+ * RCU-sched lookups of the address from any context.
+ *
+ * Because modules have two address ranges: init and core, we need two
+ * latch_tree_nodes entries. We use the order they appear in struct module to
+ * determine if we need to use the init or core values for the comparisons.
+ */
+
+static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
+{
+	struct module *mod = n->priv;
+
+	if (n >= mod->ltn_init.node)
+		return (unsigned long)mod->module_init;
+	else
+		return (unsigned long)mod->module_core;
+}
+
+static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
+{
+	struct module *mod = n->priv;
+
+	if (n >= mod->ltn_init.node)
+		return (unsigned long)mod->init_size;
+	else
+		return (unsigned long)mod->core_size;
+}
+
+static __always_inline bool
+mod_tree_less(struct latch_tree_node *a, struct latch_tree_node *b)
+{
+	return __mod_tree_val(a) < __mod_tree_val(b);
+}
+
+static __always_inline int
+mod_tree_comp(void *key, struct latch_tree_node *n)
+{
+	unsigned long val = (unsigned long)key;
+	unsigned long start, end;
+
+	end = start = __mod_tree_val(n);
+	end += __mod_tree_size(n);
+
+	if (val < start)
+		return -1;
+
+	if (val >= end)
+		return 1;
+
+	return 0;
+}
+
+static const struct latch_tree_ops mod_tree_ops = {
+	.less = mod_tree_less,
+	.comp = mod_tree_comp,
+};
+
+static struct latch_tree_root mod_tree __cacheline_aligned;
+
+/*
+ * These modifications: insert, remove_init and remove; are serialized by the
+ * module_mutex.
+ */
+static void mod_tree_insert(struct module *mod)
+{
+	latch_tree_insert(&mod->ltn_core, &mod_tree, mod, &mod_tree_ops);
+	if (mod->init_size)
+		latch_tree_insert(&mod->ltn_init, &mod_tree, mod, &mod_tree_ops);
+}
+
+static void mod_tree_remove_init(struct module *mod)
+{
+	if (mod->init_size)
+		latch_tree_erase(&mod->ltn_init, &mod_tree, &mod_tree_ops);
+}
+
+static void mod_tree_remove(struct module *mod)
+{
+	latch_tree_erase(&mod->ltn_core, &mod_tree, &mod_tree_ops);
+	mod_tree_remove_init(mod);
+}
+
+static struct module *mod_tree_find(unsigned long addr)
+{
+	struct latch_tree_node *ltn;
+
+	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+
+	return ltn ? ltn->priv : NULL;
+}
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -1876,6 +1969,7 @@ static void free_module(struct module *m
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
 	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
@@ -3120,6 +3214,7 @@ static noinline int do_init_module(struc
 	mod->symtab = mod->core_symtab;
 	mod->strtab = mod->core_strtab;
 #endif
+	mod_tree_remove_init(mod);
 	unset_module_init_ro_nx(mod);
 	module_arch_freeing_init(mod);
 	mod->module_init = NULL;
@@ -3190,6 +3285,7 @@ static int add_unformed_module(struct mo
 		goto out;
 	}
 	list_add_rcu(&mod->list, &modules);
+	mod_tree_insert(mod);
 	err = 0;
 
 out:
@@ -3389,6 +3485,7 @@ static int load_module(struct load_info
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	wake_up_all(&module_wq);
 	/* Wait for RCU-sched synchronizing before releasing mod->list. */
 	synchronize_sched();
@@ -3833,13 +3930,13 @@ struct module *__module_address(unsigned
 
 	module_assert_mutex_or_preempt();
 
-	list_for_each_entry_rcu(mod, &modules, list) {
+	mod = mod_tree_find(addr);
+	if (mod) {
+		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod))
-			return mod;
+			mod = NULL;
 	}
-	return NULL;
+	return mod;
 }
 EXPORT_SYMBOL_GPL(__module_address);
 




* [PATCH v4 8/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (6 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 7/9] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 16:48 ` [PATCH v4 9/9] module: Use __module_address() for module_address_lookup() Peter Zijlstra
  2015-04-08 22:08 ` [PATCH v4 0/9] latched RB-trees and __module_address() Andi Kleen
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Andrew Morton

[-- Attachment #1: peterz-modtree-optional.patch --]
[-- Type: text/plain, Size: 2497 bytes --]

Andrew worried about the overhead on small systems, so only use the fancy
code when either perf or tracing is enabled.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -102,6 +102,8 @@ DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_TRACING)
+
 /*
  * Use a latched RB-tree for __module_address(); this allows us to use
  * RCU-sched lookups of the address from any context.
@@ -109,6 +111,10 @@ static LIST_HEAD(modules);
  * Because modules have two address ranges: init and core, we need two
  * latch_tree_nodes entries. We use the order they appear in struct module to
  * determine if we need to use the init or core values for the comparisons.
+ *
+ * This is conditional on PERF_EVENTS || TRACING because those can really hit
+ * __module_address() hard by doing a lot of stack unwinding; potentially from
+ * NMI context.
  */
 
 static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
@@ -185,7 +191,7 @@ static void mod_tree_remove(struct modul
 	mod_tree_remove_init(mod);
 }
 
-static struct module *mod_tree_find(unsigned long addr)
+static struct module *mod_find(unsigned long addr)
 {
 	struct latch_tree_node *ltn;
 
@@ -194,6 +200,26 @@ static struct module *mod_tree_find(unsi
 	return ltn ? ltn->priv : NULL;
 }
 
+#else /* !(PERF_EVENTS || TRACING) */
+
+static void mod_tree_insert(struct module *mod) { }
+static void mod_tree_remove_init(struct module *mod) { }
+static void mod_tree_remove(struct module *mod) { }
+
+static struct module *mod_find(unsigned long addr)
+{
+	struct module *mod;
+
+	list_for_each_entry_rcu(mod, &modules, list) {
+		if (within_module(addr, mod))
+			return mod;
+	}
+
+	return NULL;
+}
+
+#endif /* !(PERF_EVENTS || TRACING) */
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -3939,7 +3965,7 @@ struct module *__module_address(unsigned
 
 	module_assert_mutex_or_preempt();
 
-	mod = mod_tree_find(addr);
+	mod = mod_find(addr);
 	if (mod) {
 		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)




* [PATCH v4 9/9] module: Use __module_address() for module_address_lookup()
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (7 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 8/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
@ 2015-04-08 16:48 ` Peter Zijlstra
  2015-04-08 22:08 ` [PATCH v4 0/9] latched RB-trees and __module_address() Andi Kleen
  9 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-08 16:48 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-use-__module_address.patch --]
[-- Type: text/plain, Size: 1127 bytes --]

Use the generic __module_address() addr-to-struct-module lookup instead
of open-coding it once more.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3515,19 +3515,15 @@ const char *module_address_lookup(unsign
 			    char **modname,
 			    char *namebuf)
 {
-	struct module *mod;
 	const char *ret = NULL;
+	struct module *mod;
 
 	preempt_disable();
-	list_for_each_entry_rcu(mod, &modules, list) {
-		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod)) {
-			if (modname)
-				*modname = mod->name;
-			ret = get_ksymbol(mod, addr, size, offset);
-			break;
-		}
+	mod = __module_address(addr);
+	if (mod) {
+		if (modname)
+			*modname = mod->name;
+		ret = get_ksymbol(mod, addr, size, offset);
 	}
 	/* Make a copy in here where it's safe */
 	if (ret) {
@@ -3535,6 +3531,7 @@ const char *module_address_lookup(unsign
 		ret = namebuf;
 	}
 	preempt_enable();
+
 	return ret;
 }
 




* Re: [PATCH v4 0/9] latched RB-trees and __module_address()
  2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
                   ` (8 preceding siblings ...)
  2015-04-08 16:48 ` [PATCH v4 9/9] module: Use __module_address() for module_address_lookup() Peter Zijlstra
@ 2015-04-08 22:08 ` Andi Kleen
  9 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2015-04-08 22:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx


For the record, I still think it's better to just look it up in the
kernel page tables, which already maintain that information. This is a much
simpler, much less risky patch. It works on all architectures that use page
tables for the kernel and have an X bit in them. The rest are likely
obscure and won't have a lot of modules.

My patch implementing this for x86 is here:

http://lkml.iu.edu/hypermail/linux/kernel/1410.1/01879.html

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH v4 1/9] module: Sanitize RCU usage and locking
  2015-04-08 16:48 ` [PATCH v4 1/9] module: Sanitize RCU usage and locking Peter Zijlstra
@ 2015-04-09  4:21   ` Lai Jiangshan
  2015-04-09 12:36     ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2015-04-09  4:21 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx

On 04/09/2015 12:48 AM, Peter Zijlstra wrote:

> +static void module_assert_mutex_or_preempt(void)
> +{
> +#ifdef CONFIG_LOCKDEP
> +	int rcu_held = rcu_read_lock_sched_held();
> +	int mutex_held = 1;
> +
> +	if (debug_locks)
> +		mutex_held = lockdep_is_held(&module_mutex);
> +
> +	WARN_ON(!rcu_held && !mutex_held);
> +#endif
> +}

Is rcu_lockdep_assert() suitable for this?
(Note that rcu_lockdep_assert() only works when CONFIG_PROVE_RCU is enabled.)
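
A sketch of what that suggestion might look like (illustrative only,
assuming the two-argument rcu_lockdep_assert(cond, msg) interface of this
era):

	static void module_assert_mutex_or_preempt(void)
	{
		rcu_lockdep_assert(rcu_read_lock_sched_held() ||
				   lockdep_is_held(&module_mutex),
				   "neither module_mutex held nor preemption disabled");
	}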



* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-08 16:48 ` [PATCH v4 6/9] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-04-09  8:09   ` Lai Jiangshan
  2015-04-09  8:14     ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2015-04-09  8:09 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On 04/09/2015 12:48 AM, Peter Zijlstra wrote:

> +
> +struct latch_tree_node {
> +	/*
> +	 * Because we have an array of two entries in struct latch_tree_nodes
> +	 * it's not possible to use container_of() to get back to the
> +	 * encapsulating structure; therefore we have to put in a back pointer.
> +	 */
> +	void		*priv;
> +	struct rb_node	node;
> +};

I don't think @priv is strictly needed. It is possible to use container_of()
to go back; @priv is not even used in this file (except for the initialization).

First, we can use container_of() to find the encapsulating structure from the
struct latch_tree_nodes.

Second, we can add an @idx argument to __lt_insert() and __lt_find(), so we can
find the encapsulating latch_tree_nodes from an rb_node or latch_tree_node,
and have struct latch_tree_ops take latch_tree_nodes instead.

Did I miss anything? If @priv can indeed be removed, removing it will
simplify this file, but it may add a little more code in module.c, where
ltn_core and ltn_init can no longer share the same ->less() and ->comp().

If you do remove @priv, please also use rb_node instead of old latch_tree_node
and rename old latch_tree_nodes to latch_tree_node.
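
A sketch of the kind of helper this implies, against the struct
latch_tree_node / latch_tree_nodes layout quoted above (illustrative name;
not the variant Peter posts further down the thread, which merges the two
structs instead):

  /*
   * Map an embedded rb_node back to its latch_tree_nodes without @priv;
   * @idx says which of the two per-object nodes this rb_node is.
   */
  static __always_inline struct latch_tree_nodes *
  __lt_nodes(struct rb_node *rb, int idx)
  {
  	struct latch_tree_node *ltn =
  		container_of(rb, struct latch_tree_node, node);

  	return container_of(ltn, struct latch_tree_nodes, node[idx]);
  }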


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09  8:09   ` Lai Jiangshan
@ 2015-04-09  8:14     ` Peter Zijlstra
  2015-04-09  8:55       ` Lai Jiangshan
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-09  8:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 09, 2015 at 04:09:07PM +0800, Lai Jiangshan wrote:
> On 04/09/2015 12:48 AM, Peter Zijlstra wrote:
> 
> > +
> > +struct latch_tree_node {
> > +	/*
> > +	 * Because we have an array of two entries in struct latch_tree_nodes
> > +	 * it's not possible to use container_of() to get back to the
> > +	 * encapsulating structure; therefore we have to put in a back pointer.
> > +	 */
> > +	void		*priv;
> > +	struct rb_node	node;
> > +};
> 
> I don't think @priv is strictly needed. It is possible to use container_of()
> to go back. @priv is even not used in this file (except the initialization).
> 
> First, we can use container_of() to find encapsulating structure from the
> struct latch_tree_nodeS.
> 
> Second, we can add a @idx to __lt_insert() and __lt_find(), 

insert yes, find no. Remember that both nodes are in the _same_ tree.

There is no way of knowing if a tree node is an init or core node while
iterating.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09  8:14     ` Peter Zijlstra
@ 2015-04-09  8:55       ` Lai Jiangshan
  2015-04-09 12:13         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2015-04-09  8:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On 04/09/2015 04:14 PM, Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 04:09:07PM +0800, Lai Jiangshan wrote:
>> On 04/09/2015 12:48 AM, Peter Zijlstra wrote:
>>
>>> +
>>> +struct latch_tree_node {
>>> +	/*
>>> +	 * Because we have an array of two entries in struct latch_tree_nodes
>>> +	 * it's not possible to use container_of() to get back to the
>>> +	 * encapsulating structure; therefore we have to put in a back pointer.
>>> +	 */
>>> +	void		*priv;
>>> +	struct rb_node	node;
>>> +};
>>
>> I don't think @priv is strictly needed. It is possible to use container_of()
>> to go back. @priv is even not used in this file (except the initialization).
>>
>> First, we can use container_of() to find encapsulating structure from the
>> struct latch_tree_nodeS.
>>
>> Second, we can add a @idx to __lt_insert() and __lt_find(), 
> 
> insert yes, find no. Remember that both nodes are in the _same_ tree.
> 
> There is no way of knowing if a tree node is an init or core node while
> iterating.
> .
> 

This sentence is talking about module.c, not latch_tree.h, so I guess
it is the user's (module.c's) problem, not latch_tree.h's problem.

The user (module.c) can wrap struct latch_tree_nodes and add @priv itself.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09  8:55       ` Lai Jiangshan
@ 2015-04-09 12:13         ` Peter Zijlstra
  2015-04-09 12:28           ` Peter Zijlstra
  2015-04-09 16:31           ` Linus Torvalds
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-09 12:13 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 09, 2015 at 04:55:05PM +0800, Lai Jiangshan wrote:
> This sentence is talking about module.c not latch_tree.h. So I guess
> it is user(module.c)'s problem, not latch_tree.h's problem.
> 
> The user(module.c) can wrap the struct latch_tree_nodes and add @priv.

OK, took me a while to figure out exactly what and how. You mean
something like this, right?

(compile-tested only)

---
 include/linux/module.h       |   11 ++++-
 include/linux/rbtree_latch.h |   93 ++++++++++++++++++-------------------------
 kernel/module.c              |   33 +++++++++------
 3 files changed, 70 insertions(+), 67 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -211,6 +211,13 @@ enum module_state {
 	MODULE_STATE_UNFORMED,	/* Still setting it up. */
 };
 
+struct module;
+
+struct mod_tree_node {
+	struct latch_tree_node node;
+	struct module *mod;
+};
+
 struct module {
 	enum module_state state;
 
@@ -294,8 +301,8 @@ struct module {
 	 * core.node[0] to be in the same cacheline as the above entries,
 	 * we also assume ltn_init has a higher address than ltn_core.
 	 */
-	struct latch_tree_nodes	ltn_core;
-	struct latch_tree_nodes	ltn_init;
+	struct mod_tree_node	mtn_core;
+	struct mod_tree_node	mtn_init;
 
 	/* Size of RO sections of the module (text+rodata) */
 	unsigned int init_ro_size, core_ro_size;
--- a/include/linux/rbtree_latch.h
+++ b/include/linux/rbtree_latch.h
@@ -36,17 +36,7 @@
 #include <linux/seqlock.h>
 
 struct latch_tree_node {
-	/*
-	 * Because we have an array of two entries in struct latch_tree_nodes
-	 * it's not possible to use container_of() to get back to the
-	 * encapsulating structure; therefore we have to put in a back pointer.
-	 */
-	void		*priv;
-	struct rb_node	node;
-};
-
-struct latch_tree_nodes {
-	struct latch_tree_node node[2];
+	struct rb_node node[2];
 };
 
 struct latch_tree_root {
@@ -74,17 +64,25 @@ struct latch_tree_ops {
 	int  (*comp)(void *key,                 struct latch_tree_node *b);
 };
 
+static __always_inline struct latch_tree_node *
+__lt_from_rb(struct rb_node *node, int idx)
+{
+	return container_of(node, struct latch_tree_node, node[idx]);
+}
+
 static __always_inline void
-__lt_insert(struct latch_tree_node *ltn, struct rb_root *root,
+__lt_insert(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx,
 	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
 {
+	struct rb_root *root = &ltr->tree[idx];
 	struct rb_node **link = &root->rb_node;
+	struct rb_node *node = &ltn->node[idx];
 	struct rb_node *parent = NULL;
 	struct latch_tree_node *ltp;
 
 	while (*link) {
 		parent = *link;
-		ltp = container_of(parent, struct latch_tree_node, node);
+		ltp = __lt_from_rb(parent, idx);
 
 		if (less(ltn, ltp))
 			link = &parent->rb_left;
@@ -92,32 +90,32 @@ __lt_insert(struct latch_tree_node *ltn,
 			link = &parent->rb_right;
 	}
 
-	rb_link_node_rcu(&ltn->node, parent, link);
-	rb_insert_color(&ltn->node, root);
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, root);
 }
 
 static __always_inline void
-__lt_erase(struct latch_tree_node *ltn, struct rb_root *root)
+__lt_erase(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx)
 {
-	rb_erase(&ltn->node, root);
+	rb_erase(&ltn->node[idx], &ltr->tree[idx]);
 }
 
 static __always_inline struct latch_tree_node *
-__lt_find(void *key, struct rb_root *root,
-	  int (*comp)(void *key, struct latch_tree_node *ltn))
+__lt_find(void *key, struct latch_tree_root *ltr, int idx,
+	  int (*comp)(void *key, struct latch_tree_node *node))
 {
-	struct rb_node *n = rcu_dereference_raw(root->rb_node);
+	struct rb_node *node = rcu_dereference_raw(ltr->tree[idx].rb_node);
 	struct latch_tree_node *ltn;
 	int c;
 
-	while (n) {
-		ltn = container_of(n, struct latch_tree_node, node);
+	while (node) {
+		ltn = __lt_from_rb(node, idx);
 		c = comp(key, ltn);
 
 		if (c < 0)
-			n = rcu_dereference_raw(n->rb_left);
+			node = rcu_dereference_raw(node->rb_left);
 		else if (c > 0)
-			n = rcu_dereference_raw(n->rb_right);
+			node = rcu_dereference_raw(node->rb_right);
 		else
 			return ltn;
 	}
@@ -126,65 +124,56 @@ __lt_find(void *key, struct rb_root *roo
 }
 
 /**
- * latch_tree_insert() - insert @nodes into the trees @root
- * @nodes: nodes to insert
- * @root: trees to insert @nodes into
- * @priv: pointer to node private data
+ * latch_tree_insert() - insert @node into the trees @root
+ * @node: node to insert
+ * @root: trees to insert @node into
  * @ops: operators defining the node order
  *
- * Initializes @nodes private pointer with @priv; which typically is a back
- * pointer to the containing structure, used by @ops to find the search key.
+ * It inserts @node into @root in an ordered fashion such that we can always
+ * observe one complete tree. See the comment for raw_write_seqcount_latch().
  *
- * Then it inserts @nodes into @root in an ordered fashion such that we can
- * always observe one complete tree. See the comment for
- * raw_write_seqcount_latch().
- *
- * The inserts use rcu_assign_pointer() to publish the element such that
- * both the @priv pointer values in @nodes as the tree structure is stored
- * before we can observe the new @nodes.
+ * The inserts use rcu_assign_pointer() to publish the element such that the
+ * tree structure is stored before we can observe the new @node.
  *
  * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
  * serialized.
  */
 static __always_inline void
-latch_tree_insert(struct latch_tree_nodes *nodes,
+latch_tree_insert(struct latch_tree_node *node,
 		  struct latch_tree_root *root,
-		  void *priv,
 		  const struct latch_tree_ops *ops)
 {
-	nodes->node[0].priv = nodes->node[1].priv = priv;
-
 	raw_write_seqcount_latch(&root->seq);
-	__lt_insert(&nodes->node[0], &root->tree[0], ops->less);
+	__lt_insert(node, root, 0, ops->less);
 	raw_write_seqcount_latch(&root->seq);
-	__lt_insert(&nodes->node[1], &root->tree[1], ops->less);
+	__lt_insert(node, root, 1, ops->less);
 }
 
 /**
- * latch_tree_erase() - removes @nodes from the trees @root
- * @nodes: nodes to remote
- * @root: trees to remove @nodes from
+ * latch_tree_erase() - removes @node from the trees @root
+ * @node: node to remove
+ * @root: trees to remove @node from
  * @ops: operators defining the node order
  *
- * Removes @nodes from the trees @root in an ordered fashion such that we can
+ * Removes @node from the trees @root in an ordered fashion such that we can
  * always observe one complete tree. See the comment for
  * raw_write_seqcount_latch().
  *
- * It is assumed that @nodes will observe one RCU quiescent state before being
+ * It is assumed that @node will observe one RCU quiescent state before being
  * reused of freed.
  *
  * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
  * serialized.
  */
 static __always_inline void
-latch_tree_erase(struct latch_tree_nodes *nodes,
+latch_tree_erase(struct latch_tree_node *node,
 		 struct latch_tree_root *root,
 		 const struct latch_tree_ops *ops)
 {
 	raw_write_seqcount_latch(&root->seq);
-	__lt_erase(&nodes->node[0], &root->tree[0]);
+	__lt_erase(node, root, 0);
 	raw_write_seqcount_latch(&root->seq);
-	__lt_erase(&nodes->node[1], &root->tree[1]);
+	__lt_erase(node, root, 1);
 }
 
 /**
@@ -214,7 +203,7 @@ latch_tree_find(void *key, struct latch_
 
 	do {
 		seq = raw_read_seqcount(&root->seq);
-		node = __lt_find(key, &root->tree[seq & 1], ops->comp);
+		node = __lt_find(key, root, seq & 1, ops->comp);
 	} while (read_seqcount_retry(&root->seq, seq));
 
 	return node;
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -119,22 +119,24 @@ static LIST_HEAD(modules);
 
 static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
 {
-	struct module *mod = n->priv;
+	struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+	struct module *mod = mtn->mod;
 
-	if (n >= mod->ltn_init.node)
+	if (unlikely(mtn == &mod->mtn_init))
 		return (unsigned long)mod->module_init;
-	else
-		return (unsigned long)mod->module_core;
+
+	return (unsigned long)mod->module_core;
 }
 
 static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
 {
-	struct module *mod = n->priv;
+	struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+	struct module *mod = mtn->mod;
 
-	if (n >= mod->ltn_init.node)
+	if (unlikely(mtn == &mod->mtn_init))
 		return (unsigned long)mod->init_size;
-	else
-		return (unsigned long)mod->core_size;
+
+	return (unsigned long)mod->core_size;
 }
 
 static __always_inline bool
@@ -174,20 +176,23 @@ static struct latch_tree_root mod_tree _
  */
 static void mod_tree_insert(struct module *mod)
 {
-	latch_tree_insert(&mod->ltn_core, &mod_tree, mod, &mod_tree_ops);
+	mod->mtn_core.mod = mod;
+	mod->mtn_init.mod = mod;
+
+	latch_tree_insert(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
 	if (mod->init_size)
-		latch_tree_insert(&mod->ltn_init, &mod_tree, mod, &mod_tree_ops);
+		latch_tree_insert(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
 }
 
 static void mod_tree_remove_init(struct module *mod)
 {
 	if (mod->init_size)
-		latch_tree_erase(&mod->ltn_init, &mod_tree, &mod_tree_ops);
+		latch_tree_erase(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
 }
 
 static void mod_tree_remove(struct module *mod)
 {
-	latch_tree_erase(&mod->ltn_core, &mod_tree, &mod_tree_ops);
+	latch_tree_erase(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
 	mod_tree_remove_init(mod);
 }
 
@@ -196,8 +201,10 @@ static struct module *mod_find(unsigned
 	struct latch_tree_node *ltn;
 
 	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+	if (!ltn)
+		return NULL;
 
-	return ltn ? ltn->priv : NULL;
+	return container_of(ltn, struct mod_tree_node, node)->mod;
 }
 
 #else /* !(PERF_EVENTS || TRACING) */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09 12:13         ` Peter Zijlstra
@ 2015-04-09 12:28           ` Peter Zijlstra
  2015-04-09 16:31           ` Linus Torvalds
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-09 12:28 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 09, 2015 at 02:13:36PM +0200, Peter Zijlstra wrote:
> +struct module;
> +
> +struct mod_tree_node {
> +	struct latch_tree_node node;
> +	struct module *mod;
> +};

The module pointer should be first in this structure.
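
Presumably meaning something like this (an inference from the remark above,
not a posted hunk):

  struct mod_tree_node {
  	struct module		*mod;
  	struct latch_tree_node	node;
  };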

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/9] module: Sanitize RCU usage and locking
  2015-04-09  4:21   ` Lai Jiangshan
@ 2015-04-09 12:36     ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-09 12:36 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx

On Thu, Apr 09, 2015 at 12:21:43PM +0800, Lai Jiangshan wrote:
> On 04/09/2015 12:48 AM, Peter Zijlstra wrote:
> 
> > +static void module_assert_mutex_or_preempt(void)
> > +{
> > +#ifdef CONFIG_LOCKDEP
> > +	int rcu_held = rcu_read_lock_sched_held();
> > +	int mutex_held = 1;
> > +
> > +	if (debug_locks)
> > +		mutex_held = lockdep_is_held(&module_mutex);
> > +
> > +	WARN_ON(!rcu_held && !mutex_held);
> > +#endif
> > +}
> 
> Is rcu_lockdep_assert() suitable for it?
> (note rcu_lockdep_assert() only works when CONFIG_PROVE_RCU)

Nah, this works.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09 12:13         ` Peter Zijlstra
  2015-04-09 12:28           ` Peter Zijlstra
@ 2015-04-09 16:31           ` Linus Torvalds
  2015-04-09 16:59             ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2015-04-09 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Ingo Molnar, Rusty Russell, Mathieu Desnoyers,
	Oleg Nesterov, Paul McKenney, Linux Kernel Mailing List,
	Andi Kleen, Steven Rostedt, Thomas Gleixner, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 9, 2015 at 5:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>  struct latch_tree_node {
> +       struct rb_node node[2];
>  };
>
> +static __always_inline struct latch_tree_node *
> +__lt_from_rb(struct rb_node *node, int idx)
> +{
> +       return container_of(node, struct latch_tree_node, node[idx]);
> +}

Ugh. That syntax of offsetof() worries me a bit, but some grepping
shows that we already use this form of offsetof() in parts of the
kernel, so I guess it's fine.

Even with that small "Ugh", I do have to admit to preferring this to
having the back-pointer.

                        Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09 16:31           ` Linus Torvalds
@ 2015-04-09 16:59             ` Peter Zijlstra
  2015-04-09 17:12               ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-09 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lai Jiangshan, Ingo Molnar, Rusty Russell, Mathieu Desnoyers,
	Oleg Nesterov, Paul McKenney, Linux Kernel Mailing List,
	Andi Kleen, Steven Rostedt, Thomas Gleixner, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 09, 2015 at 09:31:16AM -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 5:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >  struct latch_tree_node {
> > +       struct rb_node node[2];
> >  };
> >
> > +static __always_inline struct latch_tree_node *
> > +__lt_from_rb(struct rb_node *node, int idx)
> > +{
> > +       return container_of(node, struct latch_tree_node, node[idx]);
> > +}
> 
> Ugh. That syntax of offsetof() worries me a bit, but some grepping
> shows that we already use this form of offsetof() in parts of the
> kernel, so I guess it's fine.

I was a little surprised myself that it worked, but it's a constant after
all, so it 'should'.

> Even with that small "Ugh", I do have to admit to preferring this to
> having the back-pointer.

Yeah, me too; I'll respin the patches properly after I've given it some
actual runtime too.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09 16:59             ` Peter Zijlstra
@ 2015-04-09 17:12               ` Linus Torvalds
  2015-04-09 17:17                 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2015-04-09 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Ingo Molnar, Rusty Russell, Mathieu Desnoyers,
	Oleg Nesterov, Paul McKenney, Linux Kernel Mailing List,
	Andi Kleen, Steven Rostedt, Thomas Gleixner, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 9, 2015 at 9:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> I was a little surprised myself it worked, but its a constant after
> all so it 'should'.

Well, it should actually work for non-constants too, even when that
'idx' isn't inlined to one of the fixed constants. So even if this
ever hits the "seq & 1" case for lookup, it should all work. It's
just an expression, after all.

All that is really required is that you can do '&(x->y)' on it, where
'x' is the type, and 'y' is the "member".

It's just unusual, which I think makes it slightly harder to read for
a *human* just because it breaks the normal pattern of those things.
But there's nothing technically wrong with it.

                        Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] rbtree: Implement generic latch_tree
  2015-04-09 17:12               ` Linus Torvalds
@ 2015-04-09 17:17                 ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2015-04-09 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Ingo Molnar, Rusty Russell, Mathieu Desnoyers,
	Oleg Nesterov, Paul McKenney, Linux Kernel Mailing List,
	Andi Kleen, Steven Rostedt, Thomas Gleixner, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Apr 9, 2015 at 10:12 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's just unusual, which I think makes it slightly harder to read for
> a *human* just because it breaks the normal pattern of those things.
> But there's nothing technically wrong with it.

You could have written it as

  static __always_inline struct latch_tree_node *
  __lt_from_rb(struct rb_node *node, int idx)
  {
         return (void *)(node-idx);
  }

which is actually shorter, but even if that container_of() use looks a
bit unusual, I think the container_of() shows more clearly what you
actually want. Plus it would be more reliable if you ever end up
adding any other fields to the struct latch_tree_node.

So I wouldn't actually suggest using that "(node-idx)" format. I think
it should result in the exact same code, though.

                         Linus
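
To make the two forms concrete, a small userspace illustration; hedged: it
uses a simplified container_of(), and relies on GCC/clang's
__builtin_offsetof accepting a run-time array index, which is what the
kernel macro ends up doing as well:

  #include <stddef.h>
  #include <stdio.h>

  struct rb_node { void *left, *right; };
  struct latch_tree_node { struct rb_node node[2]; };

  /* Simplified stand-in for the kernel's container_of(). */
  #define container_of(ptr, type, member) \
  	((type *)((char *)(ptr) - offsetof(type, member)))

  int main(void)
  {
  	struct latch_tree_node ltn;
  	int idx = 1;		/* may be a run-time value, e.g. seq & 1 */
  	struct rb_node *rb = &ltn.node[idx];

  	struct latch_tree_node *a =
  		container_of(rb, struct latch_tree_node, node[idx]);
  	struct latch_tree_node *b = (void *)(rb - idx);

  	printf("%d %d\n", a == &ltn, b == &ltn);	/* prints "1 1" */
  	return 0;
  }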

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-04-09 17:17 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-08 16:48 [PATCH v4 0/9] latched RB-trees and __module_address() Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 1/9] module: Sanitize RCU usage and locking Peter Zijlstra
2015-04-09  4:21   ` Lai Jiangshan
2015-04-09 12:36     ` Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 2/9] module: Annotate module version magic Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 3/9] module, jump_label: Fix module locking Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 4/9] rbtree: Make lockless searches non-fatal Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 5/9] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 6/9] rbtree: Implement generic latch_tree Peter Zijlstra
2015-04-09  8:09   ` Lai Jiangshan
2015-04-09  8:14     ` Peter Zijlstra
2015-04-09  8:55       ` Lai Jiangshan
2015-04-09 12:13         ` Peter Zijlstra
2015-04-09 12:28           ` Peter Zijlstra
2015-04-09 16:31           ` Linus Torvalds
2015-04-09 16:59             ` Peter Zijlstra
2015-04-09 17:12               ` Linus Torvalds
2015-04-09 17:17                 ` Linus Torvalds
2015-04-08 16:48 ` [PATCH v4 7/9] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 8/9] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
2015-04-08 16:48 ` [PATCH v4 9/9] module: Use __module_address() for module_address_lookup() Peter Zijlstra
2015-04-08 22:08 ` [PATCH v4 0/9] latched RB-trees and __module_address() Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).