* [PATCH v5 00/10] latched RB-trees and __module_address()
@ 2015-04-13 14:11 Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 01/10] module: Sanitize RCU usage and locking Peter Zijlstra
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz


This series is aimed at making __module_address() go fast(er).

The reason for doing so is that most stack unwinders use kernel_text_address()
to validate each frame. Perf and ftrace (can) end up doing a lot of stack
traces from performance-sensitive code.

On the way there it:
 - annotates and sanitizes module locking
 - introduces the latched RB-tree
 - employs it to make __module_address() go fast.

I've build- and boot-tested this on x86_64 with modules and lockdep
enabled.  Performance numbers (below) were taken with lockdep disabled.

As previously mentioned, the reason for writing the latched RB-tree as generic
code is mostly for clarity/documentation purposes, as there are a number of
separate and non-trivial bits to the complete solution.

As measured on my ivb-ep system with 84 modules loaded, the test module
(below) reports, before and after the patches (cache hot, performance cpufreq):

          avg +- stdev
Before:   611 +- 10 [ns] per __module_address() call
After:     17 +-  5 [ns] per __module_address() call

PMI measurements for a cpu running loops in a module (also [ns]):

Before:	Mean: 2719 +- 1, Stdev: 214, Samples: 40036
After:  Mean:  947 +- 0, Stdev: 132, Samples: 40037

Note: I have also tested things like 'perf record -a -g modprobe mod_test',
to make 'sure' to hit some of the more interesting paths.
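
For reference, a minimal sketch of the kind of timing loop the test module
uses (the actual test code is not included in this posting; the names and the
iteration count below are illustrative only):

	/* mod_test: rough __module_address() latency measurement (sketch) */
	#include <linux/module.h>
	#include <linux/ktime.h>
	#include <linux/math64.h>
	#include <linux/preempt.h>

	static int __init mod_test_init(void)
	{
		unsigned long addr = (unsigned long)&mod_test_init;
		u64 t0, t1;
		int i;

		preempt_disable();	/* __module_address() is RCU-sched */
		t0 = ktime_get_ns();
		for (i = 0; i < 10000; i++)
			__module_address(addr);
		t1 = ktime_get_ns();
		preempt_enable();

		pr_info("mod_test: ~%llu ns per __module_address() call\n",
			div_u64(t1 - t0, 10000));
		return 0;
	}
	module_init(mod_test_init);
	MODULE_LICENSE("GPL");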

Changes since last time:

 - reworked generic latch_tree API (Lai Jiangshan)
 - reworked module bounds (me)
 - reworked all the testing code (not included)

Rusty, please consider merging this (for 4.2; I know it's the merge window, no
rush).




* [PATCH v5 01/10] module: Sanitize RCU usage and locking
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 15:32   ` Ingo Molnar
  2015-04-13 14:11 ` [PATCH v5 02/10] module: Annotate module version magic Peter Zijlstra
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-sanitize-rcu.patch --]
[-- Type: text/plain, Size: 5648 bytes --]

Currently the RCU usage in the module code is an inconsistent mess of RCU and
RCU-sched; this is broken for CONFIG_PREEMPT, where synchronize_rcu() does
not imply synchronize_sched().

Most usage sites use preempt_{dis,en}able(), which is RCU-sched, but (most of)
the modification sites use synchronize_rcu() -- with the exception of the
module bug list, which actually uses RCU.

Convert everything over to RCU-sched.

Furthermore, add lockdep asserts to all sites, because it's not at all
clear to me that the required locking is observed, especially on exported
functions.
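
For illustration only (not part of the patch), the reader/updater pairing this
converges on looks like:

	/* RCU-sched reader, e.g. __module_address() */
	preempt_disable();
	list_for_each_entry_rcu(mod, &modules, list)
		/* ... use mod ... */;
	preempt_enable();

	/* updater, e.g. free_module() */
	mutex_lock(&module_mutex);
	list_del_rcu(&mod->list);
	synchronize_sched();	/* waits for all preempt-disabled readers */
	mutex_unlock(&module_mutex);
	/* mod may now be freed */

Under CONFIG_PREEMPT a synchronize_rcu() in the updater would not wait for the
preempt-disabled reader above, which is exactly the breakage being fixed.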

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   12 ++++++++++--
 kernel/module.c        |   42 ++++++++++++++++++++++++++++++++++--------
 lib/bug.c              |    7 +++++--
 3 files changed, 49 insertions(+), 12 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -415,14 +415,22 @@ struct symsearch {
 	bool unused;
 };
 
-/* Search for an exported symbol by name. */
+/*
+ * Search for an exported symbol by name.
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 const struct kernel_symbol *find_symbol(const char *name,
 					struct module **owner,
 					const unsigned long **crc,
 					bool gplok,
 					bool warn);
 
-/* Walk the exported symbol table */
+/*
+ * Walk the exported symbol table
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 				    struct module *owner,
 				    void *data), void *data);
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -106,6 +106,24 @@ static LIST_HEAD(modules);
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
 
+static void module_assert_mutex(void)
+{
+	lockdep_assert_held(&module_mutex);
+}
+
+static void module_assert_mutex_or_preempt(void)
+{
+#ifdef CONFIG_LOCKDEP
+	int rcu_held = rcu_read_lock_sched_held();
+	int mutex_held = 1;
+
+	if (debug_locks)
+		mutex_held = lockdep_is_held(&module_mutex);
+
+	WARN_ON(!rcu_held && !mutex_held);
+#endif
+}
+
 #ifdef CONFIG_MODULE_SIG
 #ifdef CONFIG_MODULE_SIG_FORCE
 static bool sig_enforce = true;
@@ -319,6 +337,8 @@ bool each_symbol_section(bool (*fn)(cons
 #endif
 	};
 
+	module_assert_mutex_or_preempt();
+
 	if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
 		return true;
 
@@ -458,6 +478,8 @@ static struct module *find_module_all(co
 {
 	struct module *mod;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -1856,8 +1878,8 @@ static void free_module(struct module *m
 	list_del_rcu(&mod->list);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
-	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
 
 	/* This may be NULL, but that's OK */
@@ -3106,11 +3128,11 @@ static noinline int do_init_module(struc
 	mod->init_text_size = 0;
 	/*
 	 * We want to free module_init, but be aware that kallsyms may be
-	 * walking this with preempt disabled.  In all the failure paths,
-	 * we call synchronize_rcu/synchronize_sched, but we don't want
-	 * to slow down the success path, so use actual RCU here.
+	 * walking this with preempt disabled.  In all the failure paths, we
+	 * call synchronize_sched, but we don't want to slow down the success
+	 * path, so use actual RCU here.
 	 */
-	call_rcu(&freeinit->rcu, do_free_init);
+	call_rcu_sched(&freeinit->rcu, do_free_init);
 	mutex_unlock(&module_mutex);
 	wake_up_all(&module_wq);
 
@@ -3368,8 +3390,8 @@ static int load_module(struct load_info
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
 	wake_up_all(&module_wq);
-	/* Wait for RCU synchronizing before releasing mod->list. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
  free_module:
 	/* Free lock-classes; relies on the preceding sync_rcu() */
@@ -3636,6 +3658,8 @@ int module_kallsyms_on_each_symbol(int (
 	unsigned int i;
 	int ret;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -3810,6 +3834,8 @@ struct module *__module_address(unsigned
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
+	module_assert_mutex_or_preempt();
+
 	list_for_each_entry_rcu(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -66,7 +66,7 @@ static const struct bug_entry *module_fi
 	struct module *mod;
 	const struct bug_entry *bug = NULL;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	list_for_each_entry_rcu(mod, &module_bug_list, bug_list) {
 		unsigned i;
 
@@ -77,7 +77,7 @@ static const struct bug_entry *module_fi
 	}
 	bug = NULL;
 out:
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 
 	return bug;
 }
@@ -88,6 +88,8 @@ void module_bug_finalize(const Elf_Ehdr
 	char *secstrings;
 	unsigned int i;
 
+	lockdep_assert_held(&module_mutex);
+
 	mod->bug_table = NULL;
 	mod->num_bugs = 0;
 
@@ -113,6 +115,7 @@ void module_bug_finalize(const Elf_Ehdr
 
 void module_bug_cleanup(struct module *mod)
 {
+	lockdep_assert_held(&module_mutex);
 	list_del_rcu(&mod->bug_list);
 }
 




* [PATCH v5 02/10] module: Annotate module version magic
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 01/10] module: Sanitize RCU usage and locking Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 03/10] module, jump_label: Fix module locking Peter Zijlstra
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-fix-load_module.patch --]
[-- Type: text/plain, Size: 2807 bytes --]

Due to the new lockdep checks we get:

[    9.759380] ------------[ cut here ]------------
[    9.759389] WARNING: CPU: 31 PID: 597 at ../kernel/module.c:216 each_symbol_section+0x121/0x130()
[    9.759391] Modules linked in:
[    9.759393] CPU: 31 PID: 597 Comm: modprobe Not tainted 4.0.0-rc1+ #65
[    9.759393] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[    9.759396]  ffffffff817d8676 ffff880424567ca8 ffffffff8157e98b 0000000000000001
[    9.759398]  0000000000000000 ffff880424567ce8 ffffffff8105fbc7 ffff880424567cd8
[    9.759400]  0000000000000000 ffffffff810ec160 ffff880424567d40 0000000000000000
[    9.759400] Call Trace:
[    9.759407]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[    9.759410]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[    9.759412]  [<ffffffff810ec160>] ? section_objs+0x60/0x60
[    9.759414]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[    9.759415]  [<ffffffff810ed9c1>] each_symbol_section+0x121/0x130
[    9.759417]  [<ffffffff810eda01>] find_symbol+0x31/0x70
[    9.759420]  [<ffffffff810ef5bf>] load_module+0x20f/0x2660
[    9.759422]  [<ffffffff8104ef10>] ? __do_page_fault+0x190/0x4e0
[    9.759426]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759427]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759433]  [<ffffffff810ae73d>] ? trace_hardirqs_on_caller+0x11d/0x1e0
[    9.759437]  [<ffffffff812fcc0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[    9.759439]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759441]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[    9.759443]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[    9.759445] ---[ end trace 9294429076a9c644 ]---

As per the comment, this site should be fine, but let's wrap it in
preempt_disable() anyhow to placate lockdep.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1191,11 +1191,17 @@ static inline int check_modstruct_versio
 {
 	const unsigned long *crc;
 
-	/* Since this should be found in kernel (which can't be removed),
-	 * no locking is necessary. */
+	/*
+	 * Since this should be found in kernel (which can't be removed), no
+	 * locking is necessary -- use preempt_disable() to placate lockdep.
+	 */
+	preempt_disable();
 	if (!find_symbol(VMLINUX_SYMBOL_STR(module_layout), NULL,
-			 &crc, true, false))
+			 &crc, true, false)) {
+		preempt_enable();
 		BUG();
+	}
+	preempt_enable();
 	return check_version(sechdrs, versindex,
 			     VMLINUX_SYMBOL_STR(module_layout), mod, crc,
 			     NULL);




* [PATCH v5 03/10] module, jump_label: Fix module locking
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 01/10] module: Sanitize RCU usage and locking Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 02/10] module: Annotate module version magic Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 04/10] rbtree: Make lockless searches non-fatal Peter Zijlstra
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz, Jason Baron

[-- Attachment #1: peterz-module-fix-jump_label.patch --]
[-- Type: text/plain, Size: 2730 bytes --]

As per the module core lockdep annotations:

[   18.034047] ---[ end trace 9294429076a9c673 ]---
[   18.047760] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   18.059228]  ffffffff817d8676 ffff880036683c38 ffffffff8157e98b 0000000000000001
[   18.067541]  0000000000000000 ffff880036683c78 ffffffff8105fbc7 ffff880036683c68
[   18.075851]  ffffffffa0046b08 0000000000000000 ffffffffa0046d00 ffffffffa0046cc8
[   18.084173] Call Trace:
[   18.086906]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[   18.092649]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[   18.099361]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[   18.105880]  [<ffffffff810ee502>] __module_address+0x1d2/0x1e0
[   18.112400]  [<ffffffff81161153>] jump_label_module_notify+0x143/0x1e0
[   18.119710]  [<ffffffff810814bf>] notifier_call_chain+0x4f/0x70
[   18.126326]  [<ffffffff8108160e>] __blocking_notifier_call_chain+0x5e/0x90
[   18.134009]  [<ffffffff81081656>] blocking_notifier_call_chain+0x16/0x20
[   18.141490]  [<ffffffff810f0f00>] load_module+0x1b50/0x2660
[   18.147720]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[   18.154045]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[   18.160748] ---[ end trace 9294429076a9c674 ]---

The jump label code is not doing it right; fix this.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/jump_label.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -302,7 +302,7 @@ static int jump_label_add_module(struct
 			continue;
 
 		key = iterk;
-		if (__module_address(iter->key) == mod) {
+		if (within_module(iter->key, mod)) {
 			/*
 			 * Set key->entries to iter, but preserve JUMP_LABEL_TRUE_BRANCH.
 			 */
@@ -339,7 +339,7 @@ static void jump_label_del_module(struct
 
 		key = (struct static_key *)(unsigned long)iter->key;
 
-		if (__module_address(iter->key) == mod)
+		if (within_module(iter->key, mod))
 			continue;
 
 		prev = &key->next;
@@ -443,14 +443,16 @@ static void jump_label_update(struct sta
 {
 	struct jump_entry *stop = __stop___jump_table;
 	struct jump_entry *entry = jump_label_get_entries(key);
-
 #ifdef CONFIG_MODULES
-	struct module *mod = __module_address((unsigned long)key);
+	struct module *mod;
 
 	__jump_label_mod_update(key, enable);
 
+	preempt_disable();
+	mod = __module_address((unsigned long)key);
 	if (mod)
 		stop = mod->jump_entries + mod->num_jump_entries;
+	preempt_enable();
 #endif
 	/* if there are no users, entry can be NULL */
 	if (entry)




* [PATCH v5 04/10] rbtree: Make lockless searches non-fatal
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 03/10] module, jump_label: Fix module locking Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 15:50   ` Ingo Molnar
  2015-04-13 14:11 ` [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz,
	David Woodhouse, Rik van Riel, Andrea Arcangeli,
	Michel Lespinasse

[-- Attachment #1: peterz-rbtree-lockless.patch --]
[-- Type: text/plain, Size: 10107 bytes --]

Change the insert and erase code such that lockless searches are
non-fatal.

In and of itself an rbtree cannot be correctly searched while it is being
modified; we can however provide weaker guarantees that allow the rbtree to be
used in conjunction with other techniques, such as latches; see
9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").

For this to work we need the following guarantees from the rbtree
code:

 1) a lockless reader must not see partial stores, as these would allow it
    to observe nodes that are invalid memory.

 2) there must not be (temporary) loops in the tree structure in the
    modifier's program order, as these would cause a lookup which
    interrupts the modifier to get stuck indefinitely.

For 1) we must use WRITE_ONCE() for all updates to the tree structure;
in particular this patch only does rb_{left,right} as those are the only
elements required for simple searches.

It generates slightly worse code, probably because of the volatile accesses.
But in pointer-chasing-heavy code a few extra instructions should not matter.

For 2) I have carefully audited the code and drawn every intermediate
link state and not found a loop.
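
To make the intent concrete, here is a sketch (illustrative, not part of this
patch) of the kind of lockless search the two guarantees are aimed at; it is
essentially what the latched tree code introduced later does:

	/*
	 * May run concurrently with insert/erase.  Guarantee 1) means every
	 * child pointer loaded below is either NULL or points at a valid
	 * node; guarantee 2) means the loop terminates even if it interrupts
	 * the modifier.  It may still miss nodes during a concurrent
	 * rotation, hence the latch on top.
	 */
	static struct rb_node *
	lockless_search(struct rb_root *root, unsigned long key,
			int (*cmp)(unsigned long key, struct rb_node *node))
	{
		struct rb_node *node = rcu_dereference_raw(root->rb_node);

		while (node) {
			int c = cmp(key, node);

			if (c < 0)
				node = rcu_dereference_raw(node->rb_left);
			else if (c > 0)
				node = rcu_dereference_raw(node->rb_right);
			else
				return node;
		}
		return NULL;
	}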

Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree.h           |   10 +++++
 include/linux/rbtree_augmented.h |   21 +++++++---
 lib/rbtree.c                     |   76 +++++++++++++++++++++++++++------------
 3 files changed, 78 insertions(+), 29 deletions(-)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -31,6 +31,7 @@
 
 #include <linux/kernel.h>
 #include <linux/stddef.h>
+#include <linux/rcupdate.h>
 
 struct rb_node {
 	unsigned long  __rb_parent_color;
@@ -85,6 +86,15 @@ static inline void rb_link_node(struct r
 	*rb_link = node;
 }
 
+static inline void rb_link_node_rcu(struct rb_node * node, struct rb_node * parent,
+				    struct rb_node ** rb_link)
+{
+	node->__rb_parent_color = (unsigned long)parent;
+	node->rb_left = node->rb_right = NULL;
+
+	rcu_assign_pointer(*rb_link, node);
+}
+
 #define rb_entry_safe(ptr, type, member) \
 	({ typeof(ptr) ____ptr = (ptr); \
 	   ____ptr ? rb_entry(____ptr, type, member) : NULL; \
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -123,11 +123,11 @@ __rb_change_child(struct rb_node *old, s
 {
 	if (parent) {
 		if (parent->rb_left == old)
-			parent->rb_left = new;
+			WRITE_ONCE(parent->rb_left, new);
 		else
-			parent->rb_right = new;
+			WRITE_ONCE(parent->rb_right, new);
 	} else
-		root->rb_node = new;
+		WRITE_ONCE(root->rb_node, new);
 }
 
 extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root,
@@ -137,7 +137,8 @@ static __always_inline struct rb_node *
 __rb_erase_augmented(struct rb_node *node, struct rb_root *root,
 		     const struct rb_augment_callbacks *augment)
 {
-	struct rb_node *child = node->rb_right, *tmp = node->rb_left;
+	struct rb_node *child = node->rb_right;
+	struct rb_node *tmp = node->rb_left;
 	struct rb_node *parent, *rebalance;
 	unsigned long pc;
 
@@ -167,6 +168,7 @@ __rb_erase_augmented(struct rb_node *nod
 		tmp = parent;
 	} else {
 		struct rb_node *successor = child, *child2;
+
 		tmp = child->rb_left;
 		if (!tmp) {
 			/*
@@ -180,6 +182,7 @@ __rb_erase_augmented(struct rb_node *nod
 			 */
 			parent = successor;
 			child2 = successor->rb_right;
+
 			augment->copy(node, successor);
 		} else {
 			/*
@@ -201,19 +204,23 @@ __rb_erase_augmented(struct rb_node *nod
 				successor = tmp;
 				tmp = tmp->rb_left;
 			} while (tmp);
-			parent->rb_left = child2 = successor->rb_right;
-			successor->rb_right = child;
+			child2 = successor->rb_right;
+			WRITE_ONCE(parent->rb_left, child2);
+			WRITE_ONCE(successor->rb_right, child);
 			rb_set_parent(child, successor);
+
 			augment->copy(node, successor);
 			augment->propagate(parent, successor);
 		}
 
-		successor->rb_left = tmp = node->rb_left;
+		tmp = node->rb_left;
+		WRITE_ONCE(successor->rb_left, tmp);
 		rb_set_parent(tmp, successor);
 
 		pc = node->__rb_parent_color;
 		tmp = __rb_parent(pc);
 		__rb_change_child(node, successor, tmp, root);
+
 		if (child2) {
 			successor->__rb_parent_color = pc;
 			rb_set_parent_color(child2, parent, RB_BLACK);
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -44,6 +44,30 @@
  *  parentheses and have some accompanying text comment.
  */
 
+/*
+ * Notes on lockless lookups:
+ *
+ * All stores to the tree structure (rb_left and rb_right) must be done using
+ * WRITE_ONCE(). And we must not inadvertently cause (temporary) loops in the
+ * tree structure as seen in program order.
+ *
+ * These two requirements will allow lockless iteration of the tree -- not
+ * correct iteration mind you, tree rotations are not atomic so a lookup might
+ * miss entire subtrees.
+ *
+ * But they do guarantee that any such traversal will only see valid elements
+ * and that it will indeed complete -- does not get stuck in a loop.
+ *
+ * It also guarantees that if the lookup returns an element it is the 'correct'
 + * one. But not returning an element does _NOT_ mean it's not present.
+ *
+ * NOTE:
+ *
+ * Stores to __rb_parent_color are not important for simple lookups so those
+ * are left undone as of now. Nor did I check for loops involving parent
+ * pointers.
+ */
+
 static inline void rb_set_black(struct rb_node *rb)
 {
 	rb->__rb_parent_color |= RB_BLACK;
@@ -129,8 +153,9 @@ __rb_insert(struct rb_node *node, struct
 				 * This still leaves us in violation of 4), the
 				 * continuation into Case 3 will fix that.
 				 */
-				parent->rb_right = tmp = node->rb_left;
-				node->rb_left = parent;
+				tmp = node->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp);
+				WRITE_ONCE(node->rb_left, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -149,8 +174,8 @@ __rb_insert(struct rb_node *node, struct
 			 *     /                 \
 			 *    n                   U
 			 */
-			gparent->rb_left = tmp;  /* == parent->rb_right */
-			parent->rb_right = gparent;
+			WRITE_ONCE(gparent->rb_left, tmp); /* == parent->rb_right */
+			WRITE_ONCE(parent->rb_right, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -171,8 +196,9 @@ __rb_insert(struct rb_node *node, struct
 			tmp = parent->rb_left;
 			if (node == tmp) {
 				/* Case 2 - right rotate at parent */
-				parent->rb_left = tmp = node->rb_right;
-				node->rb_right = parent;
+				tmp = node->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp);
+				WRITE_ONCE(node->rb_right, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -183,8 +209,8 @@ __rb_insert(struct rb_node *node, struct
 			}
 
 			/* Case 3 - left rotate at gparent */
-			gparent->rb_right = tmp;  /* == parent->rb_left */
-			parent->rb_left = gparent;
+			WRITE_ONCE(gparent->rb_right, tmp); /* == parent->rb_left */
+			WRITE_ONCE(parent->rb_left, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -224,8 +250,9 @@ ____rb_erase_color(struct rb_node *paren
 				 *      / \         / \
 				 *     Sl  Sr      N   Sl
 				 */
-				parent->rb_right = tmp1 = sibling->rb_left;
-				sibling->rb_left = parent;
+				tmp1 = sibling->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp1);
+				WRITE_ONCE(sibling->rb_left, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -275,9 +302,10 @@ ____rb_erase_color(struct rb_node *paren
 				 *                       \
 				 *                        Sr
 				 */
-				sibling->rb_left = tmp1 = tmp2->rb_right;
-				tmp2->rb_right = sibling;
-				parent->rb_right = tmp2;
+				tmp1 = tmp2->rb_right;
+				WRITE_ONCE(sibling->rb_left, tmp1);
+				WRITE_ONCE(tmp2->rb_right, sibling);
+				WRITE_ONCE(parent->rb_right, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -297,8 +325,9 @@ ____rb_erase_color(struct rb_node *paren
 			 *        / \         / \
 			 *      (sl) sr      N  (sl)
 			 */
-			parent->rb_right = tmp2 = sibling->rb_left;
-			sibling->rb_left = parent;
+			tmp2 = sibling->rb_left;
+			WRITE_ONCE(parent->rb_right, tmp2);
+			WRITE_ONCE(sibling->rb_left, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);
@@ -310,8 +339,9 @@ ____rb_erase_color(struct rb_node *paren
 			sibling = parent->rb_left;
 			if (rb_is_red(sibling)) {
 				/* Case 1 - right rotate at parent */
-				parent->rb_left = tmp1 = sibling->rb_right;
-				sibling->rb_right = parent;
+				tmp1 = sibling->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp1);
+				WRITE_ONCE(sibling->rb_right, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -336,9 +366,10 @@ ____rb_erase_color(struct rb_node *paren
 					break;
 				}
 				/* Case 3 - right rotate at sibling */
-				sibling->rb_right = tmp1 = tmp2->rb_left;
-				tmp2->rb_left = sibling;
-				parent->rb_left = tmp2;
+				tmp1 = tmp2->rb_left;
+				WRITE_ONCE(sibling->rb_right, tmp1);
+				WRITE_ONCE(tmp2->rb_left, sibling);
+				WRITE_ONCE(parent->rb_left, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -347,8 +378,9 @@ ____rb_erase_color(struct rb_node *paren
 				sibling = tmp2;
 			}
 			/* Case 4 - left rotate at parent + color flips */
-			parent->rb_left = tmp2 = sibling->rb_right;
-			sibling->rb_right = parent;
+			tmp2 = sibling->rb_right;
+			WRITE_ONCE(parent->rb_left, tmp2);
+			WRITE_ONCE(sibling->rb_right, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);




* [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 04/10] rbtree: Make lockless searches non-fatal Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 16:32   ` Ingo Molnar
  2015-04-13 14:11 ` [PATCH v5 06/10] rbtree: Implement generic latch_tree Peter Zijlstra
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

[-- Attachment #1: peterz-latch-comment.patch --]
[-- Type: text/plain, Size: 4960 bytes --]

Improve the documentation of the latch technique as used in the
current timekeeping code, such that it can be readily employed
elsewhere.

Borrow from the comments in timekeeping and replace those with a
reference to this more generic comment.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/seqlock.h   |   76 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/time/timekeeping.c |   27 ----------------
 2 files changed, 76 insertions(+), 27 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -233,9 +233,83 @@ static inline void raw_write_seqcount_en
 	s->sequence++;
 }
 
-/*
+/**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
+ *
+ * The latch technique is a multiversion concurrency control method that allows
+ * queries during non atomic modifications. If you can guarantee queries never
+ * interrupt the modification -- e.g. the concurrency is strictly between CPUs
+ * -- you most likely do not need this.
+ *
+ * Where the traditional RCU/lockless data structures rely on atomic
+ * modifications to ensure queries observe either the old or the new state the
+ * latch allows the same for non atomic updates. The trade-off is doubling the
+ * cost of storage; we have to maintain two copies of the entire data
+ * structure.
+ *
+ * Very simply put: we first modify one copy and then the other. This ensures
+ * there is always one copy in a stable state, ready to give us an answer.
+ *
+ * The basic form is a data structure like:
+ *
+ * struct latch_struct {
+ *	seqcount_t		seq;
+ *	struct data_struct	data[2];
+ * };
+ *
+ * Where a modification, which is assumed to be externally serialized, does the
+ * following:
+ *
+ * void latch_modify(struct latch_struct *latch, ...)
+ * {
+ *	smp_wmb();	<- Ensure that the last data[1] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[0], ...);
+ *
+ *	smp_wmb();	<- Ensure that the data[0] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[1], ...);
+ * }
+ *
+ * The query will have a form like:
+ *
+ * struct entry *latch_query(struct latch_struct *latch, ...)
+ * {
+ *	struct entry *entry;
+ *	unsigned seq, idx;
+ *
+ *	do {
+ *		seq = latch->seq;
+ *		smp_rmb();
+ *
+ *		idx = seq & 0x01;
+ *		entry = data_query(latch->data[idx], ...);
+ *
+ *		smp_rmb();
+ *	} while (seq != latch->seq);
+ *
+ *	return entry;
+ * }
+ *
+ * So during the modification, queries are first redirected to data[1]. Then we
+ * modify data[0]. When that is complete, we redirect queries back to data[0]
+ * and we can modify data[1].
+ *
+ * NOTE: The non-requirement for atomic modifications does _NOT_ include
+ *       the publishing of new entries in the case where data is a dynamic
+ *       data structure.
+ *
+ *       An iteration might start in data[0] and get suspended long enough
+ *       to miss an entire modification sequence, once it resumes it might
+ *       observe the new entry.
+ *
+ * NOTE: When data is a dynamic data structure; one should use regular RCU
+ *       patterns to manage the lifetimes of the objects within.
  */
 static inline void raw_write_seqcount_latch(seqcount_t *s)
 {
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -339,32 +339,7 @@ static inline s64 timekeeping_get_ns_raw
  * We want to use this from any context including NMI and tracing /
  * instrumenting the timekeeping code itself.
  *
- * So we handle this differently than the other timekeeping accessor
- * functions which retry when the sequence count has changed. The
- * update side does:
- *
- * smp_wmb();	<- Ensure that the last base[1] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[0], tkr);
- * smp_wmb();	<- Ensure that the base[0] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[1], tkr);
- *
- * The reader side does:
- *
- * do {
- *	seq = tkf->seq;
- *	smp_rmb();
- *	idx = seq & 0x01;
- *	now = now(tkf->base[idx]);
- *	smp_rmb();
- * } while (seq != tkf->seq)
- *
- * As long as we update base[0] readers are forced off to
- * base[1]. Once base[0] is updated readers are redirected to base[0]
- * and the base[1] update takes place.
+ * Employ the latch technique; see @raw_write_seqcount_latch.
  *
  * So if a NMI hits the update of base[0] then it will use base[1]
  * which is still consistent. In the worst case this can result is a




* [PATCH v5 06/10] rbtree: Implement generic latch_tree
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 16:43   ` Ingo Molnar
  2015-04-13 14:11 ` [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz,
	Michel Lespinasse, Andrea Arcangeli, David Woodhouse,
	Rik van Riel

[-- Attachment #1: peterz-rbtree-latch.patch --]
[-- Type: text/plain, Size: 7584 bytes --]

Implement a latched RB-tree in order to get unconditional RCU/lockless
lookups.
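
A usage sketch (illustrative only; the module code converted in a later patch
is the real user) -- embed a latch_tree_node in the object and provide the
ordering operators:

	#include <linux/rbtree_latch.h>

	struct my_obj {
		unsigned long		key;
		struct latch_tree_node	node;
	};

	static bool my_less(struct latch_tree_node *a, struct latch_tree_node *b)
	{
		return container_of(a, struct my_obj, node)->key <
		       container_of(b, struct my_obj, node)->key;
	}

	static int my_comp(void *key, struct latch_tree_node *n)
	{
		unsigned long k = (unsigned long)key;
		unsigned long nk = container_of(n, struct my_obj, node)->key;

		if (k < nk)
			return -1;
		if (k > nk)
			return 1;
		return 0;
	}

	static const struct latch_tree_ops my_ops = {
		.less = my_less,
		.comp = my_comp,
	};

	static struct latch_tree_root my_root;

	/* modifications must be externally serialized, e.g. by a mutex */
	static void my_insert(struct my_obj *obj)
	{
		latch_tree_insert(&obj->node, &my_root, &my_ops);
	}

	/* lookups may run from any context, including NMI */
	static struct my_obj *my_find(unsigned long key)
	{
		struct latch_tree_node *ltn;

		ltn = latch_tree_find((void *)key, &my_root, &my_ops);
		return ltn ? container_of(ltn, struct my_obj, node) : NULL;
	}

Note that latch_tree_erase() must still be followed by an RCU grace period
before the object is freed or reused.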

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_latch.h |  212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)

--- /dev/null
+++ b/include/linux/rbtree_latch.h
@@ -0,0 +1,212 @@
+/*
+ * Latched RB-trees
+ *
+ * Copyright (C) 2015 Intel Corp., Peter Zijlstra <peterz@infradead.org>
+ *
+ * Since RB-trees have non atomic modifications they're not immediately suited
+ * for RCU/lockless queries. Even though we made RB tree lookups non-fatal for
+ * lockless lookups; we cannot guarantee they return a correct result.
+ *
+ * The simplest solution is a seqlock + rb-tree, this will allow lockless
+ * lookups; but has the constraint (inherent to the seqlock) that read sides
+ * cannot nest in write sides.
+ *
+ * If we need to allow unconditional lookups (say as required for NMI context
+ * usage) we need a more complex setup; this data structure provides this by
+ * employing the latch technique -- see @raw_write_seqcount_latch -- to
+ * implement a latched RB-tree which does allow for unconditional lookups by
+ * virtue of always having (at least) one stable copy of the tree.
+ *
+ * However, while we have the guarantee that there is at all times one stable
+ * copy, this does not guarantee an iteration will not observe modifications.
+ * What might have been a stable copy at the start of the iteration, need not
+ * remain so for the duration of the iteration.
+ *
+ * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
+ * see the comment in lib/rbtree.c. Note however that we only require the first
+ * condition -- not seeing partial stores -- because the latch thing isolates
+ * us from loops. If we were to interrupt a modification the lookup would be
+ * pointed at the stable tree and complete while the modification was halted.
+ */
+
+#ifndef RB_TREE_LATCH_H
+#define RB_TREE_LATCH_H
+
+#include <linux/rbtree.h>
+#include <linux/seqlock.h>
+
+struct latch_tree_node {
+	struct rb_node node[2];
+};
+
+struct latch_tree_root {
+	seqcount_t	seq;
+	struct rb_root	tree[2];
+};
+
+/**
+ * latch_tree_ops - operators to define the tree order
+ * @less: used for insertion; provides the (partial) order between two elements.
+ * @comp: used for lookups; provides the order between the search key and an element.
+ *
+ * The operators are related like:
+ *
+ *	comp(a->key,b) < 0  := less(a,b)
+ *	comp(a->key,b) > 0  := less(b,a)
+ *	comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
+ *
+ * If these operators define a partial order on the elements we make no
+ * guarantee on which of the elements matching the key is found. See
+ * latch_tree_find().
+ */
+struct latch_tree_ops {
+	bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b);
+	int  (*comp)(void *key,                 struct latch_tree_node *b);
+};
+
+static __always_inline struct latch_tree_node *
+__lt_from_rb(struct rb_node *node, int idx)
+{
+	return container_of(node, struct latch_tree_node, node[idx]);
+}
+
+static __always_inline void
+__lt_insert(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx,
+	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
+{
+	struct rb_root *root = &ltr->tree[idx];
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *node = &ltn->node[idx];
+	struct rb_node *parent = NULL;
+	struct latch_tree_node *ltp;
+
+	while (*link) {
+		parent = *link;
+		ltp = __lt_from_rb(parent, idx);
+
+		if (less(ltn, ltp))
+			link = &parent->rb_left;
+		else
+			link = &parent->rb_right;
+	}
+
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, root);
+}
+
+static __always_inline void
+__lt_erase(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx)
+{
+	rb_erase(&ltn->node[idx], &ltr->tree[idx]);
+}
+
+static __always_inline struct latch_tree_node *
+__lt_find(void *key, struct latch_tree_root *ltr, int idx,
+	  int (*comp)(void *key, struct latch_tree_node *node))
+{
+	struct rb_node *node = rcu_dereference_raw(ltr->tree[idx].rb_node);
+	struct latch_tree_node *ltn;
+	int c;
+
+	while (node) {
+		ltn = __lt_from_rb(node, idx);
+		c = comp(key, ltn);
+
+		if (c < 0)
+			node = rcu_dereference_raw(node->rb_left);
+		else if (c > 0)
+			node = rcu_dereference_raw(node->rb_right);
+		else
+			return ltn;
+	}
+
+	return NULL;
+}
+
+/**
+ * latch_tree_insert() - insert @node into the trees @root
+ * @node: nodes to insert
+ * @root: trees to insert @node into
+ * @ops: operators defining the node order
+ *
+ * It inserts @node into @root in an ordered fashion such that we can always
+ * observe one complete tree. See the comment for raw_write_seqcount_latch().
+ *
+ * The inserts use rcu_assign_pointer() to publish the element such that the
+ * tree structure is stored before we can observe the new @node.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_insert(struct latch_tree_node *node,
+		  struct latch_tree_root *root,
+		  const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(node, root, 0, ops->less);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(node, root, 1, ops->less);
+}
+
+/**
+ * latch_tree_erase() - removes @node from the trees @root
+ * @node: nodes to remove
+ * @root: trees to remove @node from
+ * @ops: operators defining the node order
+ *
+ * Removes @node from the trees @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * It is assumed that @node will observe one RCU quiescent state before being
+ * reused or freed.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_erase(struct latch_tree_node *node,
+		 struct latch_tree_root *root,
+		 const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(node, root, 0);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(node, root, 1);
+}
+
+/**
+ * latch_tree_find() - find the node matching @key in the trees @root
+ * @key: search key
+ * @root: trees to search for @key
+ * @ops: operators defining the node order
+ *
+ * Does a lockless lookup in the trees @root for the node matching @key.
+ *
+ * It is assumed that this is called while holding the appropriate RCU read
+ * side lock.
+ *
+ * If the operators define a partial order on the elements (there are multiple
+ * elements which have the same key value) it is undefined which of these
+ * elements will be found. Nor is it possible to iterate the tree to find
+ * further elements with the same key value.
+ *
+ * Returns: a pointer to the node matching @key or NULL.
+ */
+static __always_inline struct latch_tree_node *
+latch_tree_find(void *key, struct latch_tree_root *root,
+		const struct latch_tree_ops *ops)
+{
+	struct latch_tree_node *node;
+	unsigned int seq;
+
+	do {
+		seq = raw_read_seqcount(&root->seq);
+		node = __lt_find(key, root, seq & 1, ops->comp);
+	} while (read_seqcount_retry(&root->seq, seq));
+
+	return node;
+}
+
+#endif /* RB_TREE_LATCH_H */




* [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 06/10] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 16:49   ` Ingo Molnar
  2015-04-13 14:11 ` [PATCH v5 08/10] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-latch-tree.patch --]
[-- Type: text/plain, Size: 7185 bytes --]

Currently __module_address() is using a linear search through all
modules in order to find the module corresponding to the provided
address. With a lot of modules this can take a lot of time.

One of the users of this is kernel_text_address(), which is employed
by many stack unwinders, which in turn are used by perf-callchain and
ftrace (possibly from NMI context).

So by optimizing __module_address() we optimize many stack unwinders,
which are used by both perf and tracing in performance-sensitive code.
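
For illustration only (the in-tree path goes through kernel_text_address()),
the pattern such an unwinder ends up in is roughly:

	static bool frame_in_module(unsigned long addr)
	{
		bool ret;

		preempt_disable();	/* RCU-sched read side */
		ret = __module_address(addr) != NULL;
		preempt_enable();

		return ret;
	}

With the latched RB-tree the lookup becomes O(log n) and touches only a
handful of cachelines, instead of a linear walk over the whole module list.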

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   29 ++++++++++--
 kernel/module.c        |  117 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 138 insertions(+), 8 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include <linux/moduleparam.h>
 #include <linux/jump_label.h>
 #include <linux/export.h>
+#include <linux/rbtree_latch.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -210,6 +211,13 @@ enum module_state {
 	MODULE_STATE_UNFORMED,	/* Still setting it up. */
 };
 
+struct module;
+
+struct mod_tree_node {
+	struct module *mod;
+	struct latch_tree_node node;
+};
+
 struct module {
 	enum module_state state;
 
@@ -269,8 +277,15 @@ struct module {
 	/* Startup function. */
 	int (*init)(void);
 
-	/* If this is non-NULL, vfree after init() returns */
-	void *module_init;
+	/*
+	 * If this is non-NULL, vfree after init() returns.
+	 *
+	 * Cacheline align here, such that:
+	 *   module_init, module_core, init_size, core_size,
+	 *   init_text_size, core_text_size and ltn_core.node[0]
+	 * are on the same cacheline.
+	 */
+	void *module_init	____cacheline_aligned;
 
 	/* Here is the actual code + data, vfree'd on unload. */
 	void *module_core;
@@ -281,6 +296,14 @@ struct module {
 	/* The size of the executable code in each section.  */
 	unsigned int init_text_size, core_text_size;
 
+	/*
+	 * We want mtn_core::{mod,node[0]} to be in the same cacheline as the
+	 * above entries such that a regular lookup will only touch the one
+	 * cacheline.
+	 */
+	struct mod_tree_node	mtn_core;
+	struct mod_tree_node	mtn_init;
+
 	/* Size of RO sections of the module (text+rodata) */
 	unsigned int init_ro_size, core_ro_size;
 
@@ -365,7 +388,7 @@ struct module {
 	ctor_fn_t *ctors;
 	unsigned int num_ctors;
 #endif
-};
+} ____cacheline_aligned;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
 #endif
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -101,6 +101,109 @@
 DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
+
+/*
+ * Use a latched RB-tree for __module_address(); this allows us to use
+ * RCU-sched lookups of the address from any context.
+ *
+ * Because modules have two address ranges: init and core, we need two
+ * latch_tree_nodes entries. Therefore we need the back-pointer from
+ * mod_tree_node.
+ *
+ * Because init ranges are short lived we mark them unlikely and have placed
+ * them outside the critical cacheline in struct module.
+ */
+
+static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
+{
+	struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+	struct module *mod = mtn->mod;
+
+	if (unlikely(mtn == &mod->mtn_init))
+		return (unsigned long)mod->module_init;
+
+	return (unsigned long)mod->module_core;
+}
+
+static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
+{
+	struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+	struct module *mod = mtn->mod;
+
+	if (unlikely(mtn == &mod->mtn_init))
+		return (unsigned long)mod->init_size;
+
+	return (unsigned long)mod->core_size;
+}
+
+static __always_inline bool
+mod_tree_less(struct latch_tree_node *a, struct latch_tree_node *b)
+{
+	return __mod_tree_val(a) < __mod_tree_val(b);
+}
+
+static __always_inline int
+mod_tree_comp(void *key, struct latch_tree_node *n)
+{
+	unsigned long val = (unsigned long)key;
+	unsigned long start, end;
+
+	end = start = __mod_tree_val(n);
+	end += __mod_tree_size(n);
+
+	if (val < start)
+		return -1;
+
+	if (val >= end)
+		return 1;
+
+	return 0;
+}
+
+static const struct latch_tree_ops mod_tree_ops = {
+	.less = mod_tree_less,
+	.comp = mod_tree_comp,
+};
+
+static struct latch_tree_root mod_tree __cacheline_aligned;
+
+/*
+ * These modifications: insert, remove_init and remove; are serialized by the
+ * module_mutex.
+ */
+static void mod_tree_insert(struct module *mod)
+{
+	mod->mtn_core.mod = mod;
+	mod->mtn_init.mod = mod;
+
+	latch_tree_insert(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	if (mod->init_size)
+		latch_tree_insert(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+}
+
+static void mod_tree_remove_init(struct module *mod)
+{
+	if (mod->init_size)
+		latch_tree_erase(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+}
+
+static void mod_tree_remove(struct module *mod)
+{
+	latch_tree_erase(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	mod_tree_remove_init(mod);
+}
+
+static struct module *mod_tree_find(unsigned long addr)
+{
+	struct latch_tree_node *ltn;
+
+	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+	if (!ltn)
+		return NULL;
+
+	return container_of(ltn, struct mod_tree_node, node)->mod;
+}
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -1880,6 +1983,7 @@ static void free_module(struct module *m
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
 	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
@@ -3126,6 +3230,7 @@ static noinline int do_init_module(struc
 	mod->symtab = mod->core_symtab;
 	mod->strtab = mod->core_strtab;
 #endif
+	mod_tree_remove_init(mod);
 	unset_module_init_ro_nx(mod);
 	module_arch_freeing_init(mod);
 	mod->module_init = NULL;
@@ -3196,6 +3301,7 @@ static int add_unformed_module(struct mo
 		goto out;
 	}
 	list_add_rcu(&mod->list, &modules);
+	mod_tree_insert(mod);
 	err = 0;
 
 out:
@@ -3395,6 +3501,7 @@ static int load_module(struct load_info
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	wake_up_all(&module_wq);
 	/* Wait for RCU-sched synchronizing before releasing mod->list. */
 	synchronize_sched();
@@ -3842,13 +3949,13 @@ struct module *__module_address(unsigned
 
 	module_assert_mutex_or_preempt();
 
-	list_for_each_entry_rcu(mod, &modules, list) {
+	mod = mod_tree_find(addr);
+	if (mod) {
+		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod))
-			return mod;
+			mod = NULL;
 	}
-	return NULL;
+	return mod;
 }
 EXPORT_SYMBOL_GPL(__module_address);
 




* [PATCH v5 08/10] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (6 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 16:53   ` Ingo Molnar
  2015-04-13 14:11 ` [PATCH v5 09/10] module: Use __module_address() for module_address_lookup() Peter Zijlstra
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz, Andrew Morton

[-- Attachment #1: peterz-modtree-optional.patch --]
[-- Type: text/plain, Size: 2437 bytes --]

Andrew worried about the overhead on small systems; only use the fancy
code when either perf or tracing is enabled.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -102,6 +102,8 @@ DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_TRACING)
+
 /*
  * Use a latched RB-tree for __module_address(); this allows us to use
  * RCU-sched lookups of the address from any context.
@@ -112,6 +114,10 @@ static LIST_HEAD(modules);
  *
  * Because init ranges are short lived we mark them unlikely and have placed
  * them outside the critical cacheline in struct module.
+ *
+ * This is conditional on PERF_EVENTS || TRACING because those can really hit
+ * __module_address() hard by doing a lot of stack unwinding; potentially from
+ * NMI context.
  */
 
 static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
@@ -193,7 +199,7 @@ static void mod_tree_remove(struct modul
 	mod_tree_remove_init(mod);
 }
 
-static struct module *mod_tree_find(unsigned long addr)
+static struct module *mod_find(unsigned long addr)
 {
 	struct latch_tree_node *ltn;
 
@@ -204,6 +210,26 @@ static struct module *mod_tree_find(unsi
 	return container_of(ltn, struct mod_tree_node, node)->mod;
 }
 
+#else /* !(PERF_EVENTS || TRACING) */
+
+static void mod_tree_insert(struct module *mod) { }
+static void mod_tree_remove_init(struct module *mod) { }
+static void mod_tree_remove(struct module *mod) { }
+
+static struct module *mod_find(unsigned long addr)
+{
+	struct module *mod;
+
+	list_for_each_entry_rcu(mod, &modules, list) {
+		if (within_module(addr, mod))
+			return mod;
+	}
+
+	return NULL;
+}
+
+#endif /* !(PERF_EVENTS || TRACING) */
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -3949,7 +3975,7 @@ struct module *__module_address(unsigned
 
 	module_assert_mutex_or_preempt();
 
-	mod = mod_tree_find(addr);
+	mod = mod_find(addr);
 	if (mod) {
 		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)




* [PATCH v5 09/10] module: Use __module_address() for module_address_lookup()
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (7 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 08/10] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 14:11 ` [PATCH v5 10/10] module: Rework module_addr_{min,max} Peter Zijlstra
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-use-__module_address.patch --]
[-- Type: text/plain, Size: 1127 bytes --]

Use the generic __module_address() addr-to-struct-module lookup
instead of open-coding it once more.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3515,19 +3515,15 @@ const char *module_address_lookup(unsign
 			    char **modname,
 			    char *namebuf)
 {
-	struct module *mod;
 	const char *ret = NULL;
+	struct module *mod;
 
 	preempt_disable();
-	list_for_each_entry_rcu(mod, &modules, list) {
-		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod)) {
-			if (modname)
-				*modname = mod->name;
-			ret = get_ksymbol(mod, addr, size, offset);
-			break;
-		}
+	mod = __module_address(addr);
+	if (mod) {
+		if (modname)
+			*modname = mod->name;
+		ret = get_ksymbol(mod, addr, size, offset);
 	}
 	/* Make a copy in here where it's safe */
 	if (ret) {
@@ -3535,6 +3531,7 @@ const char *module_address_lookup(unsign
 		ret = namebuf;
 	}
 	preempt_enable();
+
 	return ret;
 }
 




* [PATCH v5 10/10] module: Rework module_addr_{min,max}
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (8 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 09/10] module: Use __module_address() for module_address_lookup() Peter Zijlstra
@ 2015-04-13 14:11 ` Peter Zijlstra
  2015-04-13 16:56   ` Ingo Molnar
  2015-04-13 17:02 ` [PATCH v5 00/10] latched RB-trees and __module_address() Ingo Molnar
  2015-04-14  2:57 ` Rusty Russell
  11 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 14:11 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

[-- Attachment #1: peterz-module-mod_tree-line.patch --]
[-- Type: text/plain, Size: 6104 bytes --]

__module_address() does an initial bound check before doing the
{list/tree} iteration to find the actual module. The bound variables
are nowhere near the mod_tree cacheline; in fact, they're nowhere near
one another.

module_addr_min lives in .data while module_addr_max lives in .bss
(smarty pants GCC thinks the explicit 0 assignment is a mistake).

Rectify this by moving the two variables into a structure together
with the latch_tree_root to guarantee they all share the same
cacheline and avoid hitting two extra cachelines for the lookup.

While reworking the bounds code, move the bound update from allocation
to insertion time; this avoids updating the bounds for a few error
paths.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   84 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 54 insertions(+), 30 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -171,7 +171,26 @@ static const struct latch_tree_ops mod_t
 	.comp = mod_tree_comp,
 };
 
-static struct latch_tree_root mod_tree __cacheline_aligned;
+static struct mod_tree_root {
+	struct latch_tree_root root;
+	unsigned long addr_min;
+	unsigned long addr_max;
+} mod_tree __cacheline_aligned = {
+	.addr_min = -1UL,
+};
+
+#define module_addr_min mod_tree.addr_min
+#define module_addr_max mod_tree.addr_max
+
+static noinline void __mod_tree_insert(struct mod_tree_node *node)
+{
+	latch_tree_insert(&node->node, &mod_tree.root, &mod_tree_ops);
+}
+
+static void __mod_tree_remove(struct mod_tree_node *node)
+{
+	latch_tree_erase(&node->node, &mod_tree.root, &mod_tree_ops);
+}
 
 /*
  * These modifications: insert, remove_init and remove; are serialized by the
@@ -182,20 +201,20 @@ static void mod_tree_insert(struct modul
 	mod->mtn_core.mod = mod;
 	mod->mtn_init.mod = mod;
 
-	latch_tree_insert(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	__mod_tree_insert(&mod->mtn_core);
 	if (mod->init_size)
-		latch_tree_insert(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+		__mod_tree_insert(&mod->mtn_init);
 }
 
 static void mod_tree_remove_init(struct module *mod)
 {
 	if (mod->init_size)
-		latch_tree_erase(&mod->mtn_init.node, &mod_tree, &mod_tree_ops);
+		__mod_tree_remove(&mod->mtn_init);
 }
 
 static void mod_tree_remove(struct module *mod)
 {
-	latch_tree_erase(&mod->mtn_core.node, &mod_tree, &mod_tree_ops);
+	__mod_tree_remove(&mod->mtn_core);
 	mod_tree_remove_init(mod);
 }
 
@@ -203,7 +222,7 @@ static struct module *mod_find(unsigned
 {
 	struct latch_tree_node *ltn;
 
-	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+	ltn = latch_tree_find((void *)addr, &mod_tree.root, &mod_tree_ops);
 	if (!ltn)
 		return NULL;
 
@@ -212,6 +231,12 @@ static struct module *mod_find(unsigned
 
 #else /* !(PERF_EVENTS || TRACING) */
 
+/*
+ * Bounds of module allocation, for speeding up __module_address.
+ * Protected by module_mutex.
+ */
+static unsigned long module_addr_min = -1UL, module_addr_max = 0;
+
 static void mod_tree_insert(struct module *mod) { }
 static void mod_tree_remove_init(struct module *mod) { }
 static void mod_tree_remove(struct module *mod) { }
@@ -230,6 +255,24 @@ static struct module *mod_find(unsigned
 
 #endif /* !(PERF_EVENTS || TRACING) */
 
+static void __mod_update_bounds(void *base, unsigned int size)
+{
+	unsigned long min = (unsigned long)base;
+	unsigned long max = min + size;
+
+	if (min < module_addr_min)
+		module_addr_min = min;
+	if (max > module_addr_max)
+		module_addr_max = max;
+}
+
+static void mod_update_bounds(struct module *mod)
+{
+	__mod_update_bounds(mod->module_core, mod->core_size);
+	if (mod->init_size)
+		__mod_update_bounds(mod->module_init, mod->init_size);
+}
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -300,10 +343,6 @@ static DECLARE_WAIT_QUEUE_HEAD(module_wq
 
 static BLOCKING_NOTIFIER_HEAD(module_notify_list);
 
-/* Bounds of module allocation, for speeding __module_address.
- * Protected by module_mutex. */
-static unsigned long module_addr_min = -1UL, module_addr_max = 0;
-
 int register_module_notifier(struct notifier_block *nb)
 {
 	return blocking_notifier_chain_register(&module_notify_list, nb);
@@ -2542,22 +2581,6 @@ void * __weak module_alloc(unsigned long
 	return vmalloc_exec(size);
 }
 
-static void *module_alloc_update_bounds(unsigned long size)
-{
-	void *ret = module_alloc(size);
-
-	if (ret) {
-		mutex_lock(&module_mutex);
-		/* Update module bounds. */
-		if ((unsigned long)ret < module_addr_min)
-			module_addr_min = (unsigned long)ret;
-		if ((unsigned long)ret + size > module_addr_max)
-			module_addr_max = (unsigned long)ret + size;
-		mutex_unlock(&module_mutex);
-	}
-	return ret;
-}
-
 #ifdef CONFIG_DEBUG_KMEMLEAK
 static void kmemleak_load_module(const struct module *mod,
 				 const struct load_info *info)
@@ -2942,7 +2965,7 @@ static int move_module(struct module *mo
 	void *ptr;
 
 	/* Do the allocs. */
-	ptr = module_alloc_update_bounds(mod->core_size);
+	ptr = module_alloc(mod->core_size);
 	/*
 	 * The pointer to this block is stored in the module structure
 	 * which is inside the block. Just mark it as not being a
@@ -2956,7 +2979,7 @@ static int move_module(struct module *mo
 	mod->module_core = ptr;
 
 	if (mod->init_size) {
-		ptr = module_alloc_update_bounds(mod->init_size);
+		ptr = module_alloc(mod->init_size);
 		/*
 		 * The pointer to this block is stored in the module structure
 		 * which is inside the block. This block doesn't need to be
@@ -3327,6 +3350,7 @@ static int add_unformed_module(struct mo
 		goto out;
 	}
 	list_add_rcu(&mod->list, &modules);
+	mod_update_bounds(mod);
 	mod_tree_insert(mod);
 	err = 0;
 
@@ -3967,11 +3991,11 @@ struct module *__module_address(unsigned
 {
 	struct module *mod;
 
+	module_assert_mutex_or_preempt();
+
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
-	module_assert_mutex_or_preempt();
-
 	mod = mod_find(addr);
 	if (mod) {
 		BUG_ON(!within_module(addr, mod));



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/10] module: Sanitize RCU usage and locking
  2015-04-13 14:11 ` [PATCH v5 01/10] module: Sanitize RCU usage and locking Peter Zijlstra
@ 2015-04-13 15:32   ` Ingo Molnar
  2015-04-13 15:40     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux


* Peter Zijlstra <peterz@infradead.org> wrote:

> Currently the RCU usage in module is an inconsistent mess of RCU and
> RCU-sched, this is broken for CONFIG_PREEMPT where synchronize_rcu()
> does not imply synchronize_sched().
> 
> Most usage sites use preempt_{dis,en}able() which is RCU-sched, but
> (most of) the modification sites use synchronize_rcu(). With the
> exception of the module bug list, which actually uses RCU.
> 
> Convert everything over to RCU-sched.
> 
> Furthermore add lockdep asserts to all sites, because its not at all
> clear to me the required locking is observed, esp. on exported
> functions.

nit:

s/its/it's

> +static void module_assert_mutex_or_preempt(void)
> +{
> +#ifdef CONFIG_LOCKDEP
> +	int rcu_held = rcu_read_lock_sched_held();
> +	int mutex_held = 1;
> +
> +	if (debug_locks)
> +		mutex_held = lockdep_is_held(&module_mutex);
> +
> +	WARN_ON(!rcu_held && !mutex_held);

So because rcu_read_lock_sched_held() also depends on debug_locks 
being on to be fully correct, shouldn't the warning also be within the 
debug_locks condition?

> @@ -3106,11 +3128,11 @@ static noinline int do_init_module(struc
>  	mod->init_text_size = 0;
>  	/*
>  	 * We want to free module_init, but be aware that kallsyms may be
> +	 * walking this with preempt disabled.  In all the failure paths, we
> +	 * call synchronize_sched, but we don't want to slow down the success
> +	 * path, so use actual RCU here.

nit:

s/synchronize_sched
 /synchronize_sched()

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/10] module: Sanitize RCU usage and locking
  2015-04-13 15:32   ` Ingo Molnar
@ 2015-04-13 15:40     ` Peter Zijlstra
  2015-04-13 16:32       ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 15:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

On Mon, Apr 13, 2015 at 05:32:29PM +0200, Ingo Molnar wrote:
> > +static void module_assert_mutex_or_preempt(void)
> > +{
> > +#ifdef CONFIG_LOCKDEP
> > +	int rcu_held = rcu_read_lock_sched_held();
> > +	int mutex_held = 1;
> > +
> > +	if (debug_locks)
> > +		mutex_held = lockdep_is_held(&module_mutex);
> > +
> > +	WARN_ON(!rcu_held && !mutex_held);
> 
> So because rcu_read_lock_sched_held() also depends on debug_locks 
> being on to be fully correct, shouldn't the warning also be within the 
> debug_locks condition?

Ah, see how mutex_held will be true for !debug_locks and therefore we'll
not trigger the warn.

Maybe not the best way to code that though.

Something like so perhaps:

static void module_assert_mutex_or_preempt(void)
{
#ifdef CONFIG_LOCKDEP
	if (!debug_locks)
		return;

	WARN_ON(!rcu_read_lock_sched_held() &&
		!lockdep_is_held(&module_mutex));
#endif
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 04/10] rbtree: Make lockless searches non-fatal
  2015-04-13 14:11 ` [PATCH v5 04/10] rbtree: Make lockless searches non-fatal Peter Zijlstra
@ 2015-04-13 15:50   ` Ingo Molnar
  2015-04-13 19:10     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, David Woodhouse, Rik van Riel,
	Andrea Arcangeli, Michel Lespinasse


* Peter Zijlstra <peterz@infradead.org> wrote:

> Change the insert and erase code such that lockless searches are
> non-fatal.
> 
> In and of itself an rbtree cannot be correctly searched while
> in-modification, we can however provide weaker guarantees that will
> allow the rbtree to be used in conjunction with other techniques, such
> as latches; see 9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").
> 
> For this to work we need the following guarantees from the rbtree
> code:
> 
>  1) a lockless reader must not see partial stores, this would allow it
>     to observe nodes that are invalid memory.
> 
>  2) there must not be (temporary) loops in the tree structure in the
>     modifier's program order, this would cause a lookup which
>     interrupts the modifier to get stuck indefinitely.
> 
> For 1) we must use WRITE_ONCE() for all updates to the tree structure;
> in particular this patch only does rb_{left,right} as those are the
> only element required for simple searches.
> 
> It generates slightly worse code, probably because volatile. But in
> pointer chasing heavy code a few instructions more should not matter.

So I had a look at code generation on x86/64-defconfig, it adds 2 more 
instructions, out of 900+ instructions total:

   text    data     bss     dec     hex filename
   3299       0       0    3299     ce3 rbtree.o.before
   3308       0       0    3308     cec rbtree.o.after

One of the instructions is a MOV, the other AFAICS is a NOP due to 
changed jump target alignment.

Interestingly, when compiled with -Os then your patch actually 
_shrinks_ the code:

   text    data     bss     dec     hex filename
   2524       0       0    2524     9dc rbtree.o.before
   2440       0       0    2440     988 rbtree.o.after

and rather significantly so. This is with GCC 4.9. Possibly your patch 
unconfused GCC somehow.

So just for kicks I applied my patch-set that fixes up jump target 
alignments, and the numbers with the regular -O2 became:

   text    data     bss     dec     hex filename
   2995       0       0    2995     bb3 rbtree.o.before
   2981       0       0    2981     ba5 rbtree.o.after

so your patch shrinks rbtree.o even without -Os, so it's probably a 
speedup and doesn't generate worse code once GCC's alignment sillies 
are righted!

>  	*rb_link = node;
>  }
>  
> +static inline void rb_link_node_rcu(struct rb_node * node, struct rb_node * parent,
> +				    struct rb_node ** rb_link)

Minor stylistic nit, the standard pattern I suspect has spaces fewer 
by three:

static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node *parent,
				    struct rb_node **rb_link)

> +/*
> + * Notes on lockless lookups:
> + *
> + * All stores to the tree structure (rb_left and rb_right) must be done using
> + * WRITE_ONCE(). And we must not inadvertently cause (temporary) loops in the
> + * tree structure as seen in program order.
> + *
> + * These two requirements will allow lockless iteration of the tree -- not
> + * correct iteration mind you, tree rotations are not atomic so a lookup might
> + * miss entire subtrees.
> + *
> + * But they do guarantee that any such traversal will only see valid elements
> + * and that it will indeed complete -- does not get stuck in a loop.
> + *
> + * It also guarantees that if the lookup returns an element it is the 'correct'
> + * one. But not returning an element does _NOT_ mean its not present.

s/its/it's

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 14:11 ` [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
@ 2015-04-13 16:32   ` Ingo Molnar
  2015-04-13 17:08     ` Mathieu Desnoyers
  2015-04-13 19:17     ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse


* Peter Zijlstra <peterz@infradead.org> wrote:

> +/**
>   * raw_write_seqcount_latch - redirect readers to even/odd copy
>   * @s: pointer to seqcount_t
> + *
> + * The latch technique is a multiversion concurrency control method that allows
> + * queries during non atomic modifications. If you can guarantee queries never
> + * interrupt the modification -- e.g. the concurrency is strictly between CPUs
> + * -- you most likely do not need this.

Speling nit:

 triton:~/tip> git grep -i 'non-atomic' | wc -l
 160
 triton:~/tip> git grep -i 'non atomic' | wc -l
 21

so I guess 'non-atomic' wins?

> + *
> + * Where the traditional RCU/lockless data structures rely on atomic
> + * modifications to ensure queries observe either the old or the new state the
> + * latch allows the same for non atomic updates. The trade-off is doubling the
> + * cost of storage; we have to maintain two copies of the entire data
> + * structure.

s/non atomic/non-atomic

> + * The query will have a form like:
> + *
> + * struct entry *latch_query(struct latch_struct *latch, ...)
> + * {
> + *	struct entry *entry;
> + *	unsigned seq, idx;
> + *
> + *	do {
> + *		seq = latch->seq;
> + *		smp_rmb();
> + *
> + *		idx = seq & 0x01;
> + *		entry = data_query(latch->data[idx], ...);
> + *
> + *		smp_rmb();
> + *	} while (seq != latch->seq);

Btw., I realize this is just a sample, but couldn't this be written 
more optimally as:

	do {
		seq = READ_ONCE(latch->seq);
		smp_read_barrier_depends();

		idx = seq & 0x01;
		entry = data_query(latch->data[idx], ...);

		smp_rmb();
	} while (seq != latch->seq);

Note that there's just a single smp_rmb() barrier: the READ_ONCE() is 
there to make sure GCC doesn't calculate 'idx' from two separate 
reads, but otherwise there's a direct data dependency on latch->seq so 
no smp_rmb() is needed, only a data dependency barrier when doing the 
first lookup AFAICS?

(This doesn't matter on x86 where smp_rmb() is barrier().)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/10] module: Sanitize RCU usage and locking
  2015-04-13 15:40     ` Peter Zijlstra
@ 2015-04-13 16:32       ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 13, 2015 at 05:32:29PM +0200, Ingo Molnar wrote:
> > > +static void module_assert_mutex_or_preempt(void)
> > > +{
> > > +#ifdef CONFIG_LOCKDEP
> > > +	int rcu_held = rcu_read_lock_sched_held();
> > > +	int mutex_held = 1;
> > > +
> > > +	if (debug_locks)
> > > +		mutex_held = lockdep_is_held(&module_mutex);
> > > +
> > > +	WARN_ON(!rcu_held && !mutex_held);
> > 
> > So because rcu_read_lock_sched_held() also depends on debug_locks 
> > being on to be fully correct, shouldn't the warning also be within the 
> > debug_locks condition?
> 
> Ah, see how mutex_held will be true for !debug_locks and therefore we'll
> not trigger the warn.
> 
> Maybe not the best way to code that though.
> 
> Something like so perhaps:
> 
> static void module_assert_mutex_or_preempt(void)
> {
> #ifdef CONFIG_LOCKDEP
> 	if (!debug_locks)
> 		return;
> 
> 	WARN_ON(!rcu_read_lock_sched_held() &&
> 		!lockdep_is_held(&module_mutex));
> #endif

Yeah. I'd also make it:

 	if (unlikely(!debug_locks))
 		return;

or such.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 06/10] rbtree: Implement generic latch_tree
  2015-04-13 14:11 ` [PATCH v5 06/10] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-04-13 16:43   ` Ingo Molnar
  2015-04-13 19:20     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel


* Peter Zijlstra <peterz@infradead.org> wrote:

> Implement a latched RB-tree in order to get unconditional RCU/lockless
> lookups.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: David Woodhouse <David.Woodhouse@intel.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/rbtree_latch.h |  212 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 212 insertions(+)
> 
> --- /dev/null
> +++ b/include/linux/rbtree_latch.h
> @@ -0,0 +1,212 @@
> +/*
> + * Latched RB-trees
> + *
> + * Copyright (C) 2015 Intel Corp., Peter Zijlstra <peterz@infradead.org>
> + *
> + * Since RB-trees have non atomic modifications they're not immediately suited
> + * for RCU/lockless queries. Even though we made RB tree lookups non-fatal for
> + * lockless lookups; we cannot guarantee they return a correct result.
> + *
> + * The simplest solution is a seqlock + rb-tree, this will allow lockless
> + * lookups; but has the constraint (inherent to the seqlock) that read sides
> + * cannot nest in write sides.
> + *
> + * If we need to allow unconditional lookups (say as required for NMI context
> + * usage) we need a more complex setup; this data structure provides this by
> + * employing the latch technique -- see @raw_write_seqcount_latch -- to
> + * implement a latched RB-tree which does allow for unconditional lookups by
> + * virtue of always having (at least) one stable copy of the tree.
> + *
> + * However, while we have the guarantee that there is at all times one stable
> + * copy, this does not guarantee an iteration will not observe modifications.
> + * What might have been a stable copy at the start of the iteration, need not
> + * remain so for the duration of the iteration.
> + *
> + * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
> + * see the comment in lib/rbtree.c. Note however that we only require the first
> + * condition -- not seeing partial stores -- because the latch thing isolates
> + * us from loops. If we were to interrupt a modification the lookup would be
> + * pointed at the stable tree and complete while the modification was halted.

Minor nit: so this text has 3 variants to spell RB-trees:

	RB-tree
	RB tree
	rb-tree

I suggest we pick one! :-)
	
Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree
  2015-04-13 14:11 ` [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
@ 2015-04-13 16:49   ` Ingo Molnar
  2015-04-14 12:31     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux


* Peter Zijlstra <peterz@infradead.org> wrote:

> Currently __module_address() is using a linear search through all
> modules in order to find the module corresponding to the provided
> address. With a lot of modules this can take a lot of time.
>
> One of the users of this is kernel_text_address() which is employed 
> in many stack unwinders; which in turn are used by perf-callchain 
> and ftrace (possibly from NMI context).
> 
> So by optimizing __module_address() we optimize many stack unwinders 
> which are used by both perf and tracing in performance sensitive 
> code.

So my (rather typical) workstation has 116 modules loaded currently - 
but setups using in excess of 150 modules are not uncommon either.

A linear list walk of 100-150 entries for every single call chain 
entry that hits some module, in 'perf record -g', can cause some 
overhead!

> +	/*
> +	 * If this is non-NULL, vfree after init() returns.

s/vfree/vfree()

> +	/*
> +	 * We want mtn_core::{mod,node[0]} to be in the same cacheline as the
> +	 * above entries such that a regular lookup will only touch the one
> +	 * cacheline.

s/touch the one cacheline
 /touch one cacheline

?

> +static __always_inline int
> +mod_tree_comp(void *key, struct latch_tree_node *n)
> +{
> +	unsigned long val = (unsigned long)key;
> +	unsigned long start, end;
> +
> +	end = start = __mod_tree_val(n);
> +	end += __mod_tree_size(n);
> +
> +	if (val < start)
> +		return -1;
> +
> +	if (val >= end)
> +		return 1;
> +
> +	return 0;

So since we are counting nanoseconds, I suspect this could be written 
more optimally as:

{
	unsigned long val = (unsigned long)key;
	unsigned long start, end;

	start = __mod_tree_val(n);
	if (val < start)
		return -1;

	end = start + __mod_tree_size(n);
	if (val >= end)
		return 1;

	return 0;
}

right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 08/10] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
  2015-04-13 14:11 ` [PATCH v5 08/10] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
@ 2015-04-13 16:53   ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> Andrew worried about the overhead on small systems; only use the fancy
> code when either perf or tracing is enabled.
> 
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Requested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/module.c |   30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -102,6 +102,8 @@ DEFINE_MUTEX(module_mutex);
>  EXPORT_SYMBOL_GPL(module_mutex);
>  static LIST_HEAD(modules);
>  
> +#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_TRACING)
> +
>  /*
>   * Use a latched RB-tree for __module_address(); this allows us to use
>   * RCU-sched lookups of the address from any context.
> @@ -112,6 +114,10 @@ static LIST_HEAD(modules);
>   *
>   * Because init ranges are short lived we mark them unlikely and have placed
>   * them outside the critical cacheline in struct module.
> + *
> + * This is conditional on PERF_EVENTS || TRACING because those can really hit
> + * __module_address() hard by doing a lot of stack unwinding; potentially from
> + * NMI context.

So I think we'd be better off introducing a helper Kconfig bool for 
that, CONFIG_MODULE_LATCHED_LOOKUPS or so, and select that symbol from 
the perf and tracing Kconfig code directly?

Beyond it being a cleaner, self-maintaining construct, that would also 
allow other subsystems to set it as well, without having to modify 
kernel/module.c.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 10/10] module: Rework module_addr_{min,max}
  2015-04-13 14:11 ` [PATCH v5 10/10] module: Rework module_addr_{min,max} Peter Zijlstra
@ 2015-04-13 16:56   ` Ingo Molnar
  2015-04-14  2:55     ` Rusty Russell
  2015-04-14 12:56     ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux


* Peter Zijlstra <peterz@infradead.org> wrote:

> __module_address() does an initial bound check before doing the 
> {list/tree} iteration to find the actual module. The bound variables 
> are nowhere near the mod_tree cacheline, in fact they're nowhere 
> near one another.
> 
> module_addr_min lives in .data while module_addr_max lives in .bss 
> (smarty pants GCC thinks the explicit 0 assignment is a mistake).
> 
> Rectify this by moving the two variables into a structure together 
> with the latch_tree_root to guarantee they all share the same 
> cacheline and avoid hitting two extra cachelines for the lookup.
> 
> While reworking the bounds code, move the bound update from 
> allocation to insertion time, this avoids updating the bounds for a 
> few error paths.

> +static struct mod_tree_root {
> +	struct latch_tree_root root;
> +	unsigned long addr_min;
> +	unsigned long addr_max;
> +} mod_tree __cacheline_aligned = {
> +	.addr_min = -1UL,
> +};
> +
> +#define module_addr_min mod_tree.addr_min
> +#define module_addr_max mod_tree.addr_max

>  #else /* !(PERF_EVENTS || TRACING) */
>  
> +/*
> + * Bounds of module allocation, for speeding up __module_address.
> + * Protected by module_mutex.
> + */
> +static unsigned long module_addr_min = -1UL, module_addr_max = 0;

I suspect the same .data vs. .bss problem affects the #else branch as 
well?

If so then it would make sense IMHO to put the structure definition 
into generic code so that both variants benefit from the shared 
cacheline?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (9 preceding siblings ...)
  2015-04-13 14:11 ` [PATCH v5 10/10] module: Rework module_addr_{min,max} Peter Zijlstra
@ 2015-04-13 17:02 ` Ingo Molnar
  2015-04-14  2:57 ` Rusty Russell
  11 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-13 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux


* Peter Zijlstra <peterz@infradead.org> wrote:

> This series is aimed at making __module_address() go fast(er).
> 
> The reason for doing so is that most stack unwinders use kernel_text_address()
> to validate each frame. Perf and ftrace (can) end up doing a lot of stack
> traces from performance sensitive code.
> 
> On the way there it:
>  - annotates and sanitizes module locking
>  - introduces the latched RB-tree
>  - employs it to make __module_address() go fast.
> 
> I've build and boot tested this on x86_64 with modules and lockdep
> enabled.  Performance numbers (below) are done with lockdep disabled.
> 
> As previously mentioned; the reason for writing the latched RB-tree as generic
> code is mostly for clarity/documentation purposes; as there are a number of
> separate and non trivial bits to the complete solution.
> 
> As measued on my ivb-ep system with 84 modules loaded; prior to 
> patching the test module (below) reports (cache hot, performance 
> cpufreq):
> 
>           avg +- stdev
> Before:   611 +- 10 [ns] per __module_address() call
> After:     17 +-  5 [ns] per __module_address() call
> 
> PMI measurements for a cpu running loops in a module (also [ns]):
> 
> Before: Mean: 2719 +- 1, Stdev: 214, Samples: 40036
> After:  Mean:  947 +- 0, Stdev: 132, Samples: 40037

Those are some pretty impressive speedups!

I suspect the 900 nsecs residual PMI overhead is due to other, overly 
bloated PMI (perf) processing costs?

> Note; I have also tested things like: perf record -a -g modprobe 
> mod_test, to make 'sure' to hit some of the more interesting paths.
> 
> Changes since last time:
> 
>  - reworked generic latch_tree API (Lai Jiangshan)
>  - reworked module bounds (me)
>  - reworked all the testing code (not included)
> 
> Rusty, please consider merging this (for 4.2, I know its the merge 
> window, no rush)

So modulo the mostly trivial feedback I gave, it looks all good to me 
as well, feel free to also add my:

  Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 16:32   ` Ingo Molnar
@ 2015-04-13 17:08     ` Mathieu Desnoyers
  2015-04-13 17:43       ` Paul E. McKenney
  2015-04-13 19:17     ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Mathieu Desnoyers @ 2015-04-13 17:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, rusty, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse

----- Original Message -----
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
[...]
> > + * The query will have a form like:
> > + *
> > + * struct entry *latch_query(struct latch_struct *latch, ...)
> > + * {
> > + *	struct entry *entry;
> > + *	unsigned seq, idx;
> > + *
> > + *	do {
> > + *		seq = latch->seq;
> > + *		smp_rmb();
> > + *
> > + *		idx = seq & 0x01;
> > + *		entry = data_query(latch->data[idx], ...);
> > + *
> > + *		smp_rmb();
> > + *	} while (seq != latch->seq);
> 
> Btw., I realize this is just a sample, but couldn't this be written
> more optimally as:
> 
> 	do {
> 		seq = READ_ONCE(latch->seq);
> 		smp_read_barrier_depends();
> 
> 		idx = seq & 0x01;
> 		entry = data_query(latch->data[idx], ...);
> 
> 		smp_rmb();
> 	} while (seq != latch->seq);
> 
> Note that there's just a single smp_rmb() barrier: the READ_ONCE() is
> there to make sure GCC doesn't calculate 'idx' from two separate
> reads, but otherwise there's a direct data dependency on latch->seq so
> no smp_rmb() is needed, only a data dependency barrier when doing the
> first lookup AFAICS?
> 
> (This doesn't matter on x86 where smp_rmb() is barrier().)

The latch evolved from seqlock.h, where there was no
data dependency between the sequence counter and the
data read, hence the smp_rmb(). Indeed, there is a
data dependency in the case of the latch, so I think
your approach of READ_ONCE + smp_read_barrier_depends()
is appropriate.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 17:08     ` Mathieu Desnoyers
@ 2015-04-13 17:43       ` Paul E. McKenney
  2015-04-13 18:21         ` Linus Torvalds
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2015-04-13 17:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, rusty, oleg, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse

On Mon, Apr 13, 2015 at 05:08:19PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> [...]
> > > + * The query will have a form like:
> > > + *
> > > + * struct entry *latch_query(struct latch_struct *latch, ...)
> > > + * {
> > > + *	struct entry *entry;
> > > + *	unsigned seq, idx;
> > > + *
> > > + *	do {
> > > + *		seq = latch->seq;
> > > + *		smp_rmb();
> > > + *
> > > + *		idx = seq & 0x01;
> > > + *		entry = data_query(latch->data[idx], ...);
> > > + *
> > > + *		smp_rmb();
> > > + *	} while (seq != latch->seq);
> > 
> > Btw., I realize this is just a sample, but couldn't this be written
> > more optimally as:
> > 
> > 	do {
> > 		seq = READ_ONCE(latch->seq);
> > 		smp_read_barrier_depends();
> > 
> > 		idx = seq & 0x01;
> > 		entry = data_query(latch->data[idx], ...);
> > 
> > 		smp_rmb();
> > 	} while (seq != latch->seq);
> > 
> > Note that there's just a single smp_rmb() barrier: the READ_ONCE() is
> > there to make sure GCC doesn't calculate 'idx' from two separate
> > reads, but otherwise there's a direct data dependency on latch->seq so
> > no smp_rmb() is needed, only a data dependency barrier when doing the
> > first lookup AFAICS?
> > 
> > (This doesn't matter on x86 where smp_rmb() is barrier().)
> 
> The latch evolved from seqlock.h, where there was no
> data dependency between the sequence counter and the
> data read, hence the smp_rmb(). Indeed, there is a
> data dependency in the case of the latch, so I think
> your approach of READ_ONCE + smp_read_barrier_depends()
> is appropriate.

A shorthand for READ_ONCE + smp_read_barrier_depends() is the shiny
new lockless_dereference().  This helps readability, as it is often
non-trivial to work out which accesses smp_read_barrier_depends() is
helping to order.
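
A minimal sketch of what that looks like at a use site; the struct, the
global pointer and the header below are illustrative assumptions, not
anything from this series:

#include <linux/rcupdate.h>

struct conf {
	int threshold;
};

/* Published by an updater via rcu_assign_pointer() or smp_store_release(). */
static struct conf *cur_conf;

static int read_threshold(void)
{
	/* One primitive instead of READ_ONCE() + smp_read_barrier_depends(). */
	struct conf *c = lockless_dereference(cur_conf);

	return c ? c->threshold : -1;
}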

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 17:43       ` Paul E. McKenney
@ 2015-04-13 18:21         ` Linus Torvalds
  2015-04-13 18:42           ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2015-04-13 18:21 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Rusty Russell,
	Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen,
	Steven Rostedt, Thomas Gleixner, Lai Jiangshan, George Spelvin,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

On Mon, Apr 13, 2015 at 10:43 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> A shorthand for READ_ONCE + smp_read_barrier_depends() is the shiny
> new lockless_dereference()

Related side note - I think people should get used to seeing
"smp_load_acquire()". It has well-defined memory ordering properties
and should generally perform well on most architectures. It's (much)
stronger than lockless_dereference(), and together with
smp_store_release() you can make rather clear guarantees about passing
data locklessly from one CPU to another.

I'd like to see us use more of the pattern of

 - one thread does:

     .. allocate/create some data
      smp_store_release() to "expose it"

 - another thread does:

      smp_load_acquire() to read index/pointer/flag/whatever
      .. use the data any damn way you want ..

and we should probably aim to prefer that pattern over a lot of our
traditional memory barriers.
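
A minimal, self-contained sketch of that pattern in kernel-style C; all
names below are made up for illustration, and the barrier header is an
assumption:

#include <asm/barrier.h>

struct blob {
	int a, b;
};

static struct blob scratch;
static struct blob *published;	/* NULL until the data is complete */

/* One thread: create the data, then expose it with release semantics. */
static void publish_blob(void)
{
	scratch.a = 1;
	scratch.b = 2;
	smp_store_release(&published, &scratch);
}

/* Another thread: if the acquire load sees the pointer, the stores above are visible too. */
static int use_blob(void)
{
	struct blob *p = smp_load_acquire(&published);

	if (!p)
		return -1;	/* nothing published yet */
	return p->a + p->b;
}

The release/acquire pair is what makes the hand-off safe: the reader either
sees NULL or a pointer to fully initialized data.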

                          Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 18:21         ` Linus Torvalds
@ 2015-04-13 18:42           ` Paul E. McKenney
  2015-04-14 10:25             ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2015-04-13 18:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Rusty Russell,
	Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen,
	Steven Rostedt, Thomas Gleixner, Lai Jiangshan, George Spelvin,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

On Mon, Apr 13, 2015 at 11:21:46AM -0700, Linus Torvalds wrote:
> On Mon, Apr 13, 2015 at 10:43 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > A shorthand for READ_ONCE + smp_read_barrier_depends() is the shiny
> > new lockless_dereference()
> 
> Related side note - I think people should get used to seeing
> "smp_load_acquire()". It has well-defined memory ordering properties
> and should generally perform well on most architectures. It's (much)
> stronger than lockless_dereference(), and together with
> smp_store_release() you can make rather clear guarantees about passing
> data locklessly from one CPU to another.
> 
> I'd like to see us use more of the pattern of
> 
>  - one thread does:
> 
>      .. allocate/create some data
>       smp_store_release() to "expose it"
> 
>  - another thread does:
> 
>       smp_load_acquire() to read index/pointer/flag/whatever
>       .. use the data any damn way you want ..
> 
> and we should probably aim to prefer that pattern over a lot of our
> traditional memory barriers.

I couldn't agree more!

RCU made a similar move from open-coding smp_read_barrier_depends()
to using rcu_dereference() many years ago, and that change made RCU
code -much- easier to read and understand.  I believe that moving
from smp_mb(), smp_rmb(), and smp_wmb() to smp_store_release() and
smp_load_acquire() will provide similar maintainability benefits.
Furthermore, when the current code uses smp_mb(), smp_store_release() and
smp_load_acquire() generate faster code on most architectures.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 04/10] rbtree: Make lockless searches non-fatal
  2015-04-13 15:50   ` Ingo Molnar
@ 2015-04-13 19:10     ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 19:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, David Woodhouse, Rik van Riel,
	Andrea Arcangeli, Michel Lespinasse

On Mon, Apr 13, 2015 at 05:50:08PM +0200, Ingo Molnar wrote:
> > +static inline void rb_link_node_rcu(struct rb_node * node, struct rb_node * parent,
> > +				    struct rb_node ** rb_link)
> 
> Minor stylistic nit, the standard pattern I suspect has spaces fewer 
> by three:
> 
> static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node *parent,
> 				    struct rb_node **rb_link)

OK, I'll also update the existing code in that file.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 16:32   ` Ingo Molnar
  2015-04-13 17:08     ` Mathieu Desnoyers
@ 2015-04-13 19:17     ` Peter Zijlstra
  2015-04-13 19:47       ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 19:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse

On Mon, Apr 13, 2015 at 06:32:02PM +0200, Ingo Molnar wrote:
> 
> Btw., I realize this is just a sample, but couldn't this be written 
> more optimally as:
> 
> 	do {
> 		seq = READ_ONCE(latch->seq);
> 		smp_read_barrier_depends();
> 
> 		idx = seq & 0x01;
> 		entry = data_query(latch->data[idx], ...);
> 
> 		smp_rmb();
> 	} while (seq != latch->seq);
> 

So in the actual code we use raw_read_seqcount(), which includes the rmb.

This is true for the existing __ktime_get_fast_ns() as well as the new
latch_tree_find().

Should we look at introducing yet another seq primitive?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 06/10] rbtree: Implement generic latch_tree
  2015-04-13 16:43   ` Ingo Molnar
@ 2015-04-13 19:20     ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 19:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Mon, Apr 13, 2015 at 06:43:15PM +0200, Ingo Molnar wrote:
> Minor nit: so this text has 3 variants to spell RB-trees:
> 
> 	RB-tree
> 	RB tree
> 	rb-tree
> 
> I suggest we pick one! :-)

RB-tree it is ;-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 19:17     ` Peter Zijlstra
@ 2015-04-13 19:47       ` Peter Zijlstra
  2015-04-14 10:26         ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-13 19:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse

On Mon, Apr 13, 2015 at 09:17:50PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 13, 2015 at 06:32:02PM +0200, Ingo Molnar wrote:
> > 
> > Btw., I realize this is just a sample, but couldn't this be written 
> > more optimally as:
> > 
> > 	do {
> > 		seq = READ_ONCE(latch->seq);
> > 		smp_read_barrier_depends();
> > 
> > 		idx = seq & 0x01;
> > 		entry = data_query(latch->data[idx], ...);
> > 
> > 		smp_rmb();
> > 	} while (seq != latch->seq);
> > 

> Should we look at introducing yet another seq primitive?

Like so?

---
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -233,6 +233,11 @@ static inline void raw_write_seqcount_en
 	s->sequence++;
 }
 
+static inline int raw_read_seqcount_latch(seqcount_t *s)
+{
+	return lockless_dereference(s->sequence);
+}
+
 /**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
@@ -284,8 +289,7 @@ static inline void raw_write_seqcount_en
  *	unsigned seq, idx;
  *
  *	do {
- *		seq = latch->seq;
- *		smp_rmb();
+ *		seq = lockless_dereference(latch->seq);
  *
  *		idx = seq & 0x01;
  *		entry = data_query(latch->data[idx], ...);
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -393,7 +393,7 @@ static __always_inline u64 __ktime_get_f
 	u64 now;
 
 	do {
-		seq = raw_read_seqcount(&tkf->seq);
+		seq = raw_read_seqcount_latch(&tkf->seq);
 		tkr = tkf->base + (seq & 0x01);
 		now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);
 	} while (read_seqcount_retry(&tkf->seq, seq));

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 10/10] module: Rework module_addr_{min,max}
  2015-04-13 16:56   ` Ingo Molnar
@ 2015-04-14  2:55     ` Rusty Russell
  2015-04-14  6:42       ` Peter Zijlstra
  2015-04-14 12:56     ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2015-04-14  2:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel, andi,
	rostedt, tglx, laijs, linux

Ingo Molnar <mingo@kernel.org> writes:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> __module_address() does an initial bound check before doing the 
>> {list/tree} iteration to find the actual module. The bound variables 
>> are nowhere near the mod_tree cacheline, in fact they're nowhere 
>> near one another.
>> 
>> module_addr_min lives in .data while module_addr_max lives in .bss 
>> (smarty pants GCC thinks the explicit 0 assignment is a mistake).
>> 
>> Rectify this by moving the two variables into a structure together 
>> with the latch_tree_root to guarantee they all share the same 
>> cacheline and avoid hitting two extra cachelines for the lookup.
>> 
>> While reworking the bounds code, move the bound update from 
>> allocation to insertion time, this avoids updating the bounds for a 
>> few error paths.
>
>> +static struct mod_tree_root {
>> +	struct latch_tree_root root;
>> +	unsigned long addr_min;
>> +	unsigned long addr_max;
>> +} mod_tree __cacheline_aligned = {
>> +	.addr_min = -1UL,
>> +};
>> +
>> +#define module_addr_min mod_tree.addr_min
>> +#define module_addr_max mod_tree.addr_max

Nice catch.

Does the min/max comparison still win us anything?  (I'm guessing yes...)

In general, I'm happy with this series.  Assume you want another
go-round for Ingo's tweaks, then I'll take them for 4.2.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
                   ` (10 preceding siblings ...)
  2015-04-13 17:02 ` [PATCH v5 00/10] latched RB-trees and __module_address() Ingo Molnar
@ 2015-04-14  2:57 ` Rusty Russell
  2015-04-14  6:41   ` Peter Zijlstra
  11 siblings, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2015-04-14  2:57 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, laijs, linux, peterz

Peter Zijlstra <peterz@infradead.org> writes:
> This series is aimed at making __module_address() go fast(er).
>
> The reason for doing so is that most stack unwinders use kernel_text_address()
> to validate each frame. Perf and ftrace (can) end up doing a lot of stack
> traces from performance sensitive code.
>
> On the way there it:
>  - annotates and sanitizes module locking
>  - introduces the latched RB-tree
>  - employs it to make __module_address() go fast.
>
> I've build and boot tested this on x86_64 with modules and lockdep
> enabled.  Performance numbers (below) are done with lockdep disabled.
>
> As previously mentioned; the reason for writing the latched RB-tree as generic
> code is mostly for clarity/documentation purposes; as there are a number of
> separate and non trivial bits to the complete solution.
>
> As measued on my ivb-ep system with 84 modules loaded; prior to patching
> the test module (below) reports (cache hot, performance cpufreq):
>
>           avg +- stdev
> Before:   611 +- 10 [ns] per __module_address() call
> After:     17 +-  5 [ns] per __module_address() call
>
> PMI measurements for a cpu running loops in a module (also [ns]):
>
> Before:	Mean: 2719 +- 1, Stdev: 214, Samples: 40036
> After:  Mean:  947 +- 0, Stdev: 132, Samples: 40037
>
> Note; I have also tested things like: perf record -a -g modprobe
> mod_test, to make 'sure' to hit some of the more interesting paths.
>
> Changes since last time:
>
>  - reworked generic latch_tree API (Lai Jiangshan)
>  - reworked module bounds (me)
>  - reworked all the testing code (not included)
>
> Rusty, please consider merging this (for 4.2, I know its the merge window, no
> rush)

I was tempted to sneak in those module rcu fixes for 4.1, but seeing
Ingo's comments I'll wait for 4.2.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-04-14  2:57 ` Rusty Russell
@ 2015-04-14  6:41   ` Peter Zijlstra
  2015-04-15  4:41     ` Rusty Russell
  2015-05-28  2:07     ` Rusty Russell
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-14  6:41 UTC (permalink / raw)
  To: Rusty Russell
  Cc: mingo, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

On Tue, Apr 14, 2015 at 12:27:05PM +0930, Rusty Russell wrote:

> I was tempted to sneak in those module rcu fixes for 4.1, but seeing
> Ingo's comments I'll wait for 4.2.

I can get you a new version of that if you want. See below. The fixups
are unmodified from the posting (patches 2,3).

---
Subject: module: Sanitize RCU usage and locking
From: Peter Zijlstra <peterz@infradead.org>
Date: Sat Feb 28 19:17:04 CET 2015

Currently the RCU usage in module is an inconsistent mess of RCU and
RCU-sched, this is broken for CONFIG_PREEMPT where synchronize_rcu()
does not imply synchronize_sched().

Most usage sites use preempt_{dis,en}able() which is RCU-sched, but
(most of) the modification sites use synchronize_rcu(). With the
exception of the module bug list, which actually uses RCU.

Convert everything over to RCU-sched.

Furthermore add lockdep asserts to all sites, because it's not at all
clear to me the required locking is observed, esp. on exported
functions.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   12 ++++++++++--
 kernel/module.c        |   40 ++++++++++++++++++++++++++++++++--------
 lib/bug.c              |    7 +++++--
 3 files changed, 47 insertions(+), 12 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -419,14 +419,22 @@ struct symsearch {
 	bool unused;
 };
 
-/* Search for an exported symbol by name. */
+/*
+ * Search for an exported symbol by name.
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 const struct kernel_symbol *find_symbol(const char *name,
 					struct module **owner,
 					const unsigned long **crc,
 					bool gplok,
 					bool warn);
 
-/* Walk the exported symbol table */
+/*
+ * Walk the exported symbol table
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 				    struct module *owner,
 				    void *data), void *data);
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -105,6 +105,22 @@ static LIST_HEAD(modules);
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
 
+static void module_assert_mutex(void)
+{
+	lockdep_assert_held(&module_mutex);
+}
+
+static void module_assert_mutex_or_preempt(void)
+{
+#ifdef CONFIG_LOCKDEP
+	if (unlikely(!debug_locks))
+		return;
+
+	WARN_ON(!rcu_read_lock_sched_held() &&
+		!lockdep_is_held(&module_mutex));
+#endif
+}
+
 #ifdef CONFIG_MODULE_SIG
 #ifdef CONFIG_MODULE_SIG_FORCE
 static bool sig_enforce = true;
@@ -318,6 +334,8 @@ bool each_symbol_section(bool (*fn)(cons
 #endif
 	};
 
+	module_assert_mutex_or_preempt();
+
 	if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
 		return true;
 
@@ -457,6 +475,8 @@ static struct module *find_module_all(co
 {
 	struct module *mod;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -1854,8 +1874,8 @@ static void free_module(struct module *m
 	list_del_rcu(&mod->list);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
-	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
 
 	/* This may be NULL, but that's OK */
@@ -3106,11 +3126,11 @@ static noinline int do_init_module(struc
 	mod->init_text_size = 0;
 	/*
 	 * We want to free module_init, but be aware that kallsyms may be
-	 * walking this with preempt disabled.  In all the failure paths,
-	 * we call synchronize_rcu/synchronize_sched, but we don't want
-	 * to slow down the success path, so use actual RCU here.
+	 * walking this with preempt disabled.  In all the failure paths, we
+	 * call synchronize_sched(), but we don't want to slow down the success
+	 * path, so use actual RCU here.
 	 */
-	call_rcu(&freeinit->rcu, do_free_init);
+	call_rcu_sched(&freeinit->rcu, do_free_init);
 	mutex_unlock(&module_mutex);
 	wake_up_all(&module_wq);
 
@@ -3368,8 +3388,8 @@ static int load_module(struct load_info
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
 	wake_up_all(&module_wq);
-	/* Wait for RCU synchronizing before releasing mod->list. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
  free_module:
 	/* Free lock-classes; relies on the preceding sync_rcu() */
@@ -3636,6 +3656,8 @@ int module_kallsyms_on_each_symbol(int (
 	unsigned int i;
 	int ret;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -3810,6 +3832,8 @@ struct module *__module_address(unsigned
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
+	module_assert_mutex_or_preempt();
+
 	list_for_each_entry_rcu(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -66,7 +66,7 @@ static const struct bug_entry *module_fi
 	struct module *mod;
 	const struct bug_entry *bug = NULL;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	list_for_each_entry_rcu(mod, &module_bug_list, bug_list) {
 		unsigned i;
 
@@ -77,7 +77,7 @@ static const struct bug_entry *module_fi
 	}
 	bug = NULL;
 out:
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 
 	return bug;
 }
@@ -88,6 +88,8 @@ void module_bug_finalize(const Elf_Ehdr
 	char *secstrings;
 	unsigned int i;
 
+	lockdep_assert_held(&module_mutex);
+
 	mod->bug_table = NULL;
 	mod->num_bugs = 0;
 
@@ -113,6 +115,7 @@ void module_bug_finalize(const Elf_Ehdr
 
 void module_bug_cleanup(struct module *mod)
 {
+	lockdep_assert_held(&module_mutex);
 	list_del_rcu(&mod->bug_list);
 }
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 10/10] module: Rework module_addr_{min,max}
  2015-04-14  2:55     ` Rusty Russell
@ 2015-04-14  6:42       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-14  6:42 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, laijs, linux

On Tue, Apr 14, 2015 at 12:25:45PM +0930, Rusty Russell wrote:
> Ingo Molnar <mingo@kernel.org> writes:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> __module_address() does an initial bound check before doing the 
> >> {list/tree} iteration to find the actual module. The bound variables 
> >> are nowhere near the mod_tree cacheline, in fact they're nowhere 
> >> near one another.
> >> 
> >> module_addr_min lives in .data while module_addr_max lives in .bss 
> >> (smarty pants GCC thinks the explicit 0 assignment is a mistake).
> >> 
> >> Rectify this by moving the two variables into a structure together 
> >> with the latch_tree_root to guarantee they all share the same 
> >> cacheline and avoid hitting two extra cachelines for the lookup.
> >> 
> >> While reworking the bounds code, move the bound update from 
> >> allocation to insertion time, this avoids updating the bounds for a 
> >> few error paths.
> >
> >> +static struct mod_tree_root {
> >> +	struct latch_tree_root root;
> >> +	unsigned long addr_min;
> >> +	unsigned long addr_max;
> >> +} mod_tree __cacheline_aligned = {
> >> +	.addr_min = -1UL,
> >> +};
> >> +
> >> +#define module_addr_min mod_tree.addr_min
> >> +#define module_addr_max mod_tree.addr_max
> 
> Nice catch.
> 
> Does the min/max comparison still win us anything?  (I'm guessing yes...)

Yep, while a tree iteration is much faster than the linear list walk, it is
still quite a bit slower than two simple compares.

> In general, I'm happy with this series.  Assume you want another
> go-round for Ingo's tweaks, then I'll take them for 4.2.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 18:42           ` Paul E. McKenney
@ 2015-04-14 10:25             ` Ingo Molnar
  2015-04-14 13:04               ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2015-04-14 10:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Mathieu Desnoyers, Peter Zijlstra, Rusty Russell,
	Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen,
	Steven Rostedt, Thomas Gleixner, Lai Jiangshan, George Spelvin,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, Apr 13, 2015 at 11:21:46AM -0700, Linus Torvalds wrote:
> > On Mon, Apr 13, 2015 at 10:43 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > A shorthand for READ_ONCE + smp_read_barrier_depends() is the shiny
> > > new lockless_dereference()
> > 
> > Related side note - I think people should get used to seeing
> > "smp_load_acquire()". It has well-defined memory ordering properties
> > and should generally perform well on most architectures. It's (much)
> > stronger than lockless_dereference(), and together with
> > smp_store_release() you can make rather clear guarantees about passing
> > data locklessly from one CPU to another.
> > 
> > I'd like to see us use more of the pattern of
> > 
> >  - one thread does:
> > 
> >      .. allocate/create some data
> >       smp_store_release() to "expose it"
> > 
> >  - another thread does:
> > 
> >       smp_load_acquire() to read index/pointer/flag/whatever
> >       .. use the data any damn way you want ..
> > 
> > and we should probably aim to prefer that pattern over a lot of our
> > traditional memory barriers.
> 
> I couldn't agree more!

/me too!
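
For concreteness, a minimal sketch of that hand-off pattern (all names
below are made up for illustration, not taken from any existing code):

#include <linux/slab.h>
#include <asm/barrier.h>

struct payload {
	int a, b;
};

static struct payload *published;	/* NULL until the data is ready */

/* Writer: create the data, then expose it. */
static void producer(void)
{
	struct payload *p = kmalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return;
	p->a = 1;
	p->b = 2;
	/* Orders the initialization before the pointer becomes visible. */
	smp_store_release(&published, p);
}

/* Reader: the acquire orders the pointer load before any use of *p. */
static int consumer(void)
{
	struct payload *p = smp_load_acquire(&published);

	return p ? p->a + p->b : -1;
}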

> RCU made a similar move from open-coding smp_read_barrier_depends() 
> to using rcu_dereference() many years ago, and that change made RCU 
> code -much- easier to read and understand.  I believe that moving 
> from smp_mb(), smp_rmb(), and smp_wmb() to smp_store_release() and 
> smp_load_acquire() will provide similar maintainability benefits. 
> Furthermore, when the current code uses smp_mb(), 
> smp_store_release() and smp_load_acquire() generate faster code on 
> most architectures.

A similar maintainability argument can be made for locking: 
spin_lock(x) was a big step forward compared to lock_kernel(), 
primarily not because it improves scalability (it often does), but 
because the '(x)' very clearly documents the data structure that is 
being accessed and makes locking and data access bugs a lot more 
visible in the review phase already.

I wish rcu_read_lock() had a data argument, for similar reasons - even 
if it just pointed to a pre-existing lock or an rcu head it never 
touches ;-)

As an example I picked a random file out of the kernel that uses RCU: 
kernel/cpuset.c::validate_change():

static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
	struct cgroup_subsys_state *css;
	struct cpuset *c, *par;
	int ret;

	rcu_read_lock();

	/* Each of our child cpusets must be a subset of us */
	ret = -EBUSY;
	cpuset_for_each_child(c, css, cur)
		if (!is_cpuset_subset(c, trial))
			goto out;

	/* Remaining checks don't apply to root cpuset */
	ret = 0;
	if (cur == &top_cpuset)
		goto out;

	par = parent_cs(cur);

	/* On legacy hiearchy, we must be a subset of our parent cpuset. */
	ret = -EACCES;
	if (!cgroup_on_dfl(cur->css.cgroup) && !is_cpuset_subset(trial, par))
		goto out;

	/*
	 * If either I or some sibling (!= me) is exclusive, we can't
	 * overlap
	 */
	ret = -EINVAL;
	cpuset_for_each_child(c, css, par) {
		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
		    c != cur &&
		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
			goto out;
		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
		    c != cur &&
		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
			goto out;
	}

	/*
	 * Cpusets with tasks - existing or newly being attached - can't
	 * be changed to have empty cpus_allowed or mems_allowed.
	 */
	ret = -ENOSPC;
	if ((cgroup_has_tasks(cur->css.cgroup) || cur->attach_in_progress)) {
		if (!cpumask_empty(cur->cpus_allowed) &&
		    cpumask_empty(trial->cpus_allowed))
			goto out;
		if (!nodes_empty(cur->mems_allowed) &&
		    nodes_empty(trial->mems_allowed))
			goto out;
	}

	/*
	 * We can't shrink if we won't have enough room for SCHED_DEADLINE
	 * tasks.
	 */
	ret = -EBUSY;
	if (is_cpu_exclusive(cur) &&
	    !cpuset_cpumask_can_shrink(cur->cpus_allowed,
				       trial->cpus_allowed))
		goto out;

	ret = 0;
out:
	rcu_read_unlock();
	return ret;
}

So just from taking a glance at that function can you tell me what is 
being RCU protected here? I cannot, I can only guess that it must 
either be cpuset_for_each_child() or maybe the cpumasks or other 
internals.

And if I search the file for call_rcu() it shows me nothing. Only if I 
know that cpusets are integrated with cgroups and I search 
kernel/cgroup.c for call_rcu(), do I find:

        call_rcu(&css->rcu_head, css_free_rcu_fn);

aha!

... or if I drill down 3 levels into cpuset_for_each_child() -> 
css_for_each_child() -> css_next_child() do I see the RCU iteration.

It would have been a lot clearer from the onset, if I had a hint 
syntactically:

	rcu_read_lock(&css->rcu_head);
	...
	rcu_read_unlock(&css->rcu_head);

beyond the reviewer bonus I bet this would allow some extra debugging 
as well (only enabled in debug kernels):

  - for example to make sure we only access a field if _that field_ is 
    RCU locked (reducing the chance of having the right locking for 
    the wrong reason)

  - we could possibly also build lockdep dependencies out of such 
    annotated RCU locking patterns.

  - RCU aware list walking primitives could auto-check that this 
    particular list is properly RCU locked.

This could be introduced gradually by using a different API name:

	rcu_lock(&css->rcu_head);
	...
	rcu_unlock(&css->rcu_head);

(the 'read' is implied in RCU locking anyway.)
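
A purely hypothetical sketch of what such wrappers could look like --
none of these names exist today, and in non-debug builds they would
simply collapse onto the current primitives:

/* Hypothetical API, for illustration only. */
#ifdef CONFIG_DEBUG_RCU_DOMAINS			/* made-up option */
# define rcu_lock(obj)	 do { rcu_read_lock();  rcu_note_domain(obj); } while (0)
# define rcu_unlock(obj) do { rcu_forget_domain(obj); rcu_read_unlock(); } while (0)
#else
# define rcu_lock(obj)	 do { (void)(obj); rcu_read_lock();  } while (0)
# define rcu_unlock(obj) do { (void)(obj); rcu_read_unlock(); } while (0)
#endif

where rcu_note_domain()/rcu_forget_domain() would be the (equally
hypothetical) per-task bookkeeping that RCU-aware list/tree walking
primitives could later cross-check.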

... and if you think this approach has any merit, I volunteer the perf 
and sched subsystems as guinea pigs! :-)

What do you think?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-13 19:47       ` Peter Zijlstra
@ 2015-04-14 10:26         ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-14 10:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux, Andrea Arcangeli,
	David Woodhouse, Rik van Riel, Michel Lespinasse


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 13, 2015 at 09:17:50PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 13, 2015 at 06:32:02PM +0200, Ingo Molnar wrote:
> > > 
> > > Btw., I realize this is just a sample, but couldn't this be written 
> > > more optimally as:
> > > 
> > > 	do {
> > > 		seq = READ_ONCE(latch->seq);
> > > 		smp_read_barrier_depends();
> > > 
> > > 		idx = seq & 0x01;
> > > 		entry = data_query(latch->data[idx], ...);
> > > 
> > > 		smp_rmb();
> > > 	} while (seq != latch->seq);
> > > 
> 
> > Should we look at introducing yet another seq primitive?
> 
> Like so?
> 
> ---
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -233,6 +233,11 @@ static inline void raw_write_seqcount_en
>  	s->sequence++;
>  }
>  
> +static inline int raw_read_seqcount_latch(seqcount_t *s)
> +{
> +	return lockless_dereference(s->sequence);
> +}
> +
>  /**
>   * raw_write_seqcount_latch - redirect readers to even/odd copy
>   * @s: pointer to seqcount_t
> @@ -284,8 +289,7 @@ static inline void raw_write_seqcount_en
>   *	unsigned seq, idx;
>   *
>   *	do {
> - *		seq = latch->seq;
> - *		smp_rmb();
> + *		seq = lockless_dereference(latch->seq);
>   *
>   *		idx = seq & 0x01;
>   *		entry = data_query(latch->data[idx], ...);
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -393,7 +393,7 @@ static __always_inline u64 __ktime_get_f
>  	u64 now;
>  
>  	do {
> -		seq = raw_read_seqcount(&tkf->seq);
> +		seq = raw_read_seqcount_latch(&tkf->seq);
>  		tkr = tkf->base + (seq & 0x01);
>  		now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);
>  	} while (read_seqcount_retry(&tkf->seq, seq));

Sounds good to me!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree
  2015-04-13 16:49   ` Ingo Molnar
@ 2015-04-14 12:31     ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-14 12:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

On Mon, Apr 13, 2015 at 06:49:49PM +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:

> > +static __always_inline int
> > +mod_tree_comp(void *key, struct latch_tree_node *n)
> > +{
> > +	unsigned long val = (unsigned long)key;
> > +	unsigned long start, end;
> > +
> > +	end = start = __mod_tree_val(n);
> > +	end += __mod_tree_size(n);
> > +
> > +	if (val < start)
> > +		return -1;
> > +
> > +	if (val >= end)
> > +		return 1;
> > +
> > +	return 0;
> 
> So since we are counting nanoseconds, I suspect this could be written 
> more optimally as:
> 
> {
> 	unsigned long val = (unsigned long)key;
> 	unsigned long start, end;
> 
> 	start = __mod_tree_val(n);
> 	if (val < start)
> 		return -1;
> 
> 	end = start + __mod_tree_size(n);
> 	if (val >= end)
> 		return 1;
> 
> 	return 0;
> }
> 
> right?

I was afraid it would rip apart the common bits of
__mod_tree_{val,size}(), iow. it would end up doing the whole
latch_tree_node -> mod_tree_node -> mod and mtn_init comparison dance
twice.

But GCC does the right thing, so yes.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 10/10] module: Rework module_addr_{min,max}
  2015-04-13 16:56   ` Ingo Molnar
  2015-04-14  2:55     ` Rusty Russell
@ 2015-04-14 12:56     ` Peter Zijlstra
  2015-04-14 13:00       ` Ingo Molnar
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-14 12:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

On Mon, Apr 13, 2015 at 06:56:36PM +0200, Ingo Molnar wrote:
> > + * Bounds of module allocation, for speeding up __module_address.
> > + * Protected by module_mutex.
> > + */
> > +static unsigned long module_addr_min = -1UL, module_addr_max = 0;
> 
> I suspect the same .data vs. .bss problem affects the #else branch as 
> well?

Yes, but the linear walk already has a 'problem' beyond the walk itself
being one: the list_head isn't actually on the same cacheline as the
'key' entries -- although I suppose I could fix that for the
!CONFIG_MODULES_TREE_LOOKUP case.

> If so then it would make sense IMHO to put the structure definition 
> into generic code so that both variants benefit from the shared 
> cacheline?

Isn't this optimizing hopeless code? I mean, I can make the change;
something like the below. Although I suppose we should use
____cacheline_aligned here and just take the false sharing.

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -230,11 +230,15 @@ static struct module *mod_find(unsigned
 
 #else /* MODULES_TREE_LOOKUP */
 
-/*
- * Bounds of module allocation, for speeding up __module_address.
- * Protected by module_mutex.
- */
-static unsigned long module_addr_min = -1UL, module_addr_max = 0;
+static struct mod_bounds {
+	unsigned long addr_min;
+	unsigned long addr_max;
+} mod_bounds __cacheline_aligned = {
+	.addr_min = -1UL,
+};
+
+#define module_addr_min mod_bounds.addr_min
+#define module_addr_max mod_bounds.addr_max
 
 static void mod_tree_insert(struct module *mod) { }
 static void mod_tree_remove_init(struct module *mod) { }
@@ -254,6 +258,10 @@ static struct module *mod_find(unsigned
 
 #endif /* MODULES_TREE_LOOKUP */
 
+/*
+ * Bounds of module text, for speeding up __module_address.
+ * Protected by module_mutex.
+ */
 static void __mod_update_bounds(void *base, unsigned int size)
 {
 	unsigned long min = (unsigned long)base;
@@ -3363,8 +3371,8 @@ static int add_unformed_module(struct mo
 		err = -EEXIST;
 		goto out;
 	}
-	list_add_rcu(&mod->list, &modules);
 	mod_update_bounds(mod);
+	list_add_rcu(&mod->list, &modules);
 	mod_tree_insert(mod);
 	err = 0;
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 10/10] module: Rework module_addr_{min,max}
  2015-04-14 12:56     ` Peter Zijlstra
@ 2015-04-14 13:00       ` Ingo Molnar
  0 siblings, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2015-04-14 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 13, 2015 at 06:56:36PM +0200, Ingo Molnar wrote:
> > > + * Bounds of module allocation, for speeding up __module_address.
> > > + * Protected by module_mutex.
> > > + */
> > > +static unsigned long module_addr_min = -1UL, module_addr_max = 0;
> > 
> > I suspect the same .data vs. .bss problem affects the #else branch as 
> > well?
> 
> Yes, but the linear walk already has a 'problem' beyond the walk itself
> being one: the list_head isn't actually on the same cacheline as the
> 'key' entries -- although I suppose I could fix that for the
> !CONFIG_MODULES_TREE_LOOKUP case.
> 
> > If so then it would make sense IMHO to put the structure definition 
> > into generic code so that both variants benefit from the shared 
> > cacheline?
> 
> Isn't this optimizing hopeless code? I mean, I can make the change; 
> something like the below. Although I suppose we should use 
> ____cacheline_aligned here and just take the false sharing.

Well, I think the point is to share more code and move the two 
variants closer to each other without hurting either side - not to 
optimize the slower side necessarily.

But I have no strong opinions either way!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-14 10:25             ` Ingo Molnar
@ 2015-04-14 13:04               ` Paul E. McKenney
  2015-04-14 14:31                 ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2015-04-14 13:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mathieu Desnoyers, Peter Zijlstra, Rusty Russell,
	Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen,
	Steven Rostedt, Thomas Gleixner, Lai Jiangshan, George Spelvin,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

On Tue, Apr 14, 2015 at 12:25:06PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Mon, Apr 13, 2015 at 11:21:46AM -0700, Linus Torvalds wrote:
> > > On Mon, Apr 13, 2015 at 10:43 AM, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > A shorthand for READ_ONCE + smp_read_barrier_depends() is the shiny
> > > > new lockless_dereference()
> > > 
> > > Related side note - I think people should get used to seeing
> > > "smp_load_acquire()". It has well-defined memory ordering properties
> > > and should generally perform well on most architectures. It's (much)
> > > stronger than lockless_dereference(), and together with
> > > smp_store_release() you can make rather clear guarantees about passing
> > > data locklessly from one CPU to another.
> > > 
> > > I'd like to see us use more of the pattern of
> > > 
> > >  - one thread does:
> > > 
> > >      .. allocate/create some data
> > >       smp_store_release() to "expose it"
> > > 
> > >  - another thread does:
> > > 
> > >       smp_load_acquire() to read index/pointer/flag/whatever
> > >       .. use the data any damn way you want ..
> > > 
> > > and we should probably aim to prefer that pattern over a lot of our
> > > traditional memory barriers.
> > 
> > I couldn't agree more!
> 
> /me too!
> 
> > RCU made a similar move from open-coding smp_read_barrier_depends() 
> > to using rcu_dereference() many years ago, and that change made RCU 
> > code -much- easier to read and understand.  I believe that moving 
> > from smp_mb(), smp_rmb(), and smp_wmb() to smp_store_release() and 
> > smp_load_acquire() will provide similar maintainability benefits. 
> > Furthermore, when the current code uses smp_mb(), 
> > smp_store_release() and smp_load_acquire() generate faster code on 
> > most architectures.
> 
> A similar maintainability argument can be made for locking: 
> spin_lock(x) was a big step forward compared to lock_kernel(), 
> primarily not because it improves scalability (it often does), but 
> because the '(x)' very clearly documents the data structure that is 
> being accessed and makes locking and data access bugs a lot more 
> visible in the review phase already.
> 
> I wish rcu_read_lock() had a data argument, for similar reasons - even 
> if it just pointed to a pre-existing lock or an rcu head it never 
> touches ;-)

Heh!  Jack Slingwine and I had that argument back in 1993.  I advocated
placing the update-side lock into the rcu_read_lock() equivalent, and
he responded by showing me use cases where (1) there were no update-side
locks and (2) there were many update-side locks, and it was impossible
to select just one on the read side.  ;-)

However, DYNIX/ptx did not have anything like rcu_dereference() or
list_for_each_entry_rcu(), which perhaps can be used in your example
below.  (Hey, that was 20 years ago, when 50MB was a lot of main
memory.  So we relied on compilers being quite dumb.)

> As an example I picked a random file out of the kernel that uses RCU: 
> kernel/cpuset.c::validate_change():
> 
> static int validate_change(struct cpuset *cur, struct cpuset *trial)
> {
> 	struct cgroup_subsys_state *css;
> 	struct cpuset *c, *par;
> 	int ret;
> 
> 	rcu_read_lock();
> 
> 	/* Each of our child cpusets must be a subset of us */
> 	ret = -EBUSY;
> 	cpuset_for_each_child(c, css, cur)
> 		if (!is_cpuset_subset(c, trial))
> 			goto out;
> 
> 	/* Remaining checks don't apply to root cpuset */
> 	ret = 0;
> 	if (cur == &top_cpuset)
> 		goto out;
> 
> 	par = parent_cs(cur);
> 
> 	/* On legacy hiearchy, we must be a subset of our parent cpuset. */
> 	ret = -EACCES;
> 	if (!cgroup_on_dfl(cur->css.cgroup) && !is_cpuset_subset(trial, par))
> 		goto out;
> 
> 	/*
> 	 * If either I or some sibling (!= me) is exclusive, we can't
> 	 * overlap
> 	 */
> 	ret = -EINVAL;
> 	cpuset_for_each_child(c, css, par) {
> 		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
> 		    c != cur &&
> 		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
> 			goto out;
> 		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
> 		    c != cur &&
> 		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
> 			goto out;
> 	}
> 
> 	/*
> 	 * Cpusets with tasks - existing or newly being attached - can't
> 	 * be changed to have empty cpus_allowed or mems_allowed.
> 	 */
> 	ret = -ENOSPC;
> 	if ((cgroup_has_tasks(cur->css.cgroup) || cur->attach_in_progress)) {
> 		if (!cpumask_empty(cur->cpus_allowed) &&
> 		    cpumask_empty(trial->cpus_allowed))
> 			goto out;
> 		if (!nodes_empty(cur->mems_allowed) &&
> 		    nodes_empty(trial->mems_allowed))
> 			goto out;
> 	}
> 
> 	/*
> 	 * We can't shrink if we won't have enough room for SCHED_DEADLINE
> 	 * tasks.
> 	 */
> 	ret = -EBUSY;
> 	if (is_cpu_exclusive(cur) &&
> 	    !cpuset_cpumask_can_shrink(cur->cpus_allowed,
> 				       trial->cpus_allowed))
> 		goto out;
> 
> 	ret = 0;
> out:
> 	rcu_read_unlock();
> 	return ret;
> }
> 
> So just from taking a glance at that function can you tell me what is 
> being RCU protected here? I cannot, I can only guess that it must 
> either be cpuset_for_each_child() or maybe the cpumasks or other 
> internals.
> 
> And if I search the file for call_rcu() it shows me nothing. Only if I 
> know that cpusets are integrated with cgroups and I search 
> kernel/cgroup.c for call_rcu(), do I find:
> 
>         call_rcu(&css->rcu_head, css_free_rcu_fn);
> 
> aha!
> 
> ... or if I drill down 3 levels into cpuset_for_each_child() -> 
> css_for_each_child() -> css_next_child() do I see the RCU iteration.

And I have felt that reviewing pain as well.

But shouldn't these API members be tagged with "_rcu" to make that
more clear?  Sort of like the difference between list_for_each_entry
and list_for_each_entry_rcu()?
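
(For readers following along, a trivial sketch of that naming
distinction -- the types below are made up:)

#include <linux/list.h>
#include <linux/rcupdate.h>

struct item {
	int val;
	struct list_head node;
};

static LIST_HEAD(items);	/* additions/removals under some lock */

static int sum_items(void)
{
	struct item *it;
	int sum = 0;

	rcu_read_lock();
	/* The _rcu suffix makes the RCU-protected traversal explicit. */
	list_for_each_entry_rcu(it, &items, node)
		sum += it->val;
	rcu_read_unlock();

	return sum;
}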

> It would have been a lot clearer from the onset, if I had a hint 
> syntactically:
> 
> 	rcu_read_lock(&css->rcu_head);
> 	...
> 	rcu_read_unlock(&css->rcu_head);

I cannot resist asking what you put there if the update side uses
synchronize_rcu()...  A NULL pointer?  A pointer to synchronize_rcu()?
Something else?  And what do you do in the not-uncommon case where
multiple RCU chains are being traversed in the same RCU read-side
critical section?  One approach would be to use varargs, I suppose.
Though with a hash table, list, or tree, you could have a -lot- of
->rcu_head structures to reference, and concurrent additions and deletions
mean that you wouldn't necessarily know which at rcu_read_lock() time.

> beyond the reviewer bonus I bet this would allow some extra debugging 
> as well (only enabled in debug kernels):
> 
>   - for example to make sure we only access a field if _that field_ is 
>     RCU locked (reducing the chance of having the right locking for 
>     the wrong reason)

One possibility would be to mark each traversal of an RCU-protected
pointer.  Currently, if a multilinked structure is inserted in one
shot, only the initial pointer to that structure needs to have
rcu_dereference().  Otherwise, it is hard to tell exactly how far
the RCU protection is to extend.  (Been having too much fun with
this sort of thing in the standards committees...)

>   - we could possibly also build lockdep dependencies out of such 
>     annotated RCU locking patterns.

Tell me more?

>   - RCU aware list walking primitives could auto-check that this 
>     particular list is properly RCU locked.

For example, that a lock in the proper update class was held during
the corresponding update?

> This could be introduced gradually by using a different API name:
> 
> 	rcu_lock(&css->rcu_head);
> 	...
> 	rcu_unlock(&css->rcu_head);
> 
> (the 'read' is implied in RCU locking anyway.)

Agreed, a new API would be needed for something like this.

> ... and if you think this approach has any merit, I volunteer the perf 
> and sched subsystems as guinea pigs! :-)

And rcutorture, not that it counts for much.

> What do you think?

I think that I don't yet fully understand your proposal.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-14 13:04               ` Paul E. McKenney
@ 2015-04-14 14:31                 ` Ingo Molnar
  2015-04-14 15:11                   ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2015-04-14 14:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Mathieu Desnoyers, Peter Zijlstra, Rusty Russell,
	Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen,
	Steven Rostedt, Thomas Gleixner, Lai Jiangshan, George Spelvin,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> > I wish rcu_read_lock() had a data argument, for similar reasons - 
> > even if it just pointed to a pre-existing lock or an rcu head it 
> > never touches ;-)
> 
> Heh!  Jack Slingwine and I had that argument back in 1993.  I 
> advocated placing the update-side lock into the rcu_read_lock() 
> equivalent, and he responded by showing me use cases where (1) 
> there were no update-side locks and (2) there were many update-side 
> locks, and it was impossible to select just one on the read side.  
> ;-)

So as a response I'd have attempted to hand-wave something about those 
scenarios being either not that common, or not that interesting?!! :-)

> However, DYNIX/ptx did not have anything like rcu_dereference() or 
> list_for_each_entry_rcu(), which perhaps can be used in your example 
> below.  (Hey, that was 20 years ago, when 50MB was a lot of main 
> memory.  So we relied on compilers being quite dumb.)

I guess compilers are still dumb in many ways ;-)

> > know that cpusets are integrated with cgroups and I search 
> > kernel/cgroup.c for call_rcu(), do I find:
> > 
> >         call_rcu(&css->rcu_head, css_free_rcu_fn);
> > 
> > aha!
> > 
> > ... or if I drill down 3 levels into cpuset_for_each_child() -> 
> > css_for_each_child() -> css_next_child() do I see the RCU 
> > iteration.
> 
> And I have felt that reviewing pain as well.
> 
> But shouldn't these API members be tagged with "_rcu" to make that 
> more clear?  Sort of like the difference between list_for_each_entry 
> and list_for_each_entry_rcu()?

Yes, agreed absolutely!

Having it as a syntactic element instead of a stylistic one forces 
such self-documentation though.

At the cost of being an extra nuisance.

> > It would have been a lot clearer from the onset, if I had a hint 
> > syntactically:
> > 
> > 	rcu_read_lock(&css->rcu_head);
> > 	...
> > 	rcu_read_unlock(&css->rcu_head);
> 
> I cannot resist asking what you put there if the update side uses 
> synchronize_rcu()...  A NULL pointer?  A pointer to 
> synchronize_rcu()? Something else? [...]

So I'd either put a suitable zero-size struct ('struct 
rcu_head_sync'?) into the protected data structure: i.e. still 
annotate it in an active fashion, but don't change any code.

This would be zero size in the non-debug case, but it would allow 
debugging data in the debug case.

Another solution would be to not require such linking in all cases, 
only when there's a single update side lock.

> [...]  And what do you do in the not-uncommon case where multiple 
> RCU chains are being traversed in the same RCU read-side critical 
> section?  One approach would be to use varargs, I suppose. Though 
> with a hash table, list, or tree, you could have a -lot- of 
> ->rcu_head structures to reference, and concurrent additions and 
> deletions mean that you wouldn't necessarily know which at 
> rcu_read_lock() time.

I'd definitely keep it simple - i.e. no varargs.

I'd just try to link with the data structure in general. Or I'd just 
forget about handling these cases altogether, at least initially - 
first see how it works out for the simpler cases.

> 
> > beyond the reviewer bonus I bet this would allow some extra debugging 
> > as well (only enabled in debug kernels):
> > 
> >   - for example to make sure we only access a field if _that field_ is 
> >     RCU locked (reducing the chance of having the right locking for 
> >     the wrong reason)
> 
> One possibility would be to mark each traversal of an RCU-protected 
> pointer.  Currently, if a multilinked structure is inserted in one 
> shot, only the initial pointer to that structure needs to have 
> rcu_dereference().  Otherwise, it is hard to tell exactly how far 
> the RCU protection is to extend.  (Been having too much fun with 
> this sort of thing in the standards committees...)
>
> >   - we could possibly also build lockdep dependencies out of such 
> >     annotated RCU locking patterns.
> 
> Tell me more?

So to backpedal a bit, I think that in practice the use of 
rcu_dereference*() gives us a lot of protection and documentation 
already - so I might be over-doing it.

But we could essentially split up the current monolithic 
rcu_read_lock() class and turn it into classes mirroring the update 
side lock classes.

This would at minimum give us access to CONFIG_LOCK_STAT statistics 
about the frequency of use of the various locks. Right now I can only 
see this:

/proc/lockdep:ffffffff82d27887 OPS: 1412000 FD:    1 BD:    1 ......: rcu_read_lock
/proc/lock_stat:                         rcu_read_lock-R:             0              0           0.00           0.00           0.00           0.00              0        1135745           0.00         969.66     1355676.33           1.19
/proc/lock_stat:                   rcu_read_lock_sched-R:             0              0           0.00           0.00           0.00           0.00              0             16           0.25          20.71          30.40           1.90
/proc/lock_stat:                      rcu_read_lock_bh-R:             0              0           0.00           0.00           0.00           0.00              0           5602           0.33         126.70       40478.88           7.23

which merges them into essentially a single counter.

If we had a more fine-grained structure we could tell more about usage 
patterns? Not sure how valuable that is though.

So for example on the SRCU side we could detect such deadlocks:

	mutex_lock(&mutex);
	synchronize_srcu(&srcu);

vs.

	srcu_read_lock(&srcu);
	mutex_lock(&mutex);

(I don't think we are detecting these right now.)

On the classical RCU side it's harder to construct deadlocks, because 
the read side is a non-sleeping and irq-safe primitive, while 
synchronize_rcu() is a sleeping method ;-) So most (all?) deadlock 
scenarios are avoided by just those properties.

Another use would be to potentially split up the RCU read side into 
multiple grace period domains, flushed independently, a bit like how 
SRCU does it? That would be a non-debugging use for it.
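
For reference, a minimal SRCU sketch (illustrative names only) of what
such a separate, independently flushed domain looks like today:

#include <linux/srcu.h>
#include <linux/slab.h>

struct blob {
	int a;
};

DEFINE_STATIC_SRCU(blob_srcu);		/* its own grace-period domain */
static struct blob __rcu *blob_ptr;

static int blob_read(void)
{
	struct blob *p;
	int idx, ret;

	idx = srcu_read_lock(&blob_srcu);
	p = srcu_dereference(blob_ptr, &blob_srcu);
	ret = p ? p->a : -1;
	srcu_read_unlock(&blob_srcu, idx);

	return ret;
}

static void blob_replace(struct blob *newp)
{
	struct blob *old = rcu_dereference_protected(blob_ptr, 1);

	rcu_assign_pointer(blob_ptr, newp);
	synchronize_srcu(&blob_srcu);	/* waits for blob_srcu readers only */
	kfree(old);
}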

> >   - RCU aware list walking primitives could auto-check that this 
> >     particular list is properly RCU locked.
> 
> For example, that a lock in the proper update class was held during 
> the corresponding update?

Yes, and also on the lookup side: that a 'lock' of the proper type is 
read-held during lookup. (with a few special annotations for special 
cases where as a side effect of other things we may have proper RCU 
read side protection.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch()
  2015-04-14 14:31                 ` Ingo Molnar
@ 2015-04-14 15:11                   ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2015-04-14 15:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mathieu Desnoyers, Peter Zijlstra, Rusty Russell,
	Oleg Nesterov, Linux Kernel Mailing List, Andi Kleen,
	Steven Rostedt, Thomas Gleixner, Lai Jiangshan, George Spelvin,
	Andrea Arcangeli, David Woodhouse, Rik van Riel,
	Michel Lespinasse

On Tue, Apr 14, 2015 at 04:31:15PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > I wish rcu_read_lock() had a data argument, for similar reasons - 
> > > even if it just pointed to a pre-existing lock or an rcu head it 
> > > never touches ;-)
> > 
> > Heh!  Jack Slingwine and I had that argument back in 1993.  I 
> > advocated placing the update-side lock into the rcu_read_lock() 
> > equivalent, and he responded by showing me use cases where (1) 
> > there were no update-side locks and (2) there were many update-side 
> > locks, and it was impossible to select just one on the read side.  
> > ;-)
> 
> So as a response I'd have attempted to hand-wave something about those 
> scenarios being either not that common, or not that interesting?!! :-)

Unfortunately for me, Jack already had code actually implementing those
particular use cases.  ;-)

> > However, DYNIX/ptx did not have anything like rcu_dereference() or 
> > list_for_each_entry_rcu(), which perhaps can be used in your example 
> > below.  (Hey, that was 20 years ago, when 50MB was a lot of main 
> > memory.  So we relied on compilers being quite dumb.)
> 
> I guess compilers are still dumb in many ways ;-)

But often dumb in much more complicated ways.  ;-)

> > > know that cpusets are integrated with cgroups and I search 
> > > kernel/cgroup.c for call_rcu(), do I find:
> > > 
> > >         call_rcu(&css->rcu_head, css_free_rcu_fn);
> > > 
> > > aha!
> > > 
> > > ... or if I drill down 3 levels into cpuset_for_each_child() -> 
> > > css_for_each_child() -> css_next_child() do I see the RCU 
> > > iteration.
> > 
> > And I have felt that reviewing pain as well.
> > 
> > But shouldn't these API members be tagged with "_rcu" to make that 
> > more clear?  Sort of like the difference between list_for_each_entry 
> > and list_for_each_entry_rcu()?
> 
> Yes, agreed absolutely!
> 
> Having it as a syntactic element instead of a stylistic one forces 
> such self-documentation though.
> 
> At the cost of being an extra nuisance.

But this problem of important stuff happening multiple function calls
down the stack must be happening in contexts other than just RCU.
Perhaps we need something that can query the code?  I could almost make
cbmc do this, but it is a bit aggressive about pointers to functions,
among other things.  (The trick is to make the rcu_dereference() macros
be functions, and to have them all invoke the top-level function, and then
look at cbmc's complaints about recursion, which would trace the path.)

> > > It would have been a lot clearer from the onset, if I had a hint 
> > > syntactically:
> > > 
> > > 	rcu_read_lock(&css->rcu_head);
> > > 	...
> > > 	rcu_read_unlock(&css->rcu_head);
> > 
> > I cannot resist asking what you put there if the update side uses 
> > synchronize_rcu()...  A NULL pointer?  A pointer to 
> > synchronize_rcu()? Something else? [...]
> 
> So I'd either put a suitable zero-size struct ('struct 
> rcu_head_sync'?) into the protected data structure: i.e. still 
> annotate it in an active fashion, but don't change any code.
> 
> This would be zero size in the non-debug case, but it would allow 
> debugging data in the debug case.
> 
> Another solution would be to not require such linking in all cases, 
> only when there's a single update side lock.

Or we could have a global struct rcu_head_sync variable for each
such use, I guess.

> > [...]  And what do you do in the not-uncommon case where multiple 
> > RCU chains are being traversed in the same RCU read-side critical 
> > section?  One approach would be to use varargs, I suppose. Though 
> > with a hash table, list, or tree, you could have a -lot- of 
> > ->rcu_head structures to reference, and concurrent additions and 
> > deletions mean that you wouldn't necessarily know which at 
> > rcu_read_lock() time.
> 
> I'd definitely keep it simple - i.e. no varargs.
> 
> I'd just try to link with the data structure in general. Or I'd just 
> forget about handling these cases altogether, at least initially - 
> first see how it works out for the simpler cases.

Or have nested rcu_read_lock() calls, one for each protected data
structure.  Sorry, couldn't resist.  ;-)

> > > beyond the reviewer bonus I bet this would allow some extra debugging 
> > > as well (only enabled in debug kernels):
> > > 
> > >   - for example to make sure we only access a field if _that field_ is 
> > >     RCU locked (reducing the chance of having the right locking for 
> > >     the wrong reason)
> > 
> > One possibility would be to mark each traversal of an RCU-protected 
> > pointer.  Currently, if a multilinked structure is inserted in one 
> > shot, only the initial pointer to that structure needs to have 
> > rcu_dereference().  Otherwise, it is hard to tell exactly how far 
> > the RCU protection is to extend.  (Been having too much fun with 
> > this sort of thing in the standards committees...)
> >
> > >   - we could possibly also build lockdep dependencies out of such 
> > >     annotated RCU locking patterns.
> > 
> > Tell me more?
> 
> So to backpedal a bit, I think that in practice the use of 
> rcu_dereference*() gives us a lot of protection and documentation 
> already - so I might be over-doing it.

One alternative would be a tool that could dig out the rcu_dereference()
calls that could be reached from a given region of code -- my cbmc
example above.  Another would be to instrument the rcu_dereference() calls
somehow, though I am not yet exactly sure how -- perhaps automatically
making an association (at runtime) between a given rcu_dereference()
and all the rcu_read_lock() calls that protected it.

> But we could essentially split up the current monolithic 
> rcu_read_lock() class and turn it into classes mirroring the update 
> side lock classes.
> 
> This would at minimum give us access to CONFIG_LOCK_STAT statistics 
> about the frequency of use of the various locks. Right now I can only 
> see this:
> 
> /proc/lockdep:ffffffff82d27887 OPS: 1412000 FD:    1 BD:    1 ......: rcu_read_lock
> /proc/lock_stat:                         rcu_read_lock-R:             0              0           0.00           0.00           0.00           0.00              0        1135745           0.00         969.66     1355676.33           1.19
> /proc/lock_stat:                   rcu_read_lock_sched-R:             0              0           0.00           0.00           0.00           0.00              0             16           0.25          20.71          30.40           1.90
> /proc/lock_stat:                      rcu_read_lock_bh-R:             0              0           0.00           0.00           0.00           0.00              0           5602           0.33         126.70       40478.88           7.23
> 
> which merges them into essentially a single counter.
> 
> If we had a more fine-grained structure we could tell more about usage 
> patterns? Not sure how valuable that is though.

If we could generate the fine-grained view automatically, it seems to
me that it could be worthwhile.  I am a bit concerned about any approach
that requires accurate manual annotation of each of the 1900 instances
of rcu_read_lock() currently in the kernel.

> So for example on the SRCU side we could detect such deadlocks:
> 
> 	mutex_lock(&mutex);
> 	synchronize_srcu(&srcu);
> 
> vs.
> 
> 	srcu_read_lock(&srcu);
> 	mutex_lock(&mutex);
> 
> (I don't think we are detecting these right now.)

For one thing, we would have to reject this sort of false positive:

	read_lock(&rw_lock);
	...
	srcu_read_lock(&srcu);
	...
	srcu_read_unlock(&srcu);
	...
	read_unlock(&rw_lock);

vs.

	srcu_read_lock(&srcu);
	...
	read_lock(&rw_lock);
	...
	read_unlock(&rw_lock);
	...
	srcu_read_unlock(&srcu);

But yes, I suspect that we have been getting lucky because SRCU is not
yet used heavily.

> On the classical RCU side it's harder to construct deadlocks, because 
> the read side is a non-sleeping and irq-safe primitive, while 
> synchronize_rcu() is a sleeping method ;-) So most (all?) deadlock 
> scenarios are avoided by just those properties.

I can't think of a deadlock that would not be detected by those
properties.

> Another use would be to potentially split up the RCU read side into 
> multiple grace period domains, flushed independently, a bit like how 
> SRCU does it? That would be a non-debugging use for it.

Well, if we could do a lockdep-like association of the rcu_dereference()
calls with the rcu_read_lock() calls that protected them, would the
resulting clusters of association give us what we need?  Couldn't
we then dump the lockdep sysfs files and see which rcu_dereference()
calls were protected by a given rcu_read_lock() call?

> > >   - RCU aware list walking primitives could auto-check that this 
> > >     particular list is properly RCU locked.
> > 
> > For example, that a lock in the proper update class was held during 
> > the corresponding update?
> 
> Yes, and also on the lookup side: that a 'lock' of the proper type is 
> read-held during lookup. (with a few special annotations for special 
> cases where as a side effect of other things we may have proper RCU 
> read side protection.)

Hmmm...  Suppose we also used lockdep to associate rcu_assign_pointer()
with the locks protecting it.  We could easily find the set of
rcu_dereference() calls accessing that same RCU-protected pointer, and
then associate the update-side locks with the RCU read-side critical
sections.

Might be a very large pile of data in some cases, though!  Which is
all the more reason for doing the associations automatically.
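
For what it's worth, the closest existing per-pointer annotation is
probably rcu_dereference_check() plus lockdep_is_held(); a sketch with
made-up names:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct cfg {
	int val;
};

static DEFINE_SPINLOCK(cfg_lock);
static struct cfg __rcu *cur_cfg;

static void cfg_update(struct cfg *newp)
{
	spin_lock(&cfg_lock);
	/* Freeing of the old cfg (via call_rcu()) is omitted here. */
	rcu_assign_pointer(cur_cfg, newp);
	spin_unlock(&cfg_lock);
}

static int cfg_read(void)
{
	struct cfg *p;
	int ret;

	rcu_read_lock();
	/*
	 * PROVE_RCU complains unless we are in an RCU read-side
	 * critical section or hold cfg_lock.
	 */
	p = rcu_dereference_check(cur_cfg, lockdep_is_held(&cfg_lock));
	ret = p ? p->val : -1;
	rcu_read_unlock();

	return ret;
}

But that still only documents individual dereferences, not the whole
read-side critical section.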

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-04-14  6:41   ` Peter Zijlstra
@ 2015-04-15  4:41     ` Rusty Russell
  2015-04-15  9:41       ` Peter Zijlstra
  2015-05-28  2:07     ` Rusty Russell
  1 sibling, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2015-04-15  4:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, Apr 14, 2015 at 12:27:05PM +0930, Rusty Russell wrote:
>
>> I was tempted to sneak in those module rcu fixes for 4.1, but seeing
>> Ingo's comments I'll wait for 4.2.
>
> I can get you a new version of that if you want. See below. The fixups
> are unmodified of the posting (patches 2,3).

Another last-minute module-next patch just blew up some archs, so I'm
erring back to being conservative :)

I've put those three patches (2,3,1 for bisectability) into my
pending-rebases tree for now.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-04-15  4:41     ` Rusty Russell
@ 2015-04-15  9:41       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2015-04-15  9:41 UTC (permalink / raw)
  To: Rusty Russell
  Cc: mingo, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

On Wed, Apr 15, 2015 at 02:11:37PM +0930, Rusty Russell wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > On Tue, Apr 14, 2015 at 12:27:05PM +0930, Rusty Russell wrote:
> >
> >> I was tempted to sneak in those module rcu fixes for 4.1, but seeing
> >> Ingo's comments I'll wait for 4.2.
> >
> > I can get you a new version of that if you want. See below. The fixups
> > are unmodified of the posting (patches 2,3).
> 
> Another last-minute module-next patch just blew up some archs, so I'm
> erring back to being conservative :)

Hehe ;) Fair enough, no rush.

Thanks

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-04-14  6:41   ` Peter Zijlstra
  2015-04-15  4:41     ` Rusty Russell
@ 2015-05-28  2:07     ` Rusty Russell
  2015-05-28 11:20       ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Rusty Russell @ 2015-05-28  2:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, Apr 14, 2015 at 12:27:05PM +0930, Rusty Russell wrote:

Hmm, this blows up:

> +static void module_assert_mutex_or_preempt(void)
> +{
> +#ifdef CONFIG_LOCKDEP
> +	if (!unlikely(debug_locks))
> +		return;
> +
> +	WARN_ON(!rcu_held_lock_sched_held() &&
> +		!lockdep_is_held(&module_mutex));

That's rcu_read_lock_sched_held, not rcu_held_lock_sched_held().

Also, your unlikely is weird and backwards.

I changed it as below, and folded.  It's in modules-next.

I'm now going to do some *actual* testing, and I'll go all Torvalds on
you if this spews warnings...

Cheers,
Rusty.

diff --git a/kernel/module.c b/kernel/module.c
index 4a89b88b4428..a15899e00ca9 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -113,10 +113,10 @@ static void module_assert_mutex(void)
 static void module_assert_mutex_or_preempt(void)
 {
 #ifdef CONFIG_LOCKDEP
-	if (!unlikely(debug_locks))
+	if (unlikely(!debug_locks))
 		return;
 
-	WARN_ON(!rcu_held_lock_sched_held() &&
+	WARN_ON(!rcu_read_lock_sched_held() &&
 		!lockdep_is_held(&module_mutex));
 #endif
 }


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-05-28  2:07     ` Rusty Russell
@ 2015-05-28 11:20       ` Peter Zijlstra
  2015-05-28 23:49         ` Rusty Russell
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2015-05-28 11:20 UTC (permalink / raw)
  To: Rusty Russell
  Cc: mingo, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, laijs, linux

On Thu, May 28, 2015 at 11:37:17AM +0930, Rusty Russell wrote:
> That's rcu_read_lock_sched_held, not rcu_held_lock_sched_held().
> 
> Also, your unlikely is weird and backwards.
> 
> I changed it as below, and folded.  It's in modules-next.
> 
> I'm now going to do some *actual* testing, and I'll go all Torvalds on
> you if this spews warnings...

Fair enough, I did see the build robot fail on this. How come those only
showed up now? This patch has been in your git tree for a long time; you
picked up the rcu validation patches before the other ones.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/10] latched RB-trees and __module_address()
  2015-05-28 11:20       ` Peter Zijlstra
@ 2015-05-28 23:49         ` Rusty Russell
  0 siblings, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2015-05-28 23:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kbuild test robot, linux-kernel

Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, May 28, 2015 at 11:37:17AM +0930, Rusty Russell wrote:
>> That's rcu_read_lock_sched_held, not rcu_held_lock_sched_held().
>> 
>> Also, your unlikely is weird and backwards.
>> 
>> I changed it as below, and folded.  It's in modules-next.
>> 
>> I'm now going to do some *actual* testing, and I'll go all Torvalds on
>> you if this spews warnings...
>
> Fair enough, I did see the build robot fail on this. How come those only
> showed up now? This patch has been in your git tree for a long time; you
> picked up the rcu validation patches before the other ones.

Indeed...

Fengguang, is the kbuild test robot checking my pending-rebases tree?
It's probably going to produce some noise (since that's where I drop
unreviewed patches, too), but that's OK.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2015-05-29  0:44 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-13 14:11 [PATCH v5 00/10] latched RB-trees and __module_address() Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 01/10] module: Sanitize RCU usage and locking Peter Zijlstra
2015-04-13 15:32   ` Ingo Molnar
2015-04-13 15:40     ` Peter Zijlstra
2015-04-13 16:32       ` Ingo Molnar
2015-04-13 14:11 ` [PATCH v5 02/10] module: Annotate module version magic Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 03/10] module, jump_label: Fix module locking Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 04/10] rbtree: Make lockless searches non-fatal Peter Zijlstra
2015-04-13 15:50   ` Ingo Molnar
2015-04-13 19:10     ` Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 05/10] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
2015-04-13 16:32   ` Ingo Molnar
2015-04-13 17:08     ` Mathieu Desnoyers
2015-04-13 17:43       ` Paul E. McKenney
2015-04-13 18:21         ` Linus Torvalds
2015-04-13 18:42           ` Paul E. McKenney
2015-04-14 10:25             ` Ingo Molnar
2015-04-14 13:04               ` Paul E. McKenney
2015-04-14 14:31                 ` Ingo Molnar
2015-04-14 15:11                   ` Paul E. McKenney
2015-04-13 19:17     ` Peter Zijlstra
2015-04-13 19:47       ` Peter Zijlstra
2015-04-14 10:26         ` Ingo Molnar
2015-04-13 14:11 ` [PATCH v5 06/10] rbtree: Implement generic latch_tree Peter Zijlstra
2015-04-13 16:43   ` Ingo Molnar
2015-04-13 19:20     ` Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 07/10] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
2015-04-13 16:49   ` Ingo Molnar
2015-04-14 12:31     ` Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 08/10] module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING Peter Zijlstra
2015-04-13 16:53   ` Ingo Molnar
2015-04-13 14:11 ` [PATCH v5 09/10] module: Use __module_address() for module_address_lookup() Peter Zijlstra
2015-04-13 14:11 ` [PATCH v5 10/10] module: Rework module_addr_{min,max} Peter Zijlstra
2015-04-13 16:56   ` Ingo Molnar
2015-04-14  2:55     ` Rusty Russell
2015-04-14  6:42       ` Peter Zijlstra
2015-04-14 12:56     ` Peter Zijlstra
2015-04-14 13:00       ` Ingo Molnar
2015-04-13 17:02 ` [PATCH v5 00/10] latched RB-trees and __module_address() Ingo Molnar
2015-04-14  2:57 ` Rusty Russell
2015-04-14  6:41   ` Peter Zijlstra
2015-04-15  4:41     ` Rusty Russell
2015-04-15  9:41       ` Peter Zijlstra
2015-05-28  2:07     ` Rusty Russell
2015-05-28 11:20       ` Peter Zijlstra
2015-05-28 23:49         ` Rusty Russell
