linux-arch.vger.kernel.org archive mirror
* [RFC][PATCH 0/7] kprobes: Make kretprobes lockless
@ 2020-08-27 16:12 Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 1/7] llist: Add nonatomic __llist_add() Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz

Hi,

These are on top of Masami's generic kretprobe handler patches (v1):

  https://lkml.kernel.org/r/159844957216.510284.17683703701627367133.stgit@devnote2

They are very lightly tested; esp. the freelist stuff has basically only been
translated to kernel C without much thinking (while attending LPC). All bugs
are undoubtedly mine and not Cameron's.

Changelogs are terse or nonexistent :/




* [RFC][PATCH 1/7] llist: Add nonatomic __llist_add()
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 2/7] sched: Fix try_invoke_on_locked_down_task() semantics Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz



Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/gpu/drm/i915/i915_request.c |    6 ------
 include/linux/llist.h               |   15 +++++++++++++++
 2 files changed, 15 insertions(+), 6 deletions(-)

--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -357,12 +357,6 @@ void i915_request_retire_upto(struct i91
 	} while (i915_request_retire(tmp) && tmp != rq);
 }
 
-static void __llist_add(struct llist_node *node, struct llist_head *head)
-{
-	node->next = head->first;
-	head->first = node;
-}
-
 static struct i915_request * const *
 __engine_active(struct intel_engine_cs *engine)
 {
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -197,6 +197,16 @@ static inline struct llist_node *llist_n
 extern bool llist_add_batch(struct llist_node *new_first,
 			    struct llist_node *new_last,
 			    struct llist_head *head);
+
+static inline bool __llist_add_batch(struct llist_node *new_first,
+				     struct llist_node *new_last,
+				     struct llist_head *head)
+{
+	new_last->next = head->first;
+	head->first = new_first;
+	return new_last->next == NULL;
+}
+
 /**
  * llist_add - add a new entry
  * @new:	new entry to be added
@@ -209,6 +219,11 @@ static inline bool llist_add(struct llis
 	return llist_add_batch(new, new, head);
 }
 
+static inline bool __llist_add(struct llist_node *new, struct llist_head *head)
+{
+	return __llist_add_batch(new, new, head);
+}
+
 /**
  * llist_del_all - delete all entries from lock-less list
  * @head:	the head of lock-less list to delete all entries
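
[ Not part of the patch -- a minimal usage sketch of the new helper. The
  non-atomic __llist_add() is only safe when a single context owns the
  list head, e.g. a per-task list that only the owning task ever pushes
  to (which is how the later kretprobe patches use it for
  current->kretprobe_instances). 'struct my_item' and 'push_own_item'
  are made-up names for illustration: ]

	#include <linux/llist.h>

	struct my_item {
		struct llist_node llnode;
		int payload;
	};

	/* only ever called by the context that owns @list */
	static void push_own_item(struct llist_head *list, struct my_item *it)
	{
		/* no cmpxchg loop needed: there are no concurrent adders */
		__llist_add(&it->llnode, list);
	}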




* [RFC][PATCH 2/7] sched: Fix try_invoke_on_locked_down_task() semantics
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 1/7] llist: Add nonatomic __llist_add() Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 3/7] kprobes: Remove kretprobe hash Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3006,15 +3006,14 @@ try_to_wake_up(struct task_struct *p, un
  */
 bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
 {
-	bool ret = false;
 	struct rq_flags rf;
+	bool ret = false;
 	struct rq *rq;
 
-	lockdep_assert_irqs_enabled();
-	raw_spin_lock_irq(&p->pi_lock);
+	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (p->on_rq) {
 		rq = __task_rq_lock(p, &rf);
-		if (task_rq(p) == rq)
+		if (task_rq(p) == rq && rq->curr != p)
 			ret = func(p, arg);
 		rq_unlock(rq, &rf);
 	} else {
@@ -3028,7 +3027,7 @@ bool try_invoke_on_locked_down_task(stru
 				ret = func(p, arg);
 		}
 	}
-	raw_spin_unlock_irq(&p->pi_lock);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 	return ret;
 }
 




* [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 1/7] llist: Add nonatomic __llist_add() Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 2/7] sched: Fix try_invoke_on_locked_down_task() semantics Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-27 18:00   ` Masami Hiramatsu
                     ` (2 more replies)
  2020-08-27 16:12 ` [RFC][PATCH 4/7] kprobe: Don't kfree() from breakpoint context Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz

The kretprobe hash is mostly superfluous; replace it with a per-task
variable.

This gets rid of the task hash and its related locking.

The whole invalidate_rp_inst() is tedious and could go away once we
drop the rp-specific ri size.
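
[ A condensed sketch of the resulting flow, lifted from the diff below:
  the entry handler pushes onto a task-local stack, and exit-time
  flushing simply detaches that stack, so no hash lookup or bucket lock
  is needed. ]

	/* pre_handler_kretprobe() */
	__llist_add(&ri->llist, &current->kretprobe_instances);

	/* kprobe_flush_task(): detach and recycle everything */
	node = current->kretprobe_instances.first;
	current->kretprobe_instances.first = NULL;
	while (node) {
		ri = container_of(node, struct kretprobe_instance, llist);
		node = node->next;
		recycle_rp_inst(ri, &empty_rp);
	}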

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kprobes.h |    5 -
 include/linux/sched.h   |    4 
 kernel/fork.c           |    4 
 kernel/kprobes.c        |  239 +++++++++++++++++++-----------------------------
 4 files changed, 110 insertions(+), 142 deletions(-)

--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -156,7 +156,10 @@ struct kretprobe {
 };
 
 struct kretprobe_instance {
-	struct hlist_node hlist;
+	union {
+		struct llist_node llist;
+		struct hlist_node hlist;
+	};
 	struct kretprobe *rp;
 	kprobe_opcode_t *ret_addr;
 	struct task_struct *task;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1315,6 +1315,10 @@ struct task_struct {
 	struct callback_head		mce_kill_me;
 #endif
 
+#ifdef CONFIG_KRETPROBES
+	struct llist_head               kretprobe_instances;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2161,6 +2161,10 @@ static __latent_entropy struct task_stru
 	INIT_LIST_HEAD(&p->thread_group);
 	p->task_works = NULL;
 
+#ifdef CONFIG_KRETPROBES
+	p->kretprobe_instances.first = NULL;
+#endif
+
 	/*
 	 * Ensure that the cgroup subsystem policies allow the new process to be
 	 * forked. It should be noted the the new process's css_set can be changed
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -53,7 +53,6 @@ static int kprobes_initialized;
  * - RCU hlist traversal under disabling preempt (breakpoint handlers)
  */
 static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
-static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
 
 /* NOTE: change this value only with kprobe_mutex held */
 static bool kprobes_all_disarmed;
@@ -61,9 +60,6 @@ static bool kprobes_all_disarmed;
 /* This protects kprobe_table and optimizing_list */
 static DEFINE_MUTEX(kprobe_mutex);
 static DEFINE_PER_CPU(struct kprobe *, kprobe_instance) = NULL;
-static struct {
-	raw_spinlock_t lock ____cacheline_aligned_in_smp;
-} kretprobe_table_locks[KPROBE_TABLE_SIZE];
 
 kprobe_opcode_t * __weak kprobe_lookup_name(const char *name,
 					unsigned int __unused)
@@ -71,11 +67,6 @@ kprobe_opcode_t * __weak kprobe_lookup_n
 	return ((kprobe_opcode_t *)(kallsyms_lookup_name(name)));
 }
 
-static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash)
-{
-	return &(kretprobe_table_locks[hash].lock);
-}
-
 /* Blacklist -- list of struct kprobe_blacklist_entry */
 static LIST_HEAD(kprobe_blacklist);
 
@@ -1241,49 +1232,6 @@ void recycle_rp_inst(struct kretprobe_in
 }
 NOKPROBE_SYMBOL(recycle_rp_inst);
 
-void kretprobe_hash_lock(struct task_struct *tsk,
-			 struct hlist_head **head, unsigned long *flags)
-__acquires(hlist_lock)
-{
-	unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS);
-	raw_spinlock_t *hlist_lock;
-
-	*head = &kretprobe_inst_table[hash];
-	hlist_lock = kretprobe_table_lock_ptr(hash);
-	raw_spin_lock_irqsave(hlist_lock, *flags);
-}
-NOKPROBE_SYMBOL(kretprobe_hash_lock);
-
-static void kretprobe_table_lock(unsigned long hash,
-				 unsigned long *flags)
-__acquires(hlist_lock)
-{
-	raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash);
-	raw_spin_lock_irqsave(hlist_lock, *flags);
-}
-NOKPROBE_SYMBOL(kretprobe_table_lock);
-
-void kretprobe_hash_unlock(struct task_struct *tsk,
-			   unsigned long *flags)
-__releases(hlist_lock)
-{
-	unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS);
-	raw_spinlock_t *hlist_lock;
-
-	hlist_lock = kretprobe_table_lock_ptr(hash);
-	raw_spin_unlock_irqrestore(hlist_lock, *flags);
-}
-NOKPROBE_SYMBOL(kretprobe_hash_unlock);
-
-static void kretprobe_table_unlock(unsigned long hash,
-				   unsigned long *flags)
-__releases(hlist_lock)
-{
-	raw_spinlock_t *hlist_lock = kretprobe_table_lock_ptr(hash);
-	raw_spin_unlock_irqrestore(hlist_lock, *flags);
-}
-NOKPROBE_SYMBOL(kretprobe_table_unlock);
-
 struct kprobe kprobe_busy = {
 	.addr = (void *) get_kprobe,
 };
@@ -1313,25 +1261,28 @@ void kprobe_busy_end(void)
 void kprobe_flush_task(struct task_struct *tk)
 {
 	struct kretprobe_instance *ri;
-	struct hlist_head *head, empty_rp;
+	struct hlist_head empty_rp;
+	struct llist_node *node;
 	struct hlist_node *tmp;
-	unsigned long hash, flags = 0;
 
+	/* Early boot, not yet initialized. */
 	if (unlikely(!kprobes_initialized))
-		/* Early boot.  kretprobe_table_locks not yet initialized. */
 		return;
 
+	INIT_HLIST_HEAD(&empty_rp);
+
 	kprobe_busy_begin();
 
-	INIT_HLIST_HEAD(&empty_rp);
-	hash = hash_ptr(tk, KPROBE_HASH_BITS);
-	head = &kretprobe_inst_table[hash];
-	kretprobe_table_lock(hash, &flags);
-	hlist_for_each_entry_safe(ri, tmp, head, hlist) {
-		if (ri->task == tk)
-			recycle_rp_inst(ri, &empty_rp);
+	node = current->kretprobe_instances.first;
+	current->kretprobe_instances.first = NULL;
+
+	while (node) {
+		ri = container_of(node, struct kretprobe_instance, llist);
+		node = node->next;
+
+		recycle_rp_inst(ri, &empty_rp);
 	}
-	kretprobe_table_unlock(hash, &flags);
+
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
 		kfree(ri);
@@ -1352,24 +1303,70 @@ static inline void free_rp_inst(struct k
 	}
 }
 
-static void cleanup_rp_inst(struct kretprobe *rp)
+/* XXX all of this only exists because we have rp specific ri's */
+
+static bool __invalidate_rp_inst(struct task_struct *t, void *rp)
 {
-	unsigned long flags, hash;
+	struct llist_node *node = t->kretprobe_instances.first;
 	struct kretprobe_instance *ri;
-	struct hlist_node *next;
-	struct hlist_head *head;
+
+	while (node) {
+		ri = container_of(node, struct kretprobe_instance, llist);
+		node = node->next;
+
+		if (ri->rp == rp)
+			ri->rp = NULL;
+	}
+
+	return true;
+}
+
+struct invl_rp_ipi {
+	struct task_struct *task;
+	void *rp;
+	bool done;
+};
+
+static void __invalidate_rp_ipi(void *arg)
+{
+	struct invl_rp_ipi *iri = arg;
+
+	if (iri->task == current)
+		iri->done = __invalidate_rp_inst(iri->task, iri->rp);
+}
+
+static void invalidate_rp_inst(struct task_struct *t, struct kretprobe *rp)
+{
+	struct invl_rp_ipi iri = {
+		.task = t,
+		.rp = rp,
+		.done = false
+	};
+
+	for (;;) {
+		if (try_invoke_on_locked_down_task(t, __invalidate_rp_inst, rp))
+			return;
+
+		smp_call_function_single(task_cpu(t), __invalidate_rp_ipi, &iri, 1);
+		if (iri.done)
+			return;
+	}
+}
+
+static void cleanup_rp_inst(struct kretprobe *rp)
+{
+	struct task_struct *p, *t;
 
 	/* To avoid recursive kretprobe by NMI, set kprobe busy here */
 	kprobe_busy_begin();
-	for (hash = 0; hash < KPROBE_TABLE_SIZE; hash++) {
-		kretprobe_table_lock(hash, &flags);
-		head = &kretprobe_inst_table[hash];
-		hlist_for_each_entry_safe(ri, next, head, hlist) {
-			if (ri->rp == rp)
-				ri->rp = NULL;
-		}
-		kretprobe_table_unlock(hash, &flags);
+	rcu_read_lock();
+	for_each_process_thread(p, t) {
+		if (!t->kretprobe_instances.first)
+			continue;
+
+		invalidate_rp_inst(t, rp);
 	}
+	rcu_read_unlock();
 	kprobe_busy_end();
 
 	free_rp_inst(rp);
@@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han
 					unsigned long trampoline_address,
 					void *frame_pointer)
 {
+	kprobe_opcode_t *correct_ret_addr = NULL;
 	struct kretprobe_instance *ri = NULL;
-	struct hlist_head *head, empty_rp;
+	unsigned long orig_ret_address = 0;
+	struct llist_node *first, *node;
+	struct hlist_head empty_rp;
 	struct hlist_node *tmp;
-	unsigned long flags, orig_ret_address = 0;
-	kprobe_opcode_t *correct_ret_addr = NULL;
-	bool skipped = false;
 
 	INIT_HLIST_HEAD(&empty_rp);
-	kretprobe_hash_lock(current, &head, &flags);
 
-	/*
-	 * It is possible to have multiple instances associated with a given
-	 * task either because multiple functions in the call path have
-	 * return probes installed on them, and/or more than one
-	 * return probe was registered for a target function.
-	 *
-	 * We can handle this because:
-	 *     - instances are always pushed into the head of the list
-	 *     - when multiple return probes are registered for the same
-	 *	 function, the (chronologically) first instance's ret_addr
-	 *	 will be the real return address, and all the rest will
-	 *	 point to kretprobe_trampoline.
-	 */
-	hlist_for_each_entry(ri, head, hlist) {
-		if (ri->task != current)
-			/* another task is sharing our hash bucket */
-			continue;
-		/*
-		 * Return probes must be pushed on this hash list correct
-		 * order (same as return order) so that it can be popped
-		 * correctly. However, if we find it is pushed it incorrect
-		 * order, this means we find a function which should not be
-		 * probed, because the wrong order entry is pushed on the
-		 * path of processing other kretprobe itself.
-		 */
-		if (ri->fp != frame_pointer) {
-			if (!skipped)
-				pr_warn("kretprobe is stacked incorrectly. Trying to fixup.\n");
-			skipped = true;
-			continue;
-		}
+	first = node = current->kretprobe_instances.first;
+	while (node) {
+		ri = container_of(node, struct kretprobe_instance, llist);
 
-		orig_ret_address = (unsigned long)ri->ret_addr;
-		if (skipped)
-			pr_warn("%ps must be blacklisted because of incorrect kretprobe order\n",
-				ri->rp->kp.addr);
+		BUG_ON(ri->fp != frame_pointer);
 
-		if (orig_ret_address != trampoline_address)
+		orig_ret_address = (unsigned long)ri->ret_addr;
+		if (orig_ret_address != trampoline_address) {
 			/*
 			 * This is the real return address. Any other
 			 * instances associated with this task are for
 			 * other calls deeper on the call stack
 			 */
 			break;
+		}
+
+		node = node->next;
 	}
 
 	kretprobe_assert(ri, orig_ret_address, trampoline_address);
-
 	correct_ret_addr = ri->ret_addr;
-	hlist_for_each_entry_safe(ri, tmp, head, hlist) {
-		if (ri->task != current)
-			/* another task is sharing our hash bucket */
-			continue;
-		if (ri->fp != frame_pointer)
-			continue;
+
+	/* Unlink all nodes for this frame. */
+	current->kretprobe_instances.first = node->next;
+	node->next = NULL;
+
+	/* Run them..  */
+	while (first) {
+		ri = container_of(first, struct kretprobe_instance, llist);
+		node = first->next;
 
 		orig_ret_address = (unsigned long)ri->ret_addr;
 		if (ri->rp && ri->rp->handler) {
@@ -2011,17 +1982,9 @@ unsigned long __kretprobe_trampoline_han
 
 		recycle_rp_inst(ri, &empty_rp);
 
-		if (orig_ret_address != trampoline_address)
-			/*
-			 * This is the real return address. Any other
-			 * instances associated with this task are for
-			 * other calls deeper on the call stack
-			 */
-			break;
+		first = node;
 	}
 
-	kretprobe_hash_unlock(current, &flags);
-
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
 		kfree(ri);
@@ -2062,11 +2025,8 @@ static int pre_handler_kretprobe(struct
 
 		arch_prepare_kretprobe(ri, regs);
 
-		/* XXX(hch): why is there no hlist_move_head? */
-		INIT_HLIST_NODE(&ri->hlist);
-		kretprobe_table_lock(hash, &flags);
-		hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
-		kretprobe_table_unlock(hash, &flags);
+		__llist_add(&ri->llist, &current->kretprobe_instances);
+
 	} else {
 		rp->nmissed++;
 		raw_spin_unlock_irqrestore(&rp->lock, flags);
@@ -2551,11 +2511,8 @@ static int __init init_kprobes(void)
 
 	/* FIXME allocate the probe table, currently defined statically */
 	/* initialize all list heads */
-	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
+	for (i = 0; i < KPROBE_TABLE_SIZE; i++)
 		INIT_HLIST_HEAD(&kprobe_table[i]);
-		INIT_HLIST_HEAD(&kretprobe_inst_table[i]);
-		raw_spin_lock_init(&(kretprobe_table_locks[i].lock));
-	}
 
 	err = populate_kprobe_blacklist(__start_kprobe_blacklist,
 					__stop_kprobe_blacklist);




* [RFC][PATCH 4/7] kprobe: Don't kfree() from breakpoint context
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
                   ` (2 preceding siblings ...)
  2020-08-27 16:12 ` [RFC][PATCH 3/7] kprobes: Remove kretprobe hash Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 5/7] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz

Breakpoint context might not be a safe context for kfree(); push it
out to RCU.
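
[ The pattern in a nutshell, with a made-up 'struct foo' for
  illustration: embed an rcu_head (as kretprobe_instance now does via
  its union) and let kfree_rcu() defer the actual kfree() until after a
  grace period, which is safe from the breakpoint handler: ]

	struct foo {
		struct rcu_head rcu;
		int data;
	};

	static void drop_foo(struct foo *f)
	{
		/* queue for freeing; the kfree() runs from the RCU
		 * callback, not from this (possibly atomic) context */
		kfree_rcu(f, rcu);
	}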

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kprobes.h |    2 +-
 kernel/kprobes.c        |   27 ++++++---------------------
 2 files changed, 7 insertions(+), 22 deletions(-)

--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -159,6 +159,7 @@ struct kretprobe_instance {
 	union {
 		struct llist_node llist;
 		struct hlist_node hlist;
+		struct rcu_head rcu;
 	};
 	struct kretprobe *rp;
 	kprobe_opcode_t *ret_addr;
@@ -398,7 +399,6 @@ int register_kretprobes(struct kretprobe
 void unregister_kretprobes(struct kretprobe **rps, int num);
 
 void kprobe_flush_task(struct task_struct *tk);
-void recycle_rp_inst(struct kretprobe_instance *ri, struct hlist_head *head);
 
 int disable_kprobe(struct kprobe *kp);
 int enable_kprobe(struct kprobe *kp);
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1214,8 +1214,7 @@ void kprobes_inc_nmissed_count(struct kp
 }
 NOKPROBE_SYMBOL(kprobes_inc_nmissed_count);
 
-void recycle_rp_inst(struct kretprobe_instance *ri,
-		     struct hlist_head *head)
+static void recycle_rp_inst(struct kretprobe_instance *ri)
 {
 	struct kretprobe *rp = ri->rp;
 
@@ -1226,9 +1225,9 @@ void recycle_rp_inst(struct kretprobe_in
 		raw_spin_lock(&rp->lock);
 		hlist_add_head(&ri->hlist, &rp->free_instances);
 		raw_spin_unlock(&rp->lock);
-	} else
-		/* Unregistering */
-		hlist_add_head(&ri->hlist, head);
+	} else {
+		kfree_rcu(ri, rcu);
+	}
 }
 NOKPROBE_SYMBOL(recycle_rp_inst);
 
@@ -1261,15 +1260,12 @@ void kprobe_busy_end(void)
 void kprobe_flush_task(struct task_struct *tk)
 {
 	struct kretprobe_instance *ri;
-	struct hlist_head empty_rp;
 	struct llist_node *node;
-	struct hlist_node *tmp;
 
 	/* Early boot, not yet initialized. */
 	if (unlikely(!kprobes_initialized))
 		return;
 
-	INIT_HLIST_HEAD(&empty_rp);
 
 	kprobe_busy_begin();
 
@@ -1280,12 +1276,7 @@ void kprobe_flush_task(struct task_struc
 		ri = container_of(node, struct kretprobe_instance, llist);
 		node = node->next;
 
-		recycle_rp_inst(ri, &empty_rp);
-	}
-
-	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
-		hlist_del(&ri->hlist);
-		kfree(ri);
+		recycle_rp_inst(ri);
 	}
 
 	kprobe_busy_end();
@@ -1937,7 +1928,6 @@ unsigned long __kretprobe_trampoline_han
 	unsigned long orig_ret_address = 0;
 	struct llist_node *first, *node;
 	struct hlist_head empty_rp;
-	struct hlist_node *tmp;
 
 	INIT_HLIST_HEAD(&empty_rp);
 
@@ -1980,16 +1970,11 @@ unsigned long __kretprobe_trampoline_han
 			__this_cpu_write(current_kprobe, &kprobe_busy);
 		}
 
-		recycle_rp_inst(ri, &empty_rp);
+		recycle_rp_inst(ri);
 
 		first = node;
 	}
 
-	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
-		hlist_del(&ri->hlist);
-		kfree(ri);
-	}
-
 	return orig_ret_address;
 }
 NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)




* [RFC][PATCH 5/7] asm-generic/atomic: Add try_cmpxchg() fallbacks
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
                   ` (3 preceding siblings ...)
  2020-08-27 16:12 ` [RFC][PATCH 4/7] kprobe: Don't kfree() from breakpoint context Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
  2020-08-27 16:12 ` [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist Peter Zijlstra
  6 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz

Only x86 provides try_cmpxchg() outside of the atomic_t interfaces;
provide generic fallbacks to create this interface from the widely
available cmpxchg() function.
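
[ The difference for callers, sketched ('ptr', 'old', 'new' are
  placeholders): cmpxchg() returns the old value and the caller
  compares, whereas try_cmpxchg() returns a success boolean and updates
  the expected value in place on failure, which tightens the usual CAS
  loop: ]

	/* with cmpxchg() */
	old = READ_ONCE(*ptr);
	for (;;) {
		cur = cmpxchg(ptr, old, new);
		if (cur == old)
			break;
		old = cur;
	}

	/* with try_cmpxchg(); 'old' is refreshed on failure */
	old = READ_ONCE(*ptr);
	do {
		/* optionally recompute 'new' from 'old' here */
	} while (!try_cmpxchg(ptr, &old, new));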

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
---
 arch/x86/include/asm/atomic.h             |    2 
 arch/x86/include/asm/atomic64_64.h        |    2 
 arch/x86/include/asm/cmpxchg.h            |    2 
 include/asm-generic/atomic-instrumented.h |  216 ++++++++++++++++++------------
 include/linux/atomic-arch-fallback.h      |   90 +++++++++++-
 include/linux/atomic-fallback.h           |   90 +++++++++++-
 scripts/atomic/gen-atomic-fallback.sh     |   63 +++++++-
 scripts/atomic/gen-atomic-instrumented.sh |   29 +++-
 8 files changed, 372 insertions(+), 122 deletions(-)

--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -199,7 +199,7 @@ static __always_inline int arch_atomic_c
 
 static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
 {
-	return try_cmpxchg(&v->counter, old, new);
+	return arch_try_cmpxchg(&v->counter, old, new);
 }
 #define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
 
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -187,7 +187,7 @@ static inline s64 arch_atomic64_cmpxchg(
 
 static __always_inline bool arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
 {
-	return try_cmpxchg(&v->counter, old, new);
+	return arch_try_cmpxchg(&v->counter, old, new);
 }
 #define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
 
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -221,7 +221,7 @@ extern void __add_wrong_size(void)
 #define __try_cmpxchg(ptr, pold, new, size)				\
 	__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
 
-#define try_cmpxchg(ptr, pold, new) 					\
+#define arch_try_cmpxchg(ptr, pold, new) 				\
 	__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
 
 /*
--- a/include/asm-generic/atomic-instrumented.h
+++ b/include/asm-generic/atomic-instrumented.h
@@ -1642,148 +1642,192 @@ atomic64_dec_if_positive(atomic64_t *v)
 #endif
 
 #if !defined(arch_xchg_relaxed) || defined(arch_xchg)
-#define xchg(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_xchg(__ai_ptr, __VA_ARGS__);				\
+#define xchg(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_xchg(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_xchg_acquire)
-#define xchg_acquire(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_xchg_acquire(__ai_ptr, __VA_ARGS__);				\
+#define xchg_acquire(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_xchg_acquire(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_xchg_release)
-#define xchg_release(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_xchg_release(__ai_ptr, __VA_ARGS__);				\
+#define xchg_release(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_xchg_release(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_xchg_relaxed)
-#define xchg_relaxed(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_xchg_relaxed(__ai_ptr, __VA_ARGS__);				\
+#define xchg_relaxed(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_xchg_relaxed(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if !defined(arch_cmpxchg_relaxed) || defined(arch_cmpxchg)
-#define cmpxchg(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_cmpxchg_acquire)
-#define cmpxchg_acquire(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg_acquire(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_cmpxchg_release)
-#define cmpxchg_release(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg_release(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg_release(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg_release(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_cmpxchg_relaxed)
-#define cmpxchg_relaxed(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg_relaxed(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if !defined(arch_cmpxchg64_relaxed) || defined(arch_cmpxchg64)
-#define cmpxchg64(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg64(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg64(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg64(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_cmpxchg64_acquire)
-#define cmpxchg64_acquire(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg64_acquire(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg64_acquire(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg64_acquire(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_cmpxchg64_release)
-#define cmpxchg64_release(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg64_release(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg64_release(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg64_release(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
 #if defined(arch_cmpxchg64_relaxed)
-#define cmpxchg64_relaxed(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg64_relaxed(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__); \
 })
 #endif
 
-#define cmpxchg_local(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg_local(__ai_ptr, __VA_ARGS__);				\
+#if !defined(arch_try_cmpxchg_relaxed) || defined(arch_try_cmpxchg)
+#define try_cmpxchg(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_try_cmpxchg_acquire)
+#define try_cmpxchg_acquire(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_try_cmpxchg_release)
+#define try_cmpxchg_release(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_try_cmpxchg_relaxed)
+#define try_cmpxchg_relaxed(ptr, oldp, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	typeof(oldp) __ai_oldp = (oldp); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+	arch_try_cmpxchg_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+#endif
+
+#define cmpxchg_local(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg_local(__ai_ptr, __VA_ARGS__); \
 })
 
-#define cmpxchg64_local(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_cmpxchg64_local(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg64_local(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_cmpxchg64_local(__ai_ptr, __VA_ARGS__); \
 })
 
-#define sync_cmpxchg(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr));		\
-	arch_sync_cmpxchg(__ai_ptr, __VA_ARGS__);				\
+#define sync_cmpxchg(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+	arch_sync_cmpxchg(__ai_ptr, __VA_ARGS__); \
 })
 
-#define cmpxchg_double(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, 2 * sizeof(*__ai_ptr));		\
-	arch_cmpxchg_double(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg_double(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, 2 * sizeof(*__ai_ptr)); \
+	arch_cmpxchg_double(__ai_ptr, __VA_ARGS__); \
 })
 
 
-#define cmpxchg_double_local(ptr, ...)						\
-({									\
-	typeof(ptr) __ai_ptr = (ptr);					\
-	instrument_atomic_write(__ai_ptr, 2 * sizeof(*__ai_ptr));		\
-	arch_cmpxchg_double_local(__ai_ptr, __VA_ARGS__);				\
+#define cmpxchg_double_local(ptr, ...) \
+({ \
+	typeof(ptr) __ai_ptr = (ptr); \
+	instrument_atomic_write(__ai_ptr, 2 * sizeof(*__ai_ptr)); \
+	arch_cmpxchg_double_local(__ai_ptr, __VA_ARGS__); \
 })
 
 #endif /* _ASM_GENERIC_ATOMIC_INSTRUMENTED_H */
-// 89bf97f3a7509b740845e51ddf31055b48a81f40
+// ff0fe7f81ee97f01f13bb78b0e3ce800bc56d9dd
--- a/include/linux/atomic-arch-fallback.h
+++ b/include/linux/atomic-arch-fallback.h
@@ -9,9 +9,9 @@
 #include <linux/compiler.h>
 
 #ifndef arch_xchg_relaxed
-#define arch_xchg_relaxed		arch_xchg
-#define arch_xchg_acquire		arch_xchg
-#define arch_xchg_release		arch_xchg
+#define arch_xchg_acquire arch_xchg
+#define arch_xchg_release arch_xchg
+#define arch_xchg_relaxed arch_xchg
 #else /* arch_xchg_relaxed */
 
 #ifndef arch_xchg_acquire
@@ -32,9 +32,9 @@
 #endif /* arch_xchg_relaxed */
 
 #ifndef arch_cmpxchg_relaxed
-#define arch_cmpxchg_relaxed		arch_cmpxchg
-#define arch_cmpxchg_acquire		arch_cmpxchg
-#define arch_cmpxchg_release		arch_cmpxchg
+#define arch_cmpxchg_acquire arch_cmpxchg
+#define arch_cmpxchg_release arch_cmpxchg
+#define arch_cmpxchg_relaxed arch_cmpxchg
 #else /* arch_cmpxchg_relaxed */
 
 #ifndef arch_cmpxchg_acquire
@@ -55,9 +55,9 @@
 #endif /* arch_cmpxchg_relaxed */
 
 #ifndef arch_cmpxchg64_relaxed
-#define arch_cmpxchg64_relaxed		arch_cmpxchg64
-#define arch_cmpxchg64_acquire		arch_cmpxchg64
-#define arch_cmpxchg64_release		arch_cmpxchg64
+#define arch_cmpxchg64_acquire arch_cmpxchg64
+#define arch_cmpxchg64_release arch_cmpxchg64
+#define arch_cmpxchg64_relaxed arch_cmpxchg64
 #else /* arch_cmpxchg64_relaxed */
 
 #ifndef arch_cmpxchg64_acquire
@@ -77,6 +77,76 @@
 
 #endif /* arch_cmpxchg64_relaxed */
 
+#ifndef arch_try_cmpxchg_relaxed
+#ifdef arch_try_cmpxchg
+#define arch_try_cmpxchg_acquire arch_try_cmpxchg
+#define arch_try_cmpxchg_release arch_try_cmpxchg
+#define arch_try_cmpxchg_relaxed arch_try_cmpxchg
+#endif /* arch_try_cmpxchg */
+
+#ifndef arch_try_cmpxchg
+#define arch_try_cmpxchg(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg */
+
+#ifndef arch_try_cmpxchg_acquire
+#define arch_try_cmpxchg_acquire(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg_acquire((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg_acquire */
+
+#ifndef arch_try_cmpxchg_release
+#define arch_try_cmpxchg_release(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg_release((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg_release */
+
+#ifndef arch_try_cmpxchg_relaxed
+#define arch_try_cmpxchg_relaxed(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = arch_cmpxchg_relaxed((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* arch_try_cmpxchg_relaxed */
+
+#else /* arch_try_cmpxchg_relaxed */
+
+#ifndef arch_try_cmpxchg_acquire
+#define arch_try_cmpxchg_acquire(...) \
+	__atomic_op_acquire(arch_try_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_try_cmpxchg_release
+#define arch_try_cmpxchg_release(...) \
+	__atomic_op_release(arch_try_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_try_cmpxchg
+#define arch_try_cmpxchg(...) \
+	__atomic_op_fence(arch_try_cmpxchg, __VA_ARGS__)
+#endif
+
+#endif /* arch_try_cmpxchg_relaxed */
+
 #ifndef arch_atomic_read_acquire
 static __always_inline int
 arch_atomic_read_acquire(const atomic_t *v)
@@ -2288,4 +2358,4 @@ arch_atomic64_dec_if_positive(atomic64_t
 #endif
 
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// 90cd26cfd69d2250303d654955a0cc12620fb91b
+// cca554917d7ea73d5e3e7397dd70c484cad9b2c4
--- a/include/linux/atomic-fallback.h
+++ b/include/linux/atomic-fallback.h
@@ -9,9 +9,9 @@
 #include <linux/compiler.h>
 
 #ifndef xchg_relaxed
-#define xchg_relaxed		xchg
-#define xchg_acquire		xchg
-#define xchg_release		xchg
+#define xchg_acquire xchg
+#define xchg_release xchg
+#define xchg_relaxed xchg
 #else /* xchg_relaxed */
 
 #ifndef xchg_acquire
@@ -32,9 +32,9 @@
 #endif /* xchg_relaxed */
 
 #ifndef cmpxchg_relaxed
-#define cmpxchg_relaxed		cmpxchg
-#define cmpxchg_acquire		cmpxchg
-#define cmpxchg_release		cmpxchg
+#define cmpxchg_acquire cmpxchg
+#define cmpxchg_release cmpxchg
+#define cmpxchg_relaxed cmpxchg
 #else /* cmpxchg_relaxed */
 
 #ifndef cmpxchg_acquire
@@ -55,9 +55,9 @@
 #endif /* cmpxchg_relaxed */
 
 #ifndef cmpxchg64_relaxed
-#define cmpxchg64_relaxed		cmpxchg64
-#define cmpxchg64_acquire		cmpxchg64
-#define cmpxchg64_release		cmpxchg64
+#define cmpxchg64_acquire cmpxchg64
+#define cmpxchg64_release cmpxchg64
+#define cmpxchg64_relaxed cmpxchg64
 #else /* cmpxchg64_relaxed */
 
 #ifndef cmpxchg64_acquire
@@ -77,6 +77,76 @@
 
 #endif /* cmpxchg64_relaxed */
 
+#ifndef try_cmpxchg_relaxed
+#ifdef try_cmpxchg
+#define try_cmpxchg_acquire try_cmpxchg
+#define try_cmpxchg_release try_cmpxchg
+#define try_cmpxchg_relaxed try_cmpxchg
+#endif /* try_cmpxchg */
+
+#ifndef try_cmpxchg
+#define try_cmpxchg(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = cmpxchg((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg */
+
+#ifndef try_cmpxchg_acquire
+#define try_cmpxchg_acquire(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = cmpxchg_acquire((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg_acquire */
+
+#ifndef try_cmpxchg_release
+#define try_cmpxchg_release(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = cmpxchg_release((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg_release */
+
+#ifndef try_cmpxchg_relaxed
+#define try_cmpxchg_relaxed(_ptr, _oldp, _new) \
+({ \
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+	___r = cmpxchg_relaxed((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*___op = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg_relaxed */
+
+#else /* try_cmpxchg_relaxed */
+
+#ifndef try_cmpxchg_acquire
+#define try_cmpxchg_acquire(...) \
+	__atomic_op_acquire(try_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef try_cmpxchg_release
+#define try_cmpxchg_release(...) \
+	__atomic_op_release(try_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef try_cmpxchg
+#define try_cmpxchg(...) \
+	__atomic_op_fence(try_cmpxchg, __VA_ARGS__)
+#endif
+
+#endif /* try_cmpxchg_relaxed */
+
 #define arch_atomic_read atomic_read
 #define arch_atomic_read_acquire atomic_read_acquire
 
@@ -2522,4 +2592,4 @@ atomic64_dec_if_positive(atomic64_t *v)
 #endif
 
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// 9d95b56f98d82a2a26c7b79ccdd0c47572d50a6f
+// d78e6c293c661c15188f0ec05bce45188c8d5892
--- a/scripts/atomic/gen-atomic-fallback.sh
+++ b/scripts/atomic/gen-atomic-fallback.sh
@@ -144,15 +144,11 @@ gen_proto_order_variants()
 	printf "#endif /* ${basename}_relaxed */\n\n"
 }
 
-gen_xchg_fallbacks()
+gen_order_fallbacks()
 {
 	local xchg="$1"; shift
+
 cat <<EOF
-#ifndef ${xchg}_relaxed
-#define ${xchg}_relaxed		${xchg}
-#define ${xchg}_acquire		${xchg}
-#define ${xchg}_release		${xchg}
-#else /* ${xchg}_relaxed */
 
 #ifndef ${xchg}_acquire
 #define ${xchg}_acquire(...) \\
@@ -169,11 +165,62 @@ cat <<EOF
 	__atomic_op_fence(${xchg}, __VA_ARGS__)
 #endif
 
-#endif /* ${xchg}_relaxed */
+EOF
+}
+
+gen_xchg_fallbacks()
+{
+	local xchg="$1"; shift
+	printf "#ifndef ${xchg}_relaxed\n"
+
+	gen_basic_fallbacks ${xchg}
+
+	printf "#else /* ${xchg}_relaxed */\n"
+
+	gen_order_fallbacks ${xchg}
+
+	printf "#endif /* ${xchg}_relaxed */\n\n"
+}
+
+gen_try_cmpxchg_fallback()
+{
+	local order="$1"; shift;
+
+cat <<EOF
+#ifndef ${ARCH}try_cmpxchg${order}
+#define ${ARCH}try_cmpxchg${order}(_ptr, _oldp, _new) \\
+({ \\
+	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \\
+	___r = ${ARCH}cmpxchg${order}((_ptr), ___o, (_new)); \\
+	if (unlikely(___r != ___o)) \\
+		*___op = ___r; \\
+	likely(___r == ___o); \\
+})
+#endif /* ${ARCH}try_cmpxchg${order} */
 
 EOF
 }
 
+gen_try_cmpxchg_fallbacks()
+{
+	printf "#ifndef ${ARCH}try_cmpxchg_relaxed\n"
+	printf "#ifdef ${ARCH}try_cmpxchg\n"
+
+	gen_basic_fallbacks "${ARCH}try_cmpxchg"
+
+	printf "#endif /* ${ARCH}try_cmpxchg */\n\n"
+
+	for order in "" "_acquire" "_release" "_relaxed"; do
+		gen_try_cmpxchg_fallback "${order}"
+	done
+
+	printf "#else /* ${ARCH}try_cmpxchg_relaxed */\n"
+
+	gen_order_fallbacks "${ARCH}try_cmpxchg"
+
+	printf "#endif /* ${ARCH}try_cmpxchg_relaxed */\n\n"
+}
+
 cat << EOF
 // SPDX-License-Identifier: GPL-2.0
 
@@ -191,6 +238,8 @@ for xchg in "${ARCH}xchg" "${ARCH}cmpxch
 	gen_xchg_fallbacks "${xchg}"
 done
 
+gen_try_cmpxchg_fallbacks
+
 grep '^[a-z]' "$1" | while read name meta args; do
 	gen_proto "${meta}" "${name}" "${ARCH}" "atomic" "int" ${args}
 done
--- a/scripts/atomic/gen-atomic-instrumented.sh
+++ b/scripts/atomic/gen-atomic-instrumented.sh
@@ -103,14 +103,31 @@ gen_xchg()
 	local xchg="$1"; shift
 	local mult="$1"; shift
 
+	if [ "${xchg%${xchg#try_cmpxchg}}" = "try_cmpxchg" ] ; then
+
 cat <<EOF
-#define ${xchg}(ptr, ...)						\\
-({									\\
-	typeof(ptr) __ai_ptr = (ptr);					\\
-	instrument_atomic_write(__ai_ptr, ${mult}sizeof(*__ai_ptr));		\\
-	arch_${xchg}(__ai_ptr, __VA_ARGS__);				\\
+#define ${xchg}(ptr, oldp, ...) \\
+({ \\
+	typeof(ptr) __ai_ptr = (ptr); \\
+	typeof(oldp) __ai_oldp = (oldp); \\
+	instrument_atomic_write(__ai_ptr, ${mult}sizeof(*__ai_ptr)); \\
+	instrument_atomic_write(__ai_oldp, ${mult}sizeof(*__ai_oldp)); \\
+	arch_${xchg}(__ai_ptr, __ai_oldp, __VA_ARGS__); \\
 })
 EOF
+
+	else
+
+cat <<EOF
+#define ${xchg}(ptr, ...) \\
+({ \\
+	typeof(ptr) __ai_ptr = (ptr); \\
+	instrument_atomic_write(__ai_ptr, ${mult}sizeof(*__ai_ptr)); \\
+	arch_${xchg}(__ai_ptr, __VA_ARGS__); \\
+})
+EOF
+
+	fi
 }
 
 gen_optional_xchg()
@@ -160,7 +177,7 @@ grep '^[a-z]' "$1" | while read name met
 	gen_proto "${meta}" "${name}" "atomic64" "s64" ${args}
 done
 
-for xchg in "xchg" "cmpxchg" "cmpxchg64"; do
+for xchg in "xchg" "cmpxchg" "cmpxchg64" "try_cmpxchg"; do
 	for order in "" "_acquire" "_release" "_relaxed"; do
 		gen_optional_xchg "${xchg}" "${order}"
 	done




* [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
                   ` (4 preceding siblings ...)
  2020-08-27 16:12 ` [RFC][PATCH 5/7] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-27 16:37   ` peterz
                     ` (3 more replies)
  2020-08-27 16:12 ` [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist Peter Zijlstra
  6 siblings, 4 replies; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz



Cc: cameron@moodycamel.com
Cc: oleg@redhat.com
Cc: will@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/freelist.h |  129 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

--- /dev/null
+++ b/include/linux/freelist.h
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+#ifndef FREELIST_H
+#define FREELIST_H
+
+#include <linux/atomic.h>
+
+/*
+ * Copyright: cameron@moodycamel.com
+ *
+ * A simple CAS-based lock-free free list. Not the fastest thing in the world
+ * under heavy contention, but simple and correct (assuming nodes are never
+ * freed until after the free list is destroyed), and fairly speedy under low
+ * contention.
+ *
+ * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
+ */
+
+struct freelist_node {
+	atomic_t		refs;
+	struct freelist_node	*next;
+};
+
+struct freelist_head {
+	struct freelist_node	*head;
+};
+
+#define REFS_ON_FREELIST 0x80000000
+#define REFS_MASK	 0x7FFFFFFF
+
+static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
+{
+	/*
+	 * Since the refcount is zero, and nobody can increase it once it's
+	 * zero (except us, and we run only one copy of this method per node at
+	 * a time, i.e. the single thread case), then we know we can safely
+	 * change the next pointer of the node; however, once the refcount is
+	 * back above zero, then other threads could increase it (happens under
+	 * heavy contention, when the refcount goes to zero in between a load
+	 * and a refcount increment of a node in try_get, then back up to
+	 * something non-zero, then the refcount increment is done by the other
+	 * thread) -- so if the CAS to add the node to the actual list fails,
+	 * decrease the refcount and leave the add operation to the next thread
+	 * who puts the refcount back to zero (which could be us, hence the
+	 * loop).
+	 */
+	struct freelist_node *head = READ_ONCE(list->head);
+
+	for (;;) {
+		WRITE_ONCE(node->next, head);
+		atomic_set_release(&node->refs, 1);
+
+		if (!try_cmpxchg_release(&list->head, &head, node)) {
+			/*
+			 * Hmm, the add failed, but we can only try again when
+			 * the refcount goes back to zero.
+			 */
+			if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
+				continue;
+		}
+		return;
+	}
+}
+
+static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
+{
+	/*
+	 * We know that the should-be-on-freelist bit is 0 at this point, so
+	 * it's safe to set it using a fetch_add.
+	 */
+	if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
+		/*
+		 * Oh look! We were the last ones referencing this node, and we
+		 * know we want to add it to the free list, so let's do it!
+		 */
+		__freelist_add(node, list);
+	}
+}
+
+static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
+{
+	struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
+	unsigned int refs;
+
+	while (head) {
+		prev = head;
+		refs = atomic_read(&head->refs);
+		if ((refs & REFS_MASK) == 0 ||
+		    !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
+			head = smp_load_acquire(&list->head);
+			continue;
+		}
+
+		/*
+		 * Good, reference count has been incremented (it wasn't at
+		 * zero), which means we can read the next and not worry about
+		 * it changing between now and the time we do the CAS.
+		 */
+		next = READ_ONCE(head->next);
+		if (try_cmpxchg_acquire(&list->head, &head, next)) {
+			/*
+			 * Yay, got the node. This means it was on the list,
+			 * which means should-be-on-freelist must be false no
+			 * matter the refcount (because nobody else knows it's
+			 * been taken off yet, it can't have been put back on).
+			 */
+			WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
+
+			/*
+			 * Decrease refcount twice, once for our ref, and once
+			 * for the list's ref.
+			 */
+			atomic_fetch_add(-2, &head->refs);
+
+			return head;
+		}
+
+		/*
+		 * OK, the head must have changed on us, but we still need to decrement
+		 * the refcount we increased.
+		 */
+		refs = atomic_fetch_add(-1, &prev->refs);
+		if (refs == REFS_ON_FREELIST + 1)
+			__freelist_add(prev, list);
+	}
+
+	return NULL;
+}
+
+#endif /* FREELIST_H */
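
[ Not part of the patch -- a minimal usage sketch, assuming the nodes
  are embedded in zero-initialized objects that outlive the freelist
  itself (the requirement stated in the header comment). 'struct item'
  and the helpers are made-up names: ]

	#include <linux/freelist.h>
	#include <linux/slab.h>

	struct item {
		struct freelist_node fnode;
		int payload;
	};

	static struct freelist_head pool;

	static int pool_init(int nr)
	{
		while (nr--) {
			/* kzalloc: freelist_node::refs must start at 0 */
			struct item *it = kzalloc(sizeof(*it), GFP_KERNEL);

			if (!it)
				return -ENOMEM;
			freelist_add(&it->fnode, &pool);
		}
		return 0;
	}

	static struct item *item_get(void)
	{
		struct freelist_node *fn = freelist_try_get(&pool);

		return fn ? container_of(fn, struct item, fnode) : NULL;
	}

	static void item_put(struct item *it)
	{
		freelist_add(&it->fnode, &pool);
	}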




* [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
                   ` (5 preceding siblings ...)
  2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
@ 2020-08-27 16:12 ` Peter Zijlstra
  2020-08-28  8:48   ` peterz
  6 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2020-08-27 16:12 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck, peterz

Gets rid of rp->lock, and as a result kretprobes are now fully
lockless.
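
[ Condensed, the instance life cycle after this patch (adapted from the
  diffs in this series, just putting the pieces side by side): ]

	/* function entry: pre_handler_kretprobe() */
	fn = freelist_try_get(&rp->freelist);		/* lockless get */
	if (fn) {
		ri = container_of(fn, struct kretprobe_instance, freelist);
		__llist_add(&ri->llist, &current->kretprobe_instances);
	}

	/* function return: __kretprobe_trampoline_handler() pops the
	 * task-local llist, runs the handler, then recycles: */
	freelist_add(&ri->freelist, &rp->freelist);	/* lockless put */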

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kprobes.h |   11 ++++++--
 kernel/kprobes.c        |   63 +++++++++++++++++++-----------------------------
 2 files changed, 34 insertions(+), 40 deletions(-)

--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -27,6 +27,7 @@
 #include <linux/rcupdate.h>
 #include <linux/mutex.h>
 #include <linux/ftrace.h>
+#include <linux/freelist.h>
 #include <asm/kprobes.h>
 
 #ifdef CONFIG_KPROBES
@@ -151,14 +152,18 @@ struct kretprobe {
 	int maxactive;
 	int nmissed;
 	size_t data_size;
-	struct hlist_head free_instances;
-	raw_spinlock_t lock;
+	struct freelist_head freelist;
 };
 
 struct kretprobe_instance {
 	union {
+		/*
+		 * Dodgy as heck, this relies on not clobbering freelist::refs.
+		 * llist: only clobbers freelist::next.
+		 * rcu: clobbers both, but only after rp::freelist is gone.
+		 */
+		struct freelist_node freelist;
 		struct llist_node llist;
-		struct hlist_node hlist;
 		struct rcu_head rcu;
 	};
 	struct kretprobe *rp;
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1219,12 +1219,8 @@ static void recycle_rp_inst(struct kretp
 	struct kretprobe *rp = ri->rp;
 
 	/* remove rp inst off the rprobe_inst_table */
-	hlist_del(&ri->hlist);
-	INIT_HLIST_NODE(&ri->hlist);
 	if (likely(rp)) {
-		raw_spin_lock(&rp->lock);
-		hlist_add_head(&ri->hlist, &rp->free_instances);
-		raw_spin_unlock(&rp->lock);
+		freelist_add(&ri->freelist, &rp->freelist);
 	} else {
 		kfree_rcu(ri, rcu);
 	}
@@ -1286,10 +1282,13 @@ NOKPROBE_SYMBOL(kprobe_flush_task);
 static inline void free_rp_inst(struct kretprobe *rp)
 {
 	struct kretprobe_instance *ri;
-	struct hlist_node *next;
+	struct freelist_node *node;
+
+	node = rp->freelist.head;
+	while (node) {
+		ri = container_of(node, struct kretprobe_instance, freelist);
+		node = node->next;
 
-	hlist_for_each_entry_safe(ri, next, &rp->free_instances, hlist) {
-		hlist_del(&ri->hlist);
 		kfree(ri);
 	}
 }
@@ -1986,36 +1985,28 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_h
 static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 {
 	struct kretprobe *rp = container_of(p, struct kretprobe, kp);
-	unsigned long hash, flags = 0;
 	struct kretprobe_instance *ri;
+	struct freelist_node *fn;
 
-	/* TODO: consider to only swap the RA after the last pre_handler fired */
-	hash = hash_ptr(current, KPROBE_HASH_BITS);
-	raw_spin_lock_irqsave(&rp->lock, flags);
-	if (!hlist_empty(&rp->free_instances)) {
-		ri = hlist_entry(rp->free_instances.first,
-				struct kretprobe_instance, hlist);
-		hlist_del(&ri->hlist);
-		raw_spin_unlock_irqrestore(&rp->lock, flags);
-
-		ri->rp = rp;
-		ri->task = current;
-
-		if (rp->entry_handler && rp->entry_handler(ri, regs)) {
-			raw_spin_lock_irqsave(&rp->lock, flags);
-			hlist_add_head(&ri->hlist, &rp->free_instances);
-			raw_spin_unlock_irqrestore(&rp->lock, flags);
-			return 0;
-		}
-
-		arch_prepare_kretprobe(ri, regs);
+	fn = freelist_try_get(&rp->freelist);
+	if (!fn) {
+		rp->nmissed++;
+		return 0;
+	}
 
-		__llist_add(&ri->llist, &current->kretprobe_instances);
+	ri = container_of(fn, struct kretprobe_instance, freelist);
+	ri->rp = rp;
+	ri->task = current;
 
-	} else {
-		rp->nmissed++;
-		raw_spin_unlock_irqrestore(&rp->lock, flags);
+	if (rp->entry_handler && rp->entry_handler(ri, regs)) {
+		freelist_add(&ri->freelist, &rp->freelist);
+		return 0;
 	}
+
+	arch_prepare_kretprobe(ri, regs);
+
+	__llist_add(&ri->llist, &current->kretprobe_instances);
+
 	return 0;
 }
 NOKPROBE_SYMBOL(pre_handler_kretprobe);
@@ -2072,8 +2063,7 @@ int register_kretprobe(struct kretprobe
 		rp->maxactive = num_possible_cpus();
 #endif
 	}
-	raw_spin_lock_init(&rp->lock);
-	INIT_HLIST_HEAD(&rp->free_instances);
+	rp->freelist.head = NULL;
 	for (i = 0; i < rp->maxactive; i++) {
 		inst = kmalloc(sizeof(struct kretprobe_instance) +
 			       rp->data_size, GFP_KERNEL);
@@ -2081,8 +2071,7 @@ int register_kretprobe(struct kretprobe
 			free_rp_inst(rp);
 			return -ENOMEM;
 		}
-		INIT_HLIST_NODE(&inst->hlist);
-		hlist_add_head(&inst->hlist, &rp->free_instances);
+		freelist_add(&inst->freelist, &rp->freelist);
 	}
 
 	rp->nmissed = 0;




* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
@ 2020-08-27 16:37   ` peterz
       [not found]     ` <CAFCw3doX6KK5DwpG_OB331Mdw8uYeVqn8YPTjKh_a-m7ZB9+3A@mail.gmail.com>
  2020-08-27 19:08   ` Boqun Feng
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: peterz @ 2020-08-27 16:37 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck

On Thu, Aug 27, 2020 at 06:12:43PM +0200, Peter Zijlstra wrote:
> +struct freelist_node {
> +	atomic_t		refs;
> +	struct freelist_node	*next;
> +};

Bah, the next patch relies on this structure to be ordered the other
way around.

Clearly writing code and listening to talks isn't ideal :-)
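
[ Presumably the issue: with 'refs' first, the single-pointer
  llist_node in the next patch's union overlaps freelist_node::refs
  rather than freelist_node::next, breaking the "llist only clobbers
  freelist::next" assumption made there. The intended layout would then
  look like: ]

	struct freelist_node {
		struct freelist_node	*next;	/* shares storage with llist_node::next */
		atomic_t		refs;	/* must survive llist usage */
	};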


* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
       [not found]     ` <CAFCw3doX6KK5DwpG_OB331Mdw8uYeVqn8YPTjKh_a-m7ZB9+3A@mail.gmail.com>
@ 2020-08-27 16:56       ` peterz
  2020-08-27 17:00         ` Cameron
  0 siblings, 1 reply; 38+ messages in thread
From: peterz @ 2020-08-27 16:56 UTC (permalink / raw)
  To: Cameron
  Cc: linux-kernel, mhiramat, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, oleg, will,
	paulmck

On Thu, Aug 27, 2020 at 12:49:20PM -0400, Cameron wrote:
> For what it's worth, the freelist.h code seems to be a faithful adaptation
> of my original blog post code. Didn't think it would end up in the Linux
> kernel one day :-)

Hehe, I ran into the traditional ABA problem for the lockless stack and
asked Google; your blog came up.
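
[ For readers: the classic ABA failure on a naive lock-free stack pop,
  which the per-node refcount in freelist.h is there to avoid, looks
  roughly like this: ]

	/* naive pop */
	do {
		head = READ_ONCE(list->head);
		next = head->next;
	} while (cmpxchg(&list->head, head, next) != head);

	/*
	 * If another thread pops 'head' (and maybe 'next'), re-links or
	 * reuses them, and then pushes 'head' back, the cmpxchg() still
	 * succeeds -- same pointer value -- but installs a stale 'next',
	 * corrupting the list.
	 */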

I'll try and actually think about it a little when the current (virtual)
conference is over.

Are you OK with the license I put on it, GPLv2 or BSD-2?

> I'm just wondering if the assumption that "nodes are never freed until
> after the free list is destroyed" will hold true in the intended use case?

It does; the nodes are only deleted once the whole freelist object is
discarded.


* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 16:56       ` peterz
@ 2020-08-27 17:00         ` Cameron
  0 siblings, 0 replies; 38+ messages in thread
From: Cameron @ 2020-08-27 17:00 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, mhiramat, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, oleg, will,
	paulmck

On Thu, Aug 27, 2020 at 12:56 PM <peterz@infradead.org> wrote:
> Are you Ok with the License I put on it, GPLv2 or BDS-2 ?
Yes, both GPLv2 and BSD-2 (my personal favourite) are fine.


* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-27 16:12 ` [RFC][PATCH 3/7] kprobes: Remove kretprobe hash Peter Zijlstra
@ 2020-08-27 18:00   ` Masami Hiramatsu
  2020-08-28  8:44     ` peterz
  2020-08-28  9:07     ` Masami Hiramatsu
  2020-08-28  4:44   ` Masami Hiramatsu
  2020-08-28 13:11   ` Eddy_Wu
  2 siblings, 2 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-27 18:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Eddy_Wu, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Thu, 27 Aug 2020 18:12:40 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> +static void invalidate_rp_inst(struct task_struct *t, struct kretprobe *rp)
> +{
> +	struct invl_rp_ipi iri = {
> +		.task = t,
> +		.rp = rp,
> +		.done = false
> +	};
> +
> +	for (;;) {
> +		if (try_invoke_on_locked_down_task(t, __invalidate_rp_inst, rp))
> +			return;
> +
> +		smp_call_function_single(task_cpu(t), __invalidate_rp_ipi, &iri, 1);
> +		if (iri.done)
> +			return;
> +	}

Hmm, what about making a status placeholder and pointing to it from
each instance, to tell whether it is valid or not?

struct kretprobe_holder {
	atomic_t refcnt;
	struct kretprobe *rp;
};

struct kretprobe {
...
	struct kretprobe_holder	*rph;	// allocate at register
...
};

struct kretprobe_instance {
...
	struct kretprobe_holder	*rph; // free if refcnt == 0
...
};

cleanup_rp_inst(struct kretprobe *rp)
{
	rp->rph->rp = NULL;
}

kretprobe_trampoline_handler()
{
...
	rp = READ_ONCE(ri->rph->rp);
	if (likely(rp)) {
		// call rp->handler
	} else
		call_rcu(&ri->rcu, free_rp_inst_rcu);
...
}

free_rp_inst_rcu()
{
	if (!atomic_dec_return(&ri->rph->refcnt))
		kfree(ri->rph);
	kfree(ri);
}

This increases kretprobe_instance a bit, but makes things simpler
(and still keeps it lockless; the atomic op is in the RCU callback).

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
  2020-08-27 16:37   ` peterz
@ 2020-08-27 19:08   ` Boqun Feng
  2020-08-27 19:57     ` Cameron
  2020-08-28  4:03   ` Lai Jiangshan
  2020-08-28 14:46   ` Oleg Nesterov
  3 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2020-08-27 19:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mhiramat, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, oleg,
	will, paulmck

On Thu, Aug 27, 2020 at 06:12:43PM +0200, Peter Zijlstra wrote:
> 
> 
> Cc: cameron@moodycamel.com
> Cc: oleg@redhat.com
> Cc: will@kernel.org
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/freelist.h |  129 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
> 
> --- /dev/null
> +++ b/include/linux/freelist.h
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +#ifndef FREELIST_H
> +#define FREELIST_H
> +
> +#include <linux/atomic.h>
> +
> +/*
> + * Copyright: cameron@moodycamel.com
> + *
> + * A simple CAS-based lock-free free list. Not the fastest thing in the world
> + * under heavy contention, but simple and correct (assuming nodes are never
> + * freed until after the free list is destroyed), and fairly speedy under low
> + * contention.
> + *
> + * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
> + */
> +
> +struct freelist_node {
> +	atomic_t		refs;
> +	struct freelist_node	*next;
> +};
> +
> +struct freelist_head {
> +	struct freelist_node	*head;
> +};
> +
> +#define REFS_ON_FREELIST 0x80000000
> +#define REFS_MASK	 0x7FFFFFFF
> +
> +static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> +	/*
> +	 * Since the refcount is zero, and nobody can increase it once it's
> +	 * zero (except us, and we run only one copy of this method per node at
> +	 * a time, i.e. the single thread case), then we know we can safely
> +	 * change the next pointer of the node; however, once the refcount is
> +	 * back above zero, then other threads could increase it (happens under
> +	 * heavy contention, when the refcount goes to zero in between a load
> +	 * and a refcount increment of a node in try_get, then back up to
> +	 * something non-zero, then the refcount increment is done by the other
> +	 * thread) -- so if the CAS to add the node to the actual list fails,
> +	 * decrease the refcount and leave the add operation to the next thread
> +	 * who puts the refcount back to zero (which could be us, hence the
> +	 * loop).
> +	 */
> +	struct freelist_node *head = READ_ONCE(list->head);
> +
> +	for (;;) {
> +		WRITE_ONCE(node->next, head);
> +		atomic_set_release(&node->refs, 1);
> +
> +		if (!try_cmpxchg_release(&list->head, &head, node)) {
> +			/*
> +			 * Hmm, the add failed, but we can only try again when
> +			 * the refcount goes back to zero.
> +			 */
> +			if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> +				continue;
> +		}
> +		return;
> +	}
> +}
> +
> +static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> +	/*
> +	 * We know that the should-be-on-freelist bit is 0 at this point, so
> +	 * it's safe to set it using a fetch_add.
> +	 */
> +	if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
> +		/*
> +		 * Oh look! We were the last ones referencing this node, and we
> +		 * know we want to add it to the free list, so let's do it!
> +		 */
> +		__freelist_add(node, list);
> +	}
> +}
> +
> +static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
> +{
> +	struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
> +	unsigned int refs;
> +
> +	while (head) {
> +		prev = head;
> +		refs = atomic_read(&head->refs);
> +		if ((refs & REFS_MASK) == 0 ||
> +		    !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
> +			head = smp_load_acquire(&list->head);
> +			continue;
> +		}
> +
> +		/*
> +		 * Good, reference count has been incremented (it wasn't at
> +		 * zero), which means we can read the next and not worry about
> +		 * it changing between now and the time we do the CAS.
> +		 */
> +		next = READ_ONCE(head->next);
> +		if (try_cmpxchg_acquire(&list->head, &head, next)) {

So if try_cmpxchg_acquire() fails, we don't have ACQUIRE semantics on
read of the new list->head, right? Then probably a
smp_mb__after_atomic() is needed in that case?

Regards,
Boqun

> +			/*
> +			 * Yay, got the node. This means it was on the list,
> +			 * which means should-be-on-freelist must be false no
> +			 * matter the refcount (because nobody else knows it's
> +			 * been taken off yet, it can't have been put back on).
> +			 */
> +			WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> +
> +			/*
> +			 * Decrease refcount twice, once for our ref, and once
> +			 * for the list's ref.
> +			 */
> +			atomic_fetch_add(-2, &head->refs);
> +
> +			return head;
> +		}
> +
> +		/*
> +		 * OK, the head must have changed on us, but we still need to decrement
> +		 * the refcount we increased.
> +		 */
> +		refs = atomic_fetch_add(-1, &prev->refs);
> +		if (refs == REFS_ON_FREELIST + 1)
> +			__freelist_add(prev, list);
> +	}
> +
> +	return NULL;
> +}
> +
> +#endif /* FREELIST_H */
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 19:08   ` Boqun Feng
@ 2020-08-27 19:57     ` Cameron
  2020-08-28  1:34       ` Boqun Feng
  0 siblings, 1 reply; 38+ messages in thread
From: Cameron @ 2020-08-27 19:57 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, mhiramat, Eddy_Wu, x86, davem,
	rostedt, naveen.n.rao, anil.s.keshavamurthy, linux-arch, oleg,
	will, paulmck

On Thu, Aug 27, 2020 at 3:08 PM Boqun Feng <boqun.feng@gmail.com> wrote:
> So if try_cmpxchg_acquire() fails, we don't have ACQUIRE semantics on
> read of the new list->head, right? Then probably a
> smp_mb__after_atomic() is needed in that case?

Yes, there needs to be an acquire on the head after a failed cmpxchg;
does the atomic_fetch_add following that not have acquire semantics?

Cameron

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 19:57     ` Cameron
@ 2020-08-28  1:34       ` Boqun Feng
  0 siblings, 0 replies; 38+ messages in thread
From: Boqun Feng @ 2020-08-28  1:34 UTC (permalink / raw)
  To: Cameron
  Cc: Peter Zijlstra, linux-kernel, mhiramat, Eddy_Wu, x86, davem,
	rostedt, naveen.n.rao, anil.s.keshavamurthy, linux-arch, oleg,
	will, paulmck

On Thu, Aug 27, 2020 at 03:57:22PM -0400, Cameron wrote:
> On Thu, Aug 27, 2020 at 3:08 PM Boqun Feng <boqun.feng@gmail.com> wrote:
> > So if try_cmpxchg_acquire() fails, we don't have ACQUIRE semantics on
> > read of the new list->head, right? Then probably a
> > smp_mb__after_atomic() is needed in that case?
> 
> Yes, there needs to be an acquire on the head after a failed cmpxchg;
> does the atomic_fetch_add following that not have acquire semantics?
> 

Yes, you're right, the atomic_fetch_add() that follows is a fully ordered
atomic, so it provides the necessary ACQUIRE semantics. I was missing
that. Maybe a few words explaining this would help.
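
Something along these lines next to that cmpxchg in freelist_try_get()
might do (suggested wording only, not text from any posted patch):

	/*
	 * If the try_cmpxchg_acquire() above fails we do not reload
	 * list->head with ACQUIRE semantics here; instead we rely on the
	 * fully ordered atomic_fetch_add() on prev->refs below, which is
	 * at least as strong and orders the reload on the next iteration.
	 */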

Regards,
Boqun

> Cameron

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
  2020-08-27 16:37   ` peterz
  2020-08-27 19:08   ` Boqun Feng
@ 2020-08-28  4:03   ` Lai Jiangshan
  2020-08-28 14:46   ` Oleg Nesterov
  3 siblings, 0 replies; 38+ messages in thread
From: Lai Jiangshan @ 2020-08-28  4:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Masami Hiramatsu, Eddy_Wu, X86 ML, davem, Steven Rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron,
	Oleg Nesterov, Will Deacon, Paul E. McKenney

On Fri, Aug 28, 2020 at 12:23 AM Peter Zijlstra <peterz@infradead.org> wrote:

> +static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> +       /*
> +        * Since the refcount is zero, and nobody can increase it once it's
> +        * zero (except us, and we run only one copy of this method per node at
> +        * a time, i.e. the single thread case), then we know we can safely


> +
> +               /*
> +                * OK, the head must have changed on us, but we still need to decrement
> +                * the refcount we increased.
> +                */
> +               refs = atomic_fetch_add(-1, &prev->refs);
> +               if (refs == REFS_ON_FREELIST + 1)
> +                       __freelist_add(prev, list);

I'm curious whether it would be correct to just set prev->refs to zero and
return @prev. That would remove an unneeded "add() & get()" pair (albeit in
an unlikely branch), and __freelist_add() could be folded into freelist_add()
for tidier code.

Thanks
Lai.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-27 16:12 ` [RFC][PATCH 3/7] kprobes: Remove kretprobe hash Peter Zijlstra
  2020-08-27 18:00   ` Masami Hiramatsu
@ 2020-08-28  4:44   ` Masami Hiramatsu
  2020-08-28 13:11   ` Eddy_Wu
  2 siblings, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28  4:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Eddy_Wu, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Thu, 27 Aug 2020 18:12:40 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> @@ -1313,25 +1261,28 @@ void kprobe_busy_end(void)
>  void kprobe_flush_task(struct task_struct *tk)
>  {
>  	struct kretprobe_instance *ri;
> -	struct hlist_head *head, empty_rp;
> +	struct hlist_head empty_rp;
> +	struct llist_node *node;
>  	struct hlist_node *tmp;

We don't need this tmp anymore.

> @@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han
>  					unsigned long trampoline_address,
>  					void *frame_pointer)
>  {
> +	kprobe_opcode_t *correct_ret_addr = NULL;
>  	struct kretprobe_instance *ri = NULL;
> -	struct hlist_head *head, empty_rp;
> +	unsigned long orig_ret_address = 0;
> +	struct llist_node *first, *node;
> +	struct hlist_head empty_rp;
>  	struct hlist_node *tmp;

Here too.

I'm trying to port this patch onto my v4 series. I'll add my RFC patch for
kretprobe_holder too.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-27 18:00   ` Masami Hiramatsu
@ 2020-08-28  8:44     ` peterz
  2020-08-28  9:07     ` Masami Hiramatsu
  1 sibling, 0 replies; 38+ messages in thread
From: peterz @ 2020-08-28  8:44 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Eddy_Wu, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, Aug 28, 2020 at 03:00:59AM +0900, Masami Hiramatsu wrote:
> On Thu, 27 Aug 2020 18:12:40 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +static void invalidate_rp_inst(struct task_struct *t, struct kretprobe *rp)
> > +{
> > +	struct invl_rp_ipi iri = {
> > +		.task = t,
> > +		.rp = rp,
> > +		.done = false
> > +	};
> > +
> > +	for (;;) {
> > +		if (try_invoke_on_locked_down_task(t, __invalidate_rp_inst, rp))
> > +			return;
> > +
> > +		smp_call_function_single(task_cpu(t), __invalidate_rp_ipi, &iri, 1);
> > +		if (iri.done)
> > +			return;
> > +	}
> 
> Hmm, what about making a status place holder and point it from
> each instance to tell it is valid or not?
> 
> struct kretprobe_holder {
> 	atomic_t refcnt;
> 	struct kretprobe *rp;
> };
> 
> struct kretprobe {
> ...
> 	struct kretprobe_holder	*rph;	// allocate at register
> ...
> };
> 
> struct kretprobe_instance {
> ...
> 	struct kretprobe_holder	*rph; // free if refcnt == 0
> ...
> };
> 
> cleanup_rp_inst(struct kretprobe *rp)
> {
> 	rp->rph->rp = NULL;
> }
> 
> kretprobe_trampoline_handler()
> {
> ...
> 	rp = READ_ONCE(ri->rph-rp);
> 	if (likely(rp)) {
> 		// call rp->handler
> 	} else
> 		rcu_call(ri, free_rp_inst_rcu);
> ...
> }
> 
> free_rp_inst_rcu()
> {
> 	if (!atomic_dec_return(ri->rph->refcnt))
> 		kfree(ri->rph);
> 	kfree(ri);
> }
> 
> This increase kretprobe_instance a bit, but make things simpler.
> (and still keep lockless, atomic op is in the rcu callback).

Yes, much better.

Although, I'd _love_ to get rid of rp->data_size; then we could simplify
all of this even more. I was thinking we could then have a single global
freelist and add a small per-CPU cache to it (say 4-8 entries) to
avoid the worst of the contention.

And then make function-graph use this, instead of the other way around
:-)
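
A very rough sketch of that per-CPU cache in front of a single global
freelist (names, cache size and API usage are all assumptions of mine,
not from any posted patch; kprobes firing from IRQ/NMI context on the
same CPU are ignored here):

#include <linux/freelist.h>
#include <linux/percpu.h>

#define KRI_CACHE_SIZE 4

struct kri_cpu_cache {
	struct freelist_node	*node[KRI_CACHE_SIZE];
	int			nr;
};

static DEFINE_PER_CPU(struct kri_cpu_cache, kri_cache);
static struct freelist_head kri_freelist;	/* single global freelist */

static struct freelist_node *kri_get(void)
{
	struct kri_cpu_cache *c = get_cpu_ptr(&kri_cache);	/* preempt off */
	struct freelist_node *n = NULL;

	if (c->nr)				/* fast path: per-CPU cache */
		n = c->node[--c->nr];
	put_cpu_ptr(&kri_cache);

	if (!n)					/* slow path: global list */
		n = freelist_try_get(&kri_freelist);
	return n;
}

static void kri_put(struct freelist_node *n)
{
	struct kri_cpu_cache *c = get_cpu_ptr(&kri_cache);

	if (c->nr < KRI_CACHE_SIZE) {
		c->node[c->nr++] = n;		/* keep it CPU-local */
		put_cpu_ptr(&kri_cache);
		return;
	}
	put_cpu_ptr(&kri_cache);

	freelist_add(n, &kri_freelist);		/* cache full, hand it back */
}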

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-27 16:12 ` [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist Peter Zijlstra
@ 2020-08-28  8:48   ` peterz
  2020-08-28  9:13     ` Masami Hiramatsu
  0 siblings, 1 reply; 38+ messages in thread
From: peterz @ 2020-08-28  8:48 UTC (permalink / raw)
  To: linux-kernel, mhiramat
  Cc: Eddy_Wu, x86, davem, rostedt, naveen.n.rao, anil.s.keshavamurthy,
	linux-arch, cameron, oleg, will, paulmck

On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote:
>  struct kretprobe_instance {
>  	union {
> +		/*
> +		 * Dodgy as heck, this relies on not clobbering freelist::refs.
> +		 * llist: only clobbers freelist::next.
> +		 * rcu: clobbers both, but only after rp::freelist is gone.
> +		 */
> +		struct freelist_node freelist;
>  		struct llist_node llist;
> -		struct hlist_node hlist;
>  		struct rcu_head rcu;
>  	};

Masami, make sure to make this something like:

	union {
		struct freelist_node freelist;
		struct rcu_head rcu;
	};
	struct llist_node llist;

for v4, because after some sleep I'm fairly sure what I wrote above was
broken.

We'll only use RCU once the freelist is gone, so sharing that storage
should still be okay.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-27 18:00   ` Masami Hiramatsu
  2020-08-28  8:44     ` peterz
@ 2020-08-28  9:07     ` Masami Hiramatsu
  1 sibling, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28  9:07 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, linux-kernel, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, oleg,
	will, paulmck

On Fri, 28 Aug 2020 03:00:59 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Thu, 27 Aug 2020 18:12:40 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +static void invalidate_rp_inst(struct task_struct *t, struct kretprobe *rp)
> > +{
> > +	struct invl_rp_ipi iri = {
> > +		.task = t,
> > +		.rp = rp,
> > +		.done = false
> > +	};
> > +
> > +	for (;;) {
> > +		if (try_invoke_on_locked_down_task(t, __invalidate_rp_inst, rp))
> > +			return;
> > +
> > +		smp_call_function_single(task_cpu(t), __invalidate_rp_ipi, &iri, 1);
> > +		if (iri.done)
> > +			return;
> > +	}
> 
> Hmm, what about making a status place holder and point it from
> each instance to tell it is valid or not?
> 
> struct kretprobe_holder {
> 	atomic_t refcnt;
> 	struct kretprobe *rp;
> };
> 
> struct kretprobe {
> ...
> 	struct kretprobe_holder	*rph;	// allocate at register
> ...
> };
> 
> struct kretprobe_instance {
> ...
> 	struct kretprobe_holder	*rph; // free if refcnt == 0
> ...
> };
> 
> cleanup_rp_inst(struct kretprobe *rp)
> {
> 	rp->rph->rp = NULL;
> }
> 
> kretprobe_trampoline_handler()
> {
> ...
> 	rp = READ_ONCE(ri->rph-rp);
> 	if (likely(rp)) {
> 		// call rp->handler
> 	} else
> 		rcu_call(ri, free_rp_inst_rcu);
> ...
> }
> 
> free_rp_inst_rcu()
> {
> 	if (!atomic_dec_return(ri->rph->refcnt))
> 		kfree(ri->rph);
> 	kfree(ri);
> }
> 
> This increase kretprobe_instance a bit, but make things simpler.
> (and still keep lockless, atomic op is in the rcu callback).

OK, I've written the code and run a smoke test on it.
I'll send it with the 4th version of my series.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-28  8:48   ` peterz
@ 2020-08-28  9:13     ` Masami Hiramatsu
  2020-08-28  9:18       ` peterz
  0 siblings, 1 reply; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28  9:13 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, Eddy_Wu, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, 28 Aug 2020 10:48:51 +0200
peterz@infradead.org wrote:

> On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote:
> >  struct kretprobe_instance {
> >  	union {
> > +		/*
> > +		 * Dodgy as heck, this relies on not clobbering freelist::refs.
> > +		 * llist: only clobbers freelist::next.
> > +		 * rcu: clobbers both, but only after rp::freelist is gone.
> > +		 */
> > +		struct freelist_node freelist;
> >  		struct llist_node llist;
> > -		struct hlist_node hlist;
> >  		struct rcu_head rcu;
> >  	};
> 
> Masami, make sure to make this something like:
> 
> 	union {
> 		struct freelist_node freelist;
> 		struct rcu_head rcu;
> 	};
> 	struct llist_node llist;
> 
> for v4, because after some sleep I'm fairly sure what I wrote above was
> broken.
> 
> We'll only use RCU once the freelist is gone, so sharing that storage
> should still be okay.

Hmm, do you mean there is a chance that an instance belongs to
both the freelist and the llist at the same time?

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-28  9:13     ` Masami Hiramatsu
@ 2020-08-28  9:18       ` peterz
  2020-08-28 10:44         ` Masami Hiramatsu
  2020-08-29  2:29         ` Cameron
  0 siblings, 2 replies; 38+ messages in thread
From: peterz @ 2020-08-28  9:18 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Eddy_Wu, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, Aug 28, 2020 at 06:13:41PM +0900, Masami Hiramatsu wrote:
> On Fri, 28 Aug 2020 10:48:51 +0200
> peterz@infradead.org wrote:
> 
> > On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote:
> > >  struct kretprobe_instance {
> > >  	union {
> > > +		/*
> > > +		 * Dodgy as heck, this relies on not clobbering freelist::refs.
> > > +		 * llist: only clobbers freelist::next.
> > > +		 * rcu: clobbers both, but only after rp::freelist is gone.
> > > +		 */
> > > +		struct freelist_node freelist;
> > >  		struct llist_node llist;
> > > -		struct hlist_node hlist;
> > >  		struct rcu_head rcu;
> > >  	};
> > 
> > Masami, make sure to make this something like:
> > 
> > 	union {
> > 		struct freelist_node freelist;
> > 		struct rcu_head rcu;
> > 	};
> > 	struct llist_node llist;
> > 
> > for v4, because after some sleep I'm fairly sure what I wrote above was
> > broken.
> > 
> > We'll only use RCU once the freelist is gone, so sharing that storage
> > should still be okay.
> 
> Hmm, would you mean there is a chance that an instance belongs to
> both freelist and llist?

So the freelist->refs thing is supposed to pin freelist->next for
concurrent usage, but if we instantly stick it on the
current->kretprobe_instances llist while it's still elevated, we'll
overwrite ->next, which would be bad.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-28  9:18       ` peterz
@ 2020-08-28 10:44         ` Masami Hiramatsu
  2020-08-29  2:29         ` Cameron
  1 sibling, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28 10:44 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, Eddy_Wu, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, 28 Aug 2020 11:18:13 +0200
peterz@infradead.org wrote:

> On Fri, Aug 28, 2020 at 06:13:41PM +0900, Masami Hiramatsu wrote:
> > On Fri, 28 Aug 2020 10:48:51 +0200
> > peterz@infradead.org wrote:
> > 
> > > On Thu, Aug 27, 2020 at 06:12:44PM +0200, Peter Zijlstra wrote:
> > > >  struct kretprobe_instance {
> > > >  	union {
> > > > +		/*
> > > > +		 * Dodgy as heck, this relies on not clobbering freelist::refs.
> > > > +		 * llist: only clobbers freelist::next.
> > > > +		 * rcu: clobbers both, but only after rp::freelist is gone.
> > > > +		 */
> > > > +		struct freelist_node freelist;
> > > >  		struct llist_node llist;
> > > > -		struct hlist_node hlist;
> > > >  		struct rcu_head rcu;
> > > >  	};
> > > 
> > > Masami, make sure to make this something like:
> > > 
> > > 	union {
> > > 		struct freelist_node freelist;
> > > 		struct rcu_head rcu;
> > > 	};
> > > 	struct llist_node llist;
> > > 
> > > for v4, because after some sleep I'm fairly sure what I wrote above was
> > > broken.
> > > 
> > > We'll only use RCU once the freelist is gone, so sharing that storage
> > > should still be okay.
> > 
> > Hmm, would you mean there is a chance that an instance belongs to
> > both freelist and llist?
> 
> So the freelist->refs thing is supposed to pin freelist->next for
> concurrent usage, but if we instantly stick it on the
> current->kretprobe_instances llist while it's still elevated, we'll
> overwrite ->next, which would be bad.

OK, I'll pick these up for my v4 series with that fix.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-27 16:12 ` [RFC][PATCH 3/7] kprobes: Remove kretprobe hash Peter Zijlstra
  2020-08-27 18:00   ` Masami Hiramatsu
  2020-08-28  4:44   ` Masami Hiramatsu
@ 2020-08-28 13:11   ` Eddy_Wu
  2020-08-28 13:38     ` peterz
                       ` (2 more replies)
  2 siblings, 3 replies; 38+ messages in thread
From: Eddy_Wu @ 2020-08-28 13:11 UTC (permalink / raw)
  To: Peter Zijlstra, Masami Hiramatsu
  Cc: linux-kernel, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: Friday, August 28, 2020 12:13 AM
> To: linux-kernel@vger.kernel.org; mhiramat@kernel.org
> Cc: Eddy Wu (RD-TW) <Eddy_Wu@trendmicro.com>; x86@kernel.org; davem@davemloft.net; rostedt@goodmis.org;
> naveen.n.rao@linux.ibm.com; anil.s.keshavamurthy@intel.com; linux-arch@vger.kernel.org; cameron@moodycamel.com;
> oleg@redhat.com; will@kernel.org; paulmck@kernel.org; peterz@infradead.org
> Subject: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
>
> @@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han
>                                         unsigned long trampoline_address,
>                                         void *frame_pointer)
>  {
> // ... removed
> // NULL here
> +       first = node = current->kretprobe_instances.first;
> +       while (node) {
> +               ri = container_of(node, struct kretprobe_instance, llist);
>
> -               orig_ret_address = (unsigned long)ri->ret_addr;
> -               if (skipped)
> -                       pr_warn("%ps must be blacklisted because of incorrect kretprobe order\n",
> -                               ri->rp->kp.addr);
> +               BUG_ON(ri->fp != frame_pointer);
>
> -               if (orig_ret_address != trampoline_address)
> +               orig_ret_address = (unsigned long)ri->ret_addr;
> +               if (orig_ret_address != trampoline_address) {
>                         /*
>                          * This is the real return address. Any other
>                          * instances associated with this task are for
>                          * other calls deeper on the call stack
>                          */
>                         break;
> +               }
> +
> +               node = node->next;
>         }
>

Hi, I found a NULL pointer dereference here, where current->kretprobe_instances.first == NULL in these two scenarios:

1) In task "rs:main Q:Reg"
# insmod samples/kprobes/kretprobe_example.ko func=schedule
# pkill sddm-greeter

2) In task "llvmpipe-10"
# insmod samples/kprobes/kretprobe_example.ko func=schedule
login plasmashell session from sddm graphical interface

This is based on Masami's v2 + Peter's lockless patch; I'll try the new branch once I can compile the kernel.

Stacktrace may not be really useful here:
[  402.008630] BUG: kernel NULL pointer dereference, address: 0000000000000018
[  402.008633] #PF: supervisor read access in kernel mode
[  402.008642] #PF: error_code(0x0000) - not-present page
[  402.008644] PGD 0 P4D 0
[  402.008646] Oops: 0000 [#1] PREEMPT SMP PTI
[  402.008649] CPU: 7 PID: 1505 Comm: llvmpipe-10 Kdump: loaded Not tainted 5.9.0-rc2-00111-g72091ec08f03-dirty #45
[  402.008650] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[  402.008653] RIP: 0010:__kretprobe_trampoline_handler+0xb8/0x17f
[  402.008655] Code: 65 4c 8b 34 25 80 6d 01 00 4c 89 e2 48 c7 c7 91 6b 85 91 49 8d b6 38 07 00 00 e8 d1 1a f9 ff 48 85 db 74 06 48 3b 5d d0 75 16 <49> 8b 75 18 48 c7 c7 a0 6c 85 91 48
 8b 56 28 e8 b2 1a f9 ff 0f 0b
[  402.008655] RSP: 0018:ffffab408147bde0 EFLAGS: 00010246
[  402.008656] RAX: 0000000000000021 RBX: 0000000000000000 RCX: 0000000000000002
[  402.008657] RDX: 0000000080000002 RSI: ffffffff9189757d RDI: 00000000ffffffff
[  402.008658] RBP: ffffab408147be20 R08: 0000000000000001 R09: 000000000000955c
[  402.008658] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000000
[  402.008659] R13: 0000000000000000 R14: ffff90736d305f40 R15: 0000000000000000
[  402.008661] FS:  00007f20f6ffd700(0000) GS:ffff9073781c0000(0000) knlGS:0000000000000000
[  402.008675] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  402.008678] CR2: 0000000000000018 CR3: 00000001ed256006 CR4: 00000000003706e0
[  402.008684] Call Trace:
[  402.008689]  ? elfcorehdr_read+0x40/0x40
[  402.008690]  ? elfcorehdr_read+0x40/0x40
[  402.008691]  trampoline_handler+0x42/0x60
[  402.008692]  kretprobe_trampoline+0x2a/0x50
[  402.008693] RIP: 0010:kretprobe_trampoline+0x0/0x50

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 13:11   ` Eddy_Wu
@ 2020-08-28 13:38     ` peterz
  2020-08-28 13:51     ` Masami Hiramatsu
  2020-08-28 14:49     ` Masami Hiramatsu
  2 siblings, 0 replies; 38+ messages in thread
From: peterz @ 2020-08-28 13:38 UTC (permalink / raw)
  To: Eddy_Wu
  Cc: Masami Hiramatsu, linux-kernel, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, oleg,
	will, paulmck

On Fri, Aug 28, 2020 at 01:11:15PM +0000, Eddy_Wu@trendmicro.com wrote:
> > -----Original Message-----
> > From: Peter Zijlstra <peterz@infradead.org>
> > Sent: Friday, August 28, 2020 12:13 AM
> > To: linux-kernel@vger.kernel.org; mhiramat@kernel.org
> > Cc: Eddy Wu (RD-TW) <Eddy_Wu@trendmicro.com>; x86@kernel.org; davem@davemloft.net; rostedt@goodmis.org;
> > naveen.n.rao@linux.ibm.com; anil.s.keshavamurthy@intel.com; linux-arch@vger.kernel.org; cameron@moodycamel.com;
> > oleg@redhat.com; will@kernel.org; paulmck@kernel.org; peterz@infradead.org
> > Subject: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
> >
> > @@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han
> >                                         unsigned long trampoline_address,
> >                                         void *frame_pointer)
> >  {
> > // ... removed
> > // NULL here
> > +       first = node = current->kretprobe_instances.first;
> > +       while (node) {
> > +               ri = container_of(node, struct kretprobe_instance, llist);
> >
> > -               orig_ret_address = (unsigned long)ri->ret_addr;
> > -               if (skipped)
> > -                       pr_warn("%ps must be blacklisted because of incorrect kretprobe order\n",
> > -                               ri->rp->kp.addr);
> > +               BUG_ON(ri->fp != frame_pointer);
> >
> > -               if (orig_ret_address != trampoline_address)
> > +               orig_ret_address = (unsigned long)ri->ret_addr;
> > +               if (orig_ret_address != trampoline_address) {
> >                         /*
> >                          * This is the real return address. Any other
> >                          * instances associated with this task are for
> >                          * other calls deeper on the call stack
> >                          */
> >                         break;
> > +               }
> > +
> > +               node = node->next;
> >         }
> >
> 
> Hi, I found a NULL pointer dereference here, where
> current->kretprobe_instances.first == NULL in these two scenario:

Hurmph, that would mean hitting the trampoline and not having a
kretprobe_instance, weird. Let me try and reproduce.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 13:11   ` Eddy_Wu
  2020-08-28 13:38     ` peterz
@ 2020-08-28 13:51     ` Masami Hiramatsu
  2020-08-28 13:58       ` peterz
  2020-08-28 14:11       ` Eddy_Wu
  2020-08-28 14:49     ` Masami Hiramatsu
  2 siblings, 2 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28 13:51 UTC (permalink / raw)
  To: Eddy_Wu
  Cc: Peter Zijlstra, linux-kernel, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, 28 Aug 2020 13:11:15 +0000
"Eddy_Wu@trendmicro.com" <Eddy_Wu@trendmicro.com> wrote:

> > -----Original Message-----
> > From: Peter Zijlstra <peterz@infradead.org>
> > Sent: Friday, August 28, 2020 12:13 AM
> > To: linux-kernel@vger.kernel.org; mhiramat@kernel.org
> > Cc: Eddy Wu (RD-TW) <Eddy_Wu@trendmicro.com>; x86@kernel.org; davem@davemloft.net; rostedt@goodmis.org;
> > naveen.n.rao@linux.ibm.com; anil.s.keshavamurthy@intel.com; linux-arch@vger.kernel.org; cameron@moodycamel.com;
> > oleg@redhat.com; will@kernel.org; paulmck@kernel.org; peterz@infradead.org
> > Subject: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
> >
> > @@ -1935,71 +1932,45 @@ unsigned long __kretprobe_trampoline_han
> >                                         unsigned long trampoline_address,
> >                                         void *frame_pointer)
> >  {
> > // ... removed
> > // NULL here
> > +       first = node = current->kretprobe_instances.first;
> > +       while (node) {
> > +               ri = container_of(node, struct kretprobe_instance, llist);
> >
> > -               orig_ret_address = (unsigned long)ri->ret_addr;
> > -               if (skipped)
> > -                       pr_warn("%ps must be blacklisted because of incorrect kretprobe order\n",
> > -                               ri->rp->kp.addr);
> > +               BUG_ON(ri->fp != frame_pointer);
> >
> > -               if (orig_ret_address != trampoline_address)
> > +               orig_ret_address = (unsigned long)ri->ret_addr;
> > +               if (orig_ret_address != trampoline_address) {
> >                         /*
> >                          * This is the real return address. Any other
> >                          * instances associated with this task are for
> >                          * other calls deeper on the call stack
> >                          */
> >                         break;
> > +               }
> > +
> > +               node = node->next;
> >         }
> >
> 
> Hi, I found a NULL pointer dereference here, where current->kretprobe_instances.first == NULL in these two scenario:

Thanks! That may be what I'm chasing.

> 
> 1) In task "rs:main Q:Reg"
> # insmod samples/kprobes/kretprobe_example.ko func=schedule
> # pkill sddm-greeter
> 
> 2) In task "llvmpipe-10"
> # insmod samples/kprobes/kretprobe_example.ko func=schedule
> login plasmashell session from sddm graphical interface

OK, the schedule function will be the key. I guess the scenario is:

1) kretprobe replaces the return address with kretprobe_trampoline on task1's kernel stack
2) task1 forks task2 before returning to the kretprobe_trampoline
3) while copying the process along with the kernel stack, task2->kretprobe_instances.first = NULL
4) task2 returns to the kretprobe_trampoline
5) Bomb!

Hmm, we need to fix up the kernel stack when copying the process.

Thank you,

> 
> based on Masami's v2 + Peter's lockless patch, I'll try the new branch once I can compile kernel
> 
> Stacktrace may not be really useful here:
> [  402.008630] BUG: kernel NULL pointer dereference, address: 0000000000000018
> [  402.008633] #PF: supervisor read access in kernel mode
> [  402.008642] #PF: error_code(0x0000) - not-present page
> [  402.008644] PGD 0 P4D 0
> [  402.008646] Oops: 0000 [#1] PREEMPT SMP PTI
> [  402.008649] CPU: 7 PID: 1505 Comm: llvmpipe-10 Kdump: loaded Not tainted 5.9.0-rc2-00111-g72091ec08f03-dirty #45
> [  402.008650] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
> [  402.008653] RIP: 0010:__kretprobe_trampoline_handler+0xb8/0x17f
> [  402.008655] Code: 65 4c 8b 34 25 80 6d 01 00 4c 89 e2 48 c7 c7 91 6b 85 91 49 8d b6 38 07 00 00 e8 d1 1a f9 ff 48 85 db 74 06 48 3b 5d d0 75 16 <49> 8b 75 18 48 c7 c7 a0 6c 85 91 48
>  8b 56 28 e8 b2 1a f9 ff 0f 0b
> [  402.008655] RSP: 0018:ffffab408147bde0 EFLAGS: 00010246
> [  402.008656] RAX: 0000000000000021 RBX: 0000000000000000 RCX: 0000000000000002
> [  402.008657] RDX: 0000000080000002 RSI: ffffffff9189757d RDI: 00000000ffffffff
> [  402.008658] RBP: ffffab408147be20 R08: 0000000000000001 R09: 000000000000955c
> [  402.008658] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000000
> [  402.008659] R13: 0000000000000000 R14: ffff90736d305f40 R15: 0000000000000000
> [  402.008661] FS:  00007f20f6ffd700(0000) GS:ffff9073781c0000(0000) knlGS:0000000000000000
> [  402.008675] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  402.008678] CR2: 0000000000000018 CR3: 00000001ed256006 CR4: 00000000003706e0
> [  402.008684] Call Trace:
> [  402.008689]  ? elfcorehdr_read+0x40/0x40
> [  402.008690]  ? elfcorehdr_read+0x40/0x40
> [  402.008691]  trampoline_handler+0x42/0x60
> [  402.008692]  kretprobe_trampoline+0x2a/0x50
> [  402.008693] RIP: 0010:kretprobe_trampoline+0x0/0x50
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 13:51     ` Masami Hiramatsu
@ 2020-08-28 13:58       ` peterz
  2020-08-28 14:19         ` Masami Hiramatsu
  2020-08-28 14:11       ` Eddy_Wu
  1 sibling, 1 reply; 38+ messages in thread
From: peterz @ 2020-08-28 13:58 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Eddy_Wu, linux-kernel, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, Aug 28, 2020 at 10:51:13PM +0900, Masami Hiramatsu wrote:
 
> OK, schedule function will be the key. I guess the senario is..
> 
> 1) kretporbe replace the return address with kretprobe_trampoline on task1's kernel stack
> 2) the task1 forks task2 before returning to the kretprobe_trampoline
> 3) while copying the process with the kernel stack, task2->kretprobe_instances.first = NULL
> 4) task2 returns to the kretprobe_trampoline
> 5) Bomb!
> 
> Hmm, we need to fixup the kernel stack when copying process. 

How would that scenario have been avoided in the old code? Because there
task2 would have a different hash and would not have found a kretprobe_instance
either.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 13:51     ` Masami Hiramatsu
  2020-08-28 13:58       ` peterz
@ 2020-08-28 14:11       ` Eddy_Wu
  2020-08-28 14:19         ` peterz
  1 sibling, 1 reply; 38+ messages in thread
From: Eddy_Wu @ 2020-08-28 14:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: peterz, linux-kernel, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

> From: Masami Hiramatsu <mhiramat@kernel.org>
>
> OK, schedule function will be the key. I guess the senario is..
>
> 1) kretporbe replace the return address with kretprobe_trampoline on task1's kernel stack
> 2) the task1 forks task2 before returning to the kretprobe_trampoline
> 3) while copying the process with the kernel stack, task2->kretprobe_instances.first = NULL

I think a new process created by fork/clone uses a brand-new kernel stack? I thought only the user stack is copied.
Otherwise any process launch would crash in the same way.

By the way, I can reproduce this on the latest branch (v4).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 14:11       ` Eddy_Wu
@ 2020-08-28 14:19         ` peterz
  2020-08-28 14:41           ` Masami Hiramatsu
  0 siblings, 1 reply; 38+ messages in thread
From: peterz @ 2020-08-28 14:19 UTC (permalink / raw)
  To: Eddy_Wu
  Cc: Masami Hiramatsu, linux-kernel, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, oleg,
	will, paulmck

On Fri, Aug 28, 2020 at 02:11:18PM +0000, Eddy_Wu@trendmicro.com wrote:
> > From: Masami Hiramatsu <mhiramat@kernel.org>
> >
> > OK, schedule function will be the key. I guess the senario is..
> >
> > 1) kretporbe replace the return address with kretprobe_trampoline on task1's kernel stack
> > 2) the task1 forks task2 before returning to the kretprobe_trampoline
> > 3) while copying the process with the kernel stack, task2->kretprobe_instances.first = NULL
> 
> I think new process created by fork/clone uses a brand new kernel
> stack? I thought only user stack are copied.  Otherwise any process
> launch should crash in the same way

I was under the same impression: we create a brand-new stack frame for
the new task, a 'fake' frame we can schedule into.

It either points to ret_from_fork() for new user tasks, or
kthread_frame_init() for kernel threads.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 13:58       ` peterz
@ 2020-08-28 14:19         ` Masami Hiramatsu
  0 siblings, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28 14:19 UTC (permalink / raw)
  To: peterz
  Cc: Eddy_Wu, linux-kernel, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, 28 Aug 2020 15:58:24 +0200
peterz@infradead.org wrote:

> On Fri, Aug 28, 2020 at 10:51:13PM +0900, Masami Hiramatsu wrote:
>  
> > OK, schedule function will be the key. I guess the senario is..
> > 
> > 1) kretporbe replace the return address with kretprobe_trampoline on task1's kernel stack
> > 2) the task1 forks task2 before returning to the kretprobe_trampoline
> > 3) while copying the process with the kernel stack, task2->kretprobe_instances.first = NULL
> > 4) task2 returns to the kretprobe_trampoline
> > 5) Bomb!
> > 
> > Hmm, we need to fixup the kernel stack when copying process. 
> 
> How would that scenario have been avoided in the old code? Because there
> task2 would have a different has and not have found a kretprobe_instance
> either.

Good question. I think this bug was not solved in the old code either.
Let me check.

Thanks,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 14:19         ` peterz
@ 2020-08-28 14:41           ` Masami Hiramatsu
  0 siblings, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28 14:41 UTC (permalink / raw)
  To: peterz
  Cc: Eddy_Wu, Masami Hiramatsu, linux-kernel, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, oleg,
	will, paulmck

On Fri, 28 Aug 2020 16:19:17 +0200
peterz@infradead.org wrote:

> On Fri, Aug 28, 2020 at 02:11:18PM +0000, Eddy_Wu@trendmicro.com wrote:
> > > From: Masami Hiramatsu <mhiramat@kernel.org>
> > >
> > > OK, schedule function will be the key. I guess the senario is..
> > >
> > > 1) kretporbe replace the return address with kretprobe_trampoline on task1's kernel stack
> > > 2) the task1 forks task2 before returning to the kretprobe_trampoline
> > > 3) while copying the process with the kernel stack, task2->kretprobe_instances.first = NULL
> > 
> > I think new process created by fork/clone uses a brand new kernel
> > stack? I thought only user stack are copied.  Otherwise any process
> > launch should crash in the same way
> 
> I was under the same impression, we create a brand new stack-frame for
> the new task, this 'fake' frame we can schedule into.
> 
> It either points to ret_from_fork() for new user tasks, or
> kthread_frame_init() for kernel threads.

Ah, sorry, then it was my misreading... anyway, I could reproduce the crash by
probing schedule(). Hmm, it would be better to dump the current comm with the
BUG().
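
For example, something along these lines in the trampoline handler
(just a sketch of the idea, not a posted patch):

	if (unlikely(!current->kretprobe_instances.first)) {
		pr_err("kretprobe: no instance found on return, comm=%s pid=%d\n",
		       current->comm, task_pid_nr(current));
		BUG();
	}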

Thank you,



-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
                     ` (2 preceding siblings ...)
  2020-08-28  4:03   ` Lai Jiangshan
@ 2020-08-28 14:46   ` Oleg Nesterov
  2020-08-28 15:29     ` peterz
  3 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2020-08-28 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mhiramat, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, will,
	paulmck

On 08/27, Peter Zijlstra wrote:
>
>  1 file changed, 129 insertions(+)

129 lines! And I spent more than 2 hours trying to understand these
129 lines ;) looks correct...

However, I still can't understand the usage of _acquire/release ops
in this code.

> +static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> +	/*
> +	 * Since the refcount is zero, and nobody can increase it once it's
> +	 * zero (except us, and we run only one copy of this method per node at
> +	 * a time, i.e. the single thread case), then we know we can safely
> +	 * change the next pointer of the node; however, once the refcount is
> +	 * back above zero, then other threads could increase it (happens under
> +	 * heavy contention, when the refcount goes to zero in between a load
> +	 * and a refcount increment of a node in try_get, then back up to
> +	 * something non-zero, then the refcount increment is done by the other
> +	 * thread) -- so if the CAS to add the node to the actual list fails,
> +	 * decrease the refcount and leave the add operation to the next thread
> +	 * who puts the refcount back to zero (which could be us, hence the
> +	 * loop).
> +	 */
> +	struct freelist_node *head = READ_ONCE(list->head);
> +
> +	for (;;) {
> +		WRITE_ONCE(node->next, head);
> +		atomic_set_release(&node->refs, 1);
> +
> +		if (!try_cmpxchg_release(&list->head, &head, node)) {

OK, these 2 _release above look understandable, they pair with
atomic_try_cmpxchg_acquire/try_cmpxchg_acquire in freelist_try_get().

> +			/*
> +			 * Hmm, the add failed, but we can only try again when
> +			 * the refcount goes back to zero.
> +			 */
> +			if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> +				continue;

Do we really need _release? Why can't atomic_fetch_add_relaxed() work?

> +static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
> +{
> +	struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
> +	unsigned int refs;
> +
> +	while (head) {
> +		prev = head;
> +		refs = atomic_read(&head->refs);
> +		if ((refs & REFS_MASK) == 0 ||
> +		    !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
> +			head = smp_load_acquire(&list->head);
> +			continue;
> +		}
> +
> +		/*
> +		 * Good, reference count has been incremented (it wasn't at
> +		 * zero), which means we can read the next and not worry about
> +		 * it changing between now and the time we do the CAS.
> +		 */
> +		next = READ_ONCE(head->next);
> +		if (try_cmpxchg_acquire(&list->head, &head, next)) {
> +			/*
> +			 * Yay, got the node. This means it was on the list,
> +			 * which means should-be-on-freelist must be false no
> +			 * matter the refcount (because nobody else knows it's
> +			 * been taken off yet, it can't have been put back on).
> +			 */
> +			WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> +
> +			/*
> +			 * Decrease refcount twice, once for our ref, and once
> +			 * for the list's ref.
> +			 */
> +			atomic_fetch_add(-2, &head->refs);

Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work?

> +		/*
> +		 * OK, the head must have changed on us, but we still need to decrement
> +		 * the refcount we increased.
> +		 */
> +		refs = atomic_fetch_add(-1, &prev->refs);

Cosmetic, but why not atomic_fetch_dec() ?

Oleg.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/7] kprobes: Remove kretprobe hash
  2020-08-28 13:11   ` Eddy_Wu
  2020-08-28 13:38     ` peterz
  2020-08-28 13:51     ` Masami Hiramatsu
@ 2020-08-28 14:49     ` Masami Hiramatsu
  2 siblings, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-28 14:49 UTC (permalink / raw)
  To: Eddy_Wu
  Cc: Peter Zijlstra, linux-kernel, x86, davem, rostedt, naveen.n.rao,
	anil.s.keshavamurthy, linux-arch, cameron, oleg, will, paulmck

On Fri, 28 Aug 2020 13:11:15 +0000
"Eddy_Wu@trendmicro.com" <Eddy_Wu@trendmicro.com> wrote:

> > -----Original Message----
> Hi, I found a NULL pointer dereference here, where current->kretprobe_instances.first == NULL in these two scenario:
> 
> 1) In task "rs:main Q:Reg"
> # insmod samples/kprobes/kretprobe_example.ko func=schedule
> # pkill sddm-greeter
> 
> 2) In task "llvmpipe-10"
> # insmod samples/kprobes/kretprobe_example.ko func=schedule
> login plasmashell session from sddm graphical interface
> 
> based on Masami's v2 + Peter's lockless patch, I'll try the new branch once I can compile kernel
> 
> Stacktrace may not be really useful here:
> [  402.008630] BUG: kernel NULL pointer dereference, address: 0000000000000018
> [  402.008633] #PF: supervisor read access in kernel mode
> [  402.008642] #PF: error_code(0x0000) - not-present page
> [  402.008644] PGD 0 P4D 0
> [  402.008646] Oops: 0000 [#1] PREEMPT SMP PTI
> [  402.008649] CPU: 7 PID: 1505 Comm: llvmpipe-10 Kdump: loaded Not tainted 5.9.0-rc2-00111-g72091ec08f03-dirty #45

Hmm, in this case llvmpipe will be a user task (not a kthread, I guess).

Here are some logs; both crashes happened with the following commands after waiting 5 minutes or so.

cd /sys/kernel/debug/tracing/
echo r:event1 vfs_read >> kprobe_events
echo r:event2 vfs_read %ax >> kprobe_events
echo r:event3 rw_verify_area %ax >> kprobe_events
echo r:schedule schedule >> kprobe_events
echo 1 > events/kprobes/enable


[  332.986337] ------------[ cut here ]------------
[  332.987312] kernel BUG at kernel/kprobes.c:1893!
[  332.988237] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  332.989108] CPU: 7 PID: 55 Comm: kcompactd0 Not tainted 5.9.0-rc2+ #54
[  332.990480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[  332.994600] RIP: 0010:__kretprobe_trampoline_handler+0xf2/0x100
[  332.995551] Code: 48 c7 05 e5 40 ec 7e c0 cc 28 82 4c 89 ff e8 c5 fe ff ff 48 85 db 75 92 48 83 c4 08 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 55 48 89 e5 41 56 41 55
[  332.998498] RSP: 0000:ffffc90000217cf8 EFLAGS: 00010246
[  332.999405] RAX: ffff88807cfe9700 RBX: 0000000000000000 RCX: 0000000000000000
[  333.000597] RDX: ffffc90000217de8 RSI: ffffffff810471e0 RDI: ffffc90000217d50
[  333.002058] RBP: ffffc90000217d28 R08: 0000000000000001 R09: 0000000000000001
[  333.003594] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90000217d50
[  333.005219] R13: ffff88807d7dbac0 R14: ffffc90000217e00 R15: ffff88807d7dbac0
[  333.006826] FS:  0000000000000000(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[  333.008787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  333.010249] CR2: 0000000000000000 CR3: 0000000002220000 CR4: 00000000000006a0
[  333.011895] Call Trace:
[  333.012529]  trampoline_handler+0x43/0x60
[  333.013214]  kretprobe_trampoline+0x2a/0x50
[  333.014028] RIP: 0010:kretprobe_trampoline+0x0/0x50
[  333.014856] Code: c7 e9 2d 04 82 e8 a0 f2 0d 00 5d c3 31 f6 e9 79 ff ff ff be 01 00 00 00 e9 6f ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc <54> 9c 48 83 ec 18 57 56 52 51 50 41 50 41 51 41 52 41 53 53 55 41
[  333.017750] RSP: 81170fba:ffffc90000217df0 EFLAGS: 00000246
[  333.018894] RAX: 0000000040200040 RBX: ffff88807d7dbac0 RCX: 0000000000000000
[  333.020232] RDX: 0000000000000001 RSI: ffffffff818e51b4 RDI: ffffffff818e51b4
[  333.021476] RBP: ffffc90000217e88 R08: 0000000000000001 R09: 0000000000000001
[  333.022603] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000100008044
[  333.024221] R13: ffff88807d7dbac0 R14: ffffc90000217e00 R15: ffff88807d7dbac0
[  333.025851]  ? schedule+0x54/0x100
[  333.026717]  ? schedule+0x54/0x100
[  333.027400]  ? trace_preempt_on+0x2a/0xd0
[  333.028161]  ? __next_timer_interrupt+0x110/0x110
[  333.029080]  kcompactd+0x20e/0x350
[  333.029882]  ? wait_woken+0x80/0x80
[  333.030593]  ? kcompactd_do_work+0x3a0/0x3a0
[  333.031347]  kthread+0x13c/0x180
[  333.031988]  ? kthread_park+0x90/0x90
[  333.032734]  ret_from_fork+0x22/0x30
[  333.033557] Modules linked in:
[  333.034451] ---[ end trace 901e8137e8d04982 ]---
[  333.035601] RIP: 0010:__kretprobe_trampoline_handler+0xf2/0x100
[  333.037073] Code: 48 c7 05 e5 40 ec 7e c0 cc 28 82 4c 89 ff e8 c5 fe ff ff 48 85 db 75 92 48 83 c4 08 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 55 48 89 e5 41 56 41 55
[  333.041089] RSP: 0000:ffffc90000217cf8 EFLAGS: 00010246
[  333.042201] RAX: ffff88807cfe9700 RBX: 0000000000000000 RCX: 0000000000000000
[  333.043747] RDX: ffffc90000217de8 RSI: ffffffff810471e0 RDI: ffffc90000217d50
[  333.045063] RBP: ffffc90000217d28 R08: 0000000000000001 R09: 0000000000000001
[  333.046547] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90000217d50
[  333.048055] R13: ffff88807d7dbac0 R14: ffffc90000217e00 R15: ffff88807d7dbac0
[  333.049616] FS:  0000000000000000(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[  333.051487] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  333.052737] CR2: 0000000000000000 CR3: 0000000002220000 CR4: 00000000000006a0
[  333.054127] Kernel panic - not syncing: Fatal exception
[  333.055450] Kernel Offset: disabled
[  333.056207] ---[ end Kernel panic - not syncing: Fatal exception ]---

Another one is here.

[  335.258721] ------------[ cut here ]------------
[  335.264413] kernel BUG at kernel/kprobes.c:1893!
[  335.267757] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  335.272090] CPU: 7 PID: 71 Comm: kworker/7:1 Not tainted 5.9.0-rc2+ #54
[  335.277787] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[  335.285971] Workqueue:  0x0 (mm_percpu_wq)
[  335.288156] RIP: 0010:__kretprobe_trampoline_handler+0xf2/0x100
[  335.295194] Code: 48 c7 05 e5 40 ec 7e c0 cc 28 82 4c 89 ff e8 c5 fe ff ff 48 85 db 75 92 48 83 c4 08 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 55 48 89 e5 41 56 41 55
[  335.300922] RSP: 0018:ffffc9000028fdb8 EFLAGS: 00010246
[  335.302336] RAX: ffff88807c4e9700 RBX: 0000000000000000 RCX: 0000000000000000
[  335.304154] RDX: ffffc9000028fea8 RSI: ffffffff810471e0 RDI: ffffc9000028fe10
[  335.305688] RBP: ffffc9000028fde8 R08: 0000000000000001 R09: 0000000000000001
[  335.307486] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000028fe10
[  335.309131] R13: ffff88807d7ea440 R14: ffffc900001cbd58 R15: ffff88807c4e4000
[  335.310472] FS:  0000000000000000(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[  335.312121] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  335.313261] CR2: 00000000005c0a56 CR3: 0000000002220000 CR4: 00000000000006a0
[  335.314561] Call Trace:
[  335.315089]  trampoline_handler+0x43/0x60
[  335.315844]  kretprobe_trampoline+0x2a/0x50
[  335.316774] RIP: 0010:kretprobe_trampoline+0x0/0x50
[  335.317651] Code: c7 e9 2d 04 82 e8 a0 f2 0d 00 5d c3 31 f6 e9 79 ff ff ff be 01 00 00 00 e9 6f ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc <54> 9c 48 83 ec 18 57 56 52 51 50 41 50 41 51 41 52 41 53 53 55 41
[  335.320480] RSP: 7c4e9700:ffffc9000028feb0 EFLAGS: 00000246
[  335.321410] RAX: ffff88807c4e4000 RBX: ffff88807d7ea440 RCX: 0000000000000000
[  335.322508] RDX: 0000000000000000 RSI: ffffffff818e51b4 RDI: ffff88807c4e9700
[  335.323611] RBP: ffffc9000028ff00 R08: 0000000000000001 R09: 0000000000000001
[  335.324699] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c4e4028
[  335.325903] R13: ffff88807d7ea440 R14: ffffc900001cbd58 R15: ffff88807c4e4000
[  335.327012]  ? schedule+0x54/0x100
[  335.327570]  ? process_one_work+0x5c0/0x5c0
[  335.328127]  kthread+0x13c/0x180
[  335.328583]  ? kthread_park+0x90/0x90
[  335.329063]  ret_from_fork+0x22/0x30
[  335.329558] Modules linked in:
[  335.329974] ---[ end trace bd6d1f4d3806b3de ]---
[  335.330562] RIP: 0010:__kretprobe_trampoline_handler+0xf2/0x100
[  335.331294] Code: 48 c7 05 e5 40 ec 7e c0 cc 28 82 4c 89 ff e8 c5 fe ff ff 48 85 db 75 92 48 83 c4 08 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 55 48 89 e5 41 56 41 55
[  335.333433] RSP: 0018:ffffc9000028fdb8 EFLAGS: 00010246
[  335.334091] RAX: ffff88807c4e9700 RBX: 0000000000000000 RCX: 0000000000000000
[  335.334959] RDX: ffffc9000028fea8 RSI: ffffffff810471e0 RDI: ffffc9000028fe10
[  335.335697] RBP: ffffc9000028fde8 R08: 0000000000000001 R09: 0000000000000001
[  335.336447] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc9000028fe10
[  335.337192] R13: ffff88807d7ea440 R14: ffffc900001cbd58 R15: ffff88807c4e4000
[  335.337956] FS:  0000000000000000(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
[  335.338917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  335.339618] CR2: 00000000005c0a56 CR3: 0000000002220000 CR4: 00000000000006a0
[  335.340373] Kernel panic - not syncing: Fatal exception
[  335.341086] Kernel Offset: disabled
[  335.341587] ---[ end Kernel panic - not syncing: Fatal exception ]---



-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-28 14:46   ` Oleg Nesterov
@ 2020-08-28 15:29     ` peterz
  2020-08-29  3:05       ` Cameron
  0 siblings, 1 reply; 38+ messages in thread
From: peterz @ 2020-08-28 15:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, mhiramat, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, cameron, will,
	paulmck

On Fri, Aug 28, 2020 at 04:46:52PM +0200, Oleg Nesterov wrote:
> On 08/27, Peter Zijlstra wrote:
> >
> >  1 file changed, 129 insertions(+)
> 
> 129 lines! And I spent more than 2 hours trying to understand these
> 129 lines ;) looks correct...

Yes, even though it already has a bunch of comments, I do feel we can
maybe improve on that a little.

For now I went for a 1:1 transliteration of the blog post though.

> > +			/*
> > +			 * Yay, got the node. This means it was on the list,
> > +			 * which means should-be-on-freelist must be false no
> > +			 * matter the refcount (because nobody else knows it's
> > +			 * been taken off yet, it can't have been put back on).
> > +			 */
> > +			WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> > +
> > +			/*
> > +			 * Decrease refcount twice, once for our ref, and once
> > +			 * for the list's ref.
> > +			 */
> > +			atomic_fetch_add(-2, &head->refs);
> 
> Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work?

I think we can, the original has std::memory_order_relaxed here. So I
should've used atomic_fetch_add_relaxed() but since we don't use the
return value, atomic_sub() would work just fine too.
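
Concretely, something like this (a sketch only; the field name follows
the quoted hunk, and only one of the variants would remain, of course):

        /*
         * Drop both our ref and the list's ref. The return value is
         * unused and the C++ original is memory_order_relaxed, so
         * atomic_fetch_add_relaxed(-2, &head->refs) or the plain sub
         * below would be equally fine here.
         */
        atomic_sub(2, &head->refs);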

> > +		/*
> > +		 * OK, the head must have changed on us, but we still need to decrement
> > +		 * the refcount we increased.
> > +		 */
> > +		refs = atomic_fetch_add(-1, &prev->refs);
> 
> Cosmetic, but why not atomic_fetch_dec() ?

The original had that; I didn't want to risk more bugs by 'improving'
things. But yes, that can definitely become dec().

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-28  9:18       ` peterz
  2020-08-28 10:44         ` Masami Hiramatsu
@ 2020-08-29  2:29         ` Cameron
  2020-08-29  2:31           ` Cameron
  1 sibling, 1 reply; 38+ messages in thread
From: Cameron @ 2020-08-29  2:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, linux-kernel, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, oleg, will,
	paulmck

On Fri, Aug 28, 2020 at 5:18 AM <peterz@infradead.org> wrote:
> So the freelist->refs thing is supposed to pin freelist->next for
> concurrent usage, but if we instantly stick it on the
> current->kretprobe_instances llist while it's still elevated, we'll
> overwrite ->next, which would be bad.

I thought about this some more, and actually, it should be safe.

The next pointer is not supposed to change while the node has a non-zero
refcount, true, but that only matters while the node is actually in the
freelist. If it's never added back to the freelist, then there is no
problem (the overwritten next may be read, but never used).

But even if it is later re-added to the freelist, as long as the refcount is
still above zero, the node won't actually be in the list right away, and
again its next won't matter. When the refcount finally goes down to zero,
it will be put on the list but its next will be blindly overwritten at
that point.
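
To spell that out against the node layout (sketch only; the names follow
the freelist patch, not necessarily the final code):

#include <linux/atomic.h>

struct freelist_node {
        atomic_t                refs;   /* refcount plus the REFS_ON_FREELIST flag */
        struct freelist_node    *next;  /* only meaningful while actually linked */
};

/*
 * After a successful try_get() the node is unlinked:
 *  - ->next is dead data, so reusing that storage (e.g. as the llist
 *    linkage) is harmless even while ->refs is still elevated;
 *  - a freelist_add() while ->refs is still non-zero only marks the
 *    node should-be-on-freelist, it does not link it;
 *  - whoever drops the last reference does the real add and rewrites
 *    ->next at that point anyway.
 */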

Cameron

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-29  2:29         ` Cameron
@ 2020-08-29  2:31           ` Cameron
  2020-08-29  9:15             ` Masami Hiramatsu
  0 siblings, 1 reply; 38+ messages in thread
From: Cameron @ 2020-08-29  2:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, linux-kernel, Eddy_Wu, x86, davem, rostedt,
	naveen.n.rao, anil.s.keshavamurthy, linux-arch, oleg, will,
	paulmck

On Fri, Aug 28, 2020 at 10:29 PM Cameron <cameron@moodycamel.com> wrote:
> I thought about this some more, and actually, it should be safe.

Although I should note that it's important that the flags/refcount are
not overwritten even after the node is taken off the freelist.

Cameron

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/7] freelist: Lock less freelist
  2020-08-28 15:29     ` peterz
@ 2020-08-29  3:05       ` Cameron
  0 siblings, 0 replies; 38+ messages in thread
From: Cameron @ 2020-08-29  3:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, linux-kernel, Masami Hiramatsu, Eddy_Wu, x86,
	davem, rostedt, naveen.n.rao, anil.s.keshavamurthy, linux-arch,
	will, paulmck

> I'm curious whether it is correct to just set prev->refs to zero and return
> @prev? That would remove an unneeded "add()&get()" pair (although in an
> unlikely branch) and let __freelist_add() be folded into freelist_add()
> for tidier code.

That is a very good question. I believe it would be correct, yes, and
would certainly simplify the code. The reference count is zero, so
nobody can increment it again, and REFS_ON_FREELIST is set (the
should-be-on-freelist flag), so instead of adding it to the freelist
to be removed later, it can be returned directly.
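
Roughly, the tail of the try_get() loop could then become something like
this (sketch only, modulo memory-ordering details):

        /*
         * The head changed under us; drop the reference we took.
         */
        refs = atomic_fetch_add(-1, &prev->refs);
        if (refs == REFS_ON_FREELIST + 1) {
                /*
                 * We just dropped the last reference and the node is only
                 * pending insertion (flag set, not linked). Nobody can bump
                 * a zero refcount, so instead of re-adding the node only to
                 * take it off again, hand it straight to the caller.
                 */
                atomic_set(&prev->refs, 0);
                return prev;
        }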

> On Fri, Aug 28, 2020 at 04:46:52PM +0200, Oleg Nesterov wrote:
> > 129 lines! And I spent more than 2 hours trying to understand these
> > 129 lines ;) looks correct...

Hahaha. Sorry about that. Some of the most mind-bending code I've ever
written. :-)

> > +                     /*
> > +                      * Hmm, the add failed, but we can only try again when
> > +                      * the refcount goes back to zero.
> > +                      */
> > +                     if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> > +                             continue;
> Do we really need _release ? Why can't atomic_fetch_add_relaxed() work?

The release is to synchronize with the acquire in the compare-exchange
of try_get. However, I think you're right: between the previous
release-write to the refs and that point, there are no memory effects
that need to be synchronized, so it could safely be replaced with a
relaxed operation.
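
If that reasoning holds, the quoted hunk would simply become (sketch):

        /*
         * The add failed; we can only retry once the refcount drops
         * back to zero. No extra ordering is needed here if the earlier
         * release-write to refs already covers it.
         */
        if (atomic_fetch_add_relaxed(REFS_ON_FREELIST - 1, &node->refs) == 1)
                continue;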

> Cosmetic, but why not atomic_fetch_dec() ?

This is just one of my idiosyncrasies. I prefer exclusively using
atomic add; I find it easier to read. Feel free to change them all to
subs!

Cameron


> >
> > Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work?
>
> I think we can, the original has std::memory_order_relaxed here. So I
> should've used atomic_fetch_add_relaxed() but since we don't use the
> return value, atomic_sub() would work just fine too.
>
> > > +           /*
> > > +            * OK, the head must have changed on us, but we still need to decrement
> > > +            * the refcount we increased.
> > > +            */
> > > +           refs = atomic_fetch_add(-1, &prev->refs);
> >
> > Cosmetic, but why not atomic_fetch_dec() ?
>
> The original had that; I didn't want to risk more bugs by 'improving'
> things. But yes, that can definitely become dec().

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist
  2020-08-29  2:31           ` Cameron
@ 2020-08-29  9:15             ` Masami Hiramatsu
  0 siblings, 0 replies; 38+ messages in thread
From: Masami Hiramatsu @ 2020-08-29  9:15 UTC (permalink / raw)
  To: Cameron
  Cc: Peter Zijlstra, Masami Hiramatsu, linux-kernel, Eddy_Wu, x86,
	davem, rostedt, naveen.n.rao, anil.s.keshavamurthy, linux-arch,
	oleg, will, paulmck

On Fri, 28 Aug 2020 22:31:17 -0400
Cameron <cameron@moodycamel.com> wrote:

> On Fri, Aug 28, 2020 at 10:29 PM Cameron <cameron@moodycamel.com> wrote:
> > I thought about this some more, and actually, it should be safe.
> 
> Although I should note that it's important that the flags/refcount are
> not overwritten
> even after the node is taken off the freelist.

Thanks for the check, but I would personally like to keep them separated,
because the memory layouts of those in-kernel list-like data structures
can be changed by debug config options...
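
Something along these lines, i.e. two distinct node members rather than
overlapping storage, so that a debug option changing either layout cannot
make the llist usage clobber the freelist refcount (illustrative only;
the member names here are mine, not the series'):

#include <linux/llist.h>
#include <linux/freelist.h>     /* header added by the freelist patch */

struct kretprobe_instance {
        /* recycled through the kretprobe's freelist */
        struct freelist_node    freelist;
        /* chained on current->kretprobe_instances */
        struct llist_node       llist;
        /* ... remaining members as in the series ... */
};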

Thank you,


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2020-08-29  9:15 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-27 16:12 [RFC][PATCH 0/7] kprobes: Make kretprobes lockless Peter Zijlstra
2020-08-27 16:12 ` [RFC][PATCH 1/7] llist: Add nonatomic __llist_add() Peter Zijlstra
2020-08-27 16:12 ` [RFC][PATCH 2/7] sched: Fix try_invoke_on_locked_down_task() semantics Peter Zijlstra
2020-08-27 16:12 ` [RFC][PATCH 3/7] kprobes: Remove kretprobe hash Peter Zijlstra
2020-08-27 18:00   ` Masami Hiramatsu
2020-08-28  8:44     ` peterz
2020-08-28  9:07     ` Masami Hiramatsu
2020-08-28  4:44   ` Masami Hiramatsu
2020-08-28 13:11   ` Eddy_Wu
2020-08-28 13:38     ` peterz
2020-08-28 13:51     ` Masami Hiramatsu
2020-08-28 13:58       ` peterz
2020-08-28 14:19         ` Masami Hiramatsu
2020-08-28 14:11       ` Eddy_Wu
2020-08-28 14:19         ` peterz
2020-08-28 14:41           ` Masami Hiramatsu
2020-08-28 14:49     ` Masami Hiramatsu
2020-08-27 16:12 ` [RFC][PATCH 4/7] kprobe: Dont kfree() from breakpoint context Peter Zijlstra
2020-08-27 16:12 ` [RFC][PATCH 5/7] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
2020-08-27 16:12 ` [RFC][PATCH 6/7] freelist: Lock less freelist Peter Zijlstra
2020-08-27 16:37   ` peterz
     [not found]     ` <CAFCw3doX6KK5DwpG_OB331Mdw8uYeVqn8YPTjKh_a-m7ZB9+3A@mail.gmail.com>
2020-08-27 16:56       ` peterz
2020-08-27 17:00         ` Cameron
2020-08-27 19:08   ` Boqun Feng
2020-08-27 19:57     ` Cameron
2020-08-28  1:34       ` Boqun Feng
2020-08-28  4:03   ` Lai Jiangshan
2020-08-28 14:46   ` Oleg Nesterov
2020-08-28 15:29     ` peterz
2020-08-29  3:05       ` Cameron
2020-08-27 16:12 ` [RFC][PATCH 7/7] kprobes: Replace rp->free_instance with freelist Peter Zijlstra
2020-08-28  8:48   ` peterz
2020-08-28  9:13     ` Masami Hiramatsu
2020-08-28  9:18       ` peterz
2020-08-28 10:44         ` Masami Hiramatsu
2020-08-29  2:29         ` Cameron
2020-08-29  2:31           ` Cameron
2020-08-29  9:15             ` Masami Hiramatsu
