live-patching.vger.kernel.org archive mirror
* [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL
@ 2021-09-29 15:17 Peter Zijlstra
  2021-09-29 15:17 ` [PATCH v2 01/11] sched: Improve try_invoke_on_locked_down_task() Peter Zijlstra
                   ` (11 more replies)
  0 siblings, 12 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Hi,

One series in two parts; the first part (patches 1-5) should be in reasonable
shape and ought to fix the original issues that got this whole thing started.

The second part (patches 6-11) are still marked RFC and do scary things to try
and make NOHZ_FULL work better.

Please consider..

PS. Paul, I did break RCU a bit and I'm not sure about the best way to put
it back together; please see patch 8 for details.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 01/11] sched: Improve try_invoke_on_locked_down_task()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-09-29 15:17 ` [PATCH v2 02/11] sched,rcu: Rework try_invoke_on_locked_down_task() Peter Zijlstra
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Clarify and tighten try_invoke_on_locked_down_task().

Basically the function calls @func under task_rq_lock(), except it
avoids taking rq->lock when possible.

This makes calling @func unconditional (the function will get renamed
in a later patch to remove the try).
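To illustrate the new shape of the function, here is a minimal userspace sketch (not kernel code; the toy_* names and pthread mutexes are stand-ins for p->pi_lock / rq->lock, and the kernel's smp_rmb() ordering between ->__state and ->on_rq is elided): pi_lock alone pins a blocked task, rq_lock is taken only when the task is running or queued, and @func is then called unconditionally with its return value passed through.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum toy_state { TOY_BLOCKED, TOY_RUNNING };

struct toy_task {
	pthread_mutex_t pi_lock;	/* holds off wakeups */
	pthread_mutex_t rq_lock;	/* holds off enqueue/schedule */
	enum toy_state state;
	int on_rq;
};

typedef bool (*toy_func_t)(struct toy_task *t, void *arg);

/* Pin @p in its current state, then call @func unconditionally. */
static bool toy_call_on_pinned(struct toy_task *p, toy_func_t func, void *arg)
{
	bool locked_rq = false;
	bool ret;

	pthread_mutex_lock(&p->pi_lock);

	/* Only a running/queued task needs the rq lock; a blocked task
	 * is already pinned by pi_lock alone. */
	if (p->state == TOY_RUNNING || p->on_rq) {
		pthread_mutex_lock(&p->rq_lock);
		locked_rq = true;
	}

	/* Unlike the old try_invoke_...(), @func always runs here. */
	ret = func(p, arg);

	if (locked_rq)
		pthread_mutex_unlock(&p->rq_lock);
	pthread_mutex_unlock(&p->pi_lock);
	return ret;
}

/* Sample callback: record that we were invoked. */
static bool toy_mark_seen(struct toy_task *t, void *arg)
{
	(void)t;
	*(int *)arg = 1;
	return true;
}
```

The key behavioral change this models: the return value is whatever the callback returns, no longer "did we manage to lock the task".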

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   63 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 24 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4111,41 +4111,56 @@ try_to_wake_up(struct task_struct *p, un
  * @func: Function to invoke.
  * @arg: Argument to function.
  *
- * If the specified task can be quickly locked into a definite state
- * (either sleeping or on a given runqueue), arrange to keep it in that
- * state while invoking @func(@arg).  This function can use ->on_rq and
- * task_curr() to work out what the state is, if required.  Given that
- * @func can be invoked with a runqueue lock held, it had better be quite
- * lightweight.
+ * Fix the task in its current state by avoiding wakeups and/or rq operations
+ * and call @func(@arg) on it.  This function can use ->on_rq and task_curr()
+ * to work out what the state is, if required.  Given that @func can be invoked
+ * with a runqueue lock held, it had better be quite lightweight.
  *
  * Returns:
- *	@false if the task slipped out from under the locks.
- *	@true if the task was locked onto a runqueue or is sleeping.
- *		However, @func can override this by returning @false.
+ *   Whatever @func returns
  */
 bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
 {
+	struct rq *rq = NULL;
+	unsigned int state;
 	struct rq_flags rf;
 	bool ret = false;
-	struct rq *rq;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
-	if (p->on_rq) {
+
+	state = READ_ONCE(p->__state);
+
+	/*
+	 * Ensure we load p->on_rq after p->__state, otherwise it would be
+	 * possible to, falsely, observe p->on_rq == 0.
+	 *
+	 * See try_to_wake_up() for a longer comment.
+	 */
+	smp_rmb();
+
+	/*
+	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
+	 * the task is blocked. Make sure to check @state since ttwu() can drop
+	 * locks at the end, see ttwu_queue_wakelist().
+	 */
+	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
 		rq = __task_rq_lock(p, &rf);
-		if (task_rq(p) == rq)
-			ret = func(p, arg);
+
+	/*
+	 * At this point the task is pinned; either:
+	 *  - blocked and we're holding off wakeups	 (pi->lock)
+	 *  - woken, and we're holding off enqueue	 (rq->lock)
+	 *  - queued, and we're holding off schedule	 (rq->lock)
+	 *  - running, and we're holding off de-schedule (rq->lock)
+	 *
+	 * The called function (@func) can use: task_curr(), p->on_rq and
+	 * p->__state to differentiate between these states.
+	 */
+	ret = func(p, arg);
+
+	if (rq)
 		rq_unlock(rq, &rf);
-	} else {
-		switch (READ_ONCE(p->__state)) {
-		case TASK_RUNNING:
-		case TASK_WAKING:
-			break;
-		default:
-			smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
-			if (!p->on_rq)
-				ret = func(p, arg);
-		}
-	}
+
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 	return ret;
 }




* [PATCH v2 02/11] sched,rcu: Rework try_invoke_on_locked_down_task()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
  2021-09-29 15:17 ` [PATCH v2 01/11] sched: Improve try_invoke_on_locked_down_task() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-09-29 15:17 ` [PATCH v2 03/11] sched,livepatch: Use task_call_func() Peter Zijlstra
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Give try_invoke_on_locked_down_task() a saner name and have it return
an int so that the caller can distinguish between different reasons
for failure.
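A quick sketch of what the int return buys over the old bool (hypothetical toy_* names, not the kernel functions): the callback can report *why* it declined, and the caller can act on each errno separately instead of collapsing everything into true/false.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Toy stand-in for a trc_inspect_reader()-style callback. */
static int toy_inspect_reader(int running, int stack_reliable)
{
	if (running)
		return -EBUSY;		/* decline: task is on a CPU */
	if (!stack_reliable)
		return -EINVAL;		/* decline: can't trust its state */
	return 0;			/* inspected successfully */
}

/* A caller can now map each failure mode to a distinct action/message. */
static const char *toy_failure_reason(int ret)
{
	switch (ret) {
	case 0:		return "ok";
	case -EBUSY:	return "task is running";
	case -EINVAL:	return "unreliable state";
	default:	return "unknown error";
	}
}
```

Note the calling-convention flip in the series: success is now 0, so existing `if (try_invoke_...(...))` sites become `if (!task_call_func(...))`.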

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/wait.h    |    3 ++-
 kernel/rcu/tasks.h      |   12 ++++++------
 kernel/rcu/tree_stall.h |    8 ++++----
 kernel/sched/core.c     |    6 +++---
 4 files changed, 15 insertions(+), 14 deletions(-)

--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -1160,6 +1160,7 @@ int autoremove_wake_function(struct wait
 		(wait)->flags = 0;						\
 	} while (0)
 
-bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg);
+typedef int (*task_call_f)(struct task_struct *p, void *arg);
+extern int task_call_func(struct task_struct *p, task_call_f func, void *arg);
 
 #endif /* _LINUX_WAIT_H */
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -928,7 +928,7 @@ static void trc_read_check_handler(void
 }
 
 /* Callback function for scheduler to check locked-down task.  */
-static bool trc_inspect_reader(struct task_struct *t, void *arg)
+static int trc_inspect_reader(struct task_struct *t, void *arg)
 {
 	int cpu = task_cpu(t);
 	bool in_qs = false;
@@ -939,7 +939,7 @@ static bool trc_inspect_reader(struct ta
 
 		// If no chance of heavyweight readers, do it the hard way.
 		if (!ofl && !IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
-			return false;
+			return -EINVAL;
 
 		// If heavyweight readers are enabled on the remote task,
 		// we can inspect its state despite it currently running.
@@ -947,7 +947,7 @@ static bool trc_inspect_reader(struct ta
 		n_heavy_reader_attempts++;
 		if (!ofl && // Check for "running" idle tasks on offline CPUs.
 		    !rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
-			return false; // No quiescent state, do it the hard way.
+			return -EINVAL; // No quiescent state, do it the hard way.
 		n_heavy_reader_updates++;
 		if (ofl)
 			n_heavy_reader_ofl_updates++;
@@ -962,7 +962,7 @@ static bool trc_inspect_reader(struct ta
 	t->trc_reader_checked = true;
 
 	if (in_qs)
-		return true;  // Already in quiescent state, done!!!
+		return 0;  // Already in quiescent state, done!!!
 
 	// The task is in a read-side critical section, so set up its
 	// state so that it will awaken the grace-period kthread upon exit
@@ -970,7 +970,7 @@ static bool trc_inspect_reader(struct ta
 	atomic_inc(&trc_n_readers_need_end); // One more to wait on.
 	WARN_ON_ONCE(READ_ONCE(t->trc_reader_special.b.need_qs));
 	WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
-	return true;
+	return 0;
 }
 
 /* Attempt to extract the state for the specified task. */
@@ -992,7 +992,7 @@ static void trc_wait_for_one_reader(stru
 
 	// Attempt to nail down the task for inspection.
 	get_task_struct(t);
-	if (try_invoke_on_locked_down_task(t, trc_inspect_reader, NULL)) {
+	if (!task_call_func(t, trc_inspect_reader, NULL)) {
 		put_task_struct(t);
 		return;
 	}
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -240,16 +240,16 @@ struct rcu_stall_chk_rdr {
  * Report out the state of a not-running task that is stalling the
  * current RCU grace period.
  */
-static bool check_slow_task(struct task_struct *t, void *arg)
+static int check_slow_task(struct task_struct *t, void *arg)
 {
 	struct rcu_stall_chk_rdr *rscrp = arg;
 
 	if (task_curr(t))
-		return false; // It is running, so decline to inspect it.
+		return -EBUSY; // It is running, so decline to inspect it.
 	rscrp->nesting = t->rcu_read_lock_nesting;
 	rscrp->rs = t->rcu_read_unlock_special;
 	rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
-	return true;
+	return 0;
 }
 
 /*
@@ -283,7 +283,7 @@ static int rcu_print_task_stall(struct r
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	while (i) {
 		t = ts[--i];
-		if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
+		if (task_call_func(t, check_slow_task, &rscr))
 			pr_cont(" P%d", t->pid);
 		else
 			pr_cont(" P%d/%d:%c%c%c%c",
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4106,7 +4106,7 @@ try_to_wake_up(struct task_struct *p, un
 }
 
 /**
- * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
+ * task_call_func - Invoke a function on task in fixed state
  * @p: Process for which the function is to be invoked, can be @current.
  * @func: Function to invoke.
  * @arg: Argument to function.
@@ -4119,12 +4119,12 @@ try_to_wake_up(struct task_struct *p, un
  * Returns:
  *   Whatever @func returns
  */
-bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
+int task_call_func(struct task_struct *p, task_call_f func, void *arg)
 {
 	struct rq *rq = NULL;
 	unsigned int state;
 	struct rq_flags rf;
-	bool ret = false;
+	int ret;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 




* [PATCH v2 03/11] sched,livepatch: Use task_call_func()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
  2021-09-29 15:17 ` [PATCH v2 01/11] sched: Improve try_invoke_on_locked_down_task() Peter Zijlstra
  2021-09-29 15:17 ` [PATCH v2 02/11] sched,rcu: Rework try_invoke_on_locked_down_task() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-10-05 11:40   ` Petr Mladek
  2021-10-06  8:59   ` Miroslav Benes
  2021-09-29 15:17 ` [PATCH v2 04/11] sched: Simplify wake_up_*idle*() Peter Zijlstra
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Instead of frobbing around with scheduler internals, use the shiny new
task_call_func() interface.
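The pattern this patch relies on, sketched below with hypothetical toy_* names: the task_call_f callback takes a `void *arg`, which doubles as an out-parameter, so the offending function name can be handed back to the caller for the pr_debug() instead of formatting into a shared buffer under the rq lock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for klp_check_and_switch_task(): @arg points at a
 * const char * that receives the to-be-patched function's name on
 * -EADDRINUSE, mirroring how the real callback fills *oldname. */
static int toy_check_and_switch(int curr, int sleeping_in_old_func, void *arg)
{
	const char **old_name = arg;

	if (curr)
		return -EBUSY;			/* task is running */
	if (sleeping_in_old_func) {
		*old_name = "toy_old_func";	/* report the offender */
		return -EADDRINUSE;
	}
	/* ...clear TIF_PATCH_PENDING and set ->patch_state here... */
	return 0;
}
```

The caller then switch()es on the return code and only touches *old_name in the -EADDRINUSE case, exactly as klp_try_switch_task() does after this patch.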

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/livepatch/transition.c |   90 ++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 46 deletions(-)

--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -13,7 +13,6 @@
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
-#include "../sched/sched.h"
 
 #define MAX_STACK_ENTRIES  100
 #define STACK_ERR_BUF_SIZE 128
@@ -240,7 +239,7 @@ static int klp_check_stack_func(struct k
  * Determine whether it's safe to transition the task to the target patch state
  * by looking for any to-be-patched or to-be-unpatched functions on its stack.
  */
-static int klp_check_stack(struct task_struct *task, char *err_buf)
+static int klp_check_stack(struct task_struct *task, const char **oldname)
 {
 	static unsigned long entries[MAX_STACK_ENTRIES];
 	struct klp_object *obj;
@@ -248,12 +247,8 @@ static int klp_check_stack(struct task_s
 	int ret, nr_entries;
 
 	ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
-	if (ret < 0) {
-		snprintf(err_buf, STACK_ERR_BUF_SIZE,
-			 "%s: %s:%d has an unreliable stack\n",
-			 __func__, task->comm, task->pid);
-		return ret;
-	}
+	if (ret < 0)
+		return -EINVAL;
 	nr_entries = ret;
 
 	klp_for_each_object(klp_transition_patch, obj) {
@@ -262,11 +257,8 @@ static int klp_check_stack(struct task_s
 		klp_for_each_func(obj, func) {
 			ret = klp_check_stack_func(func, entries, nr_entries);
 			if (ret) {
-				snprintf(err_buf, STACK_ERR_BUF_SIZE,
-					 "%s: %s:%d is sleeping on function %s\n",
-					 __func__, task->comm, task->pid,
-					 func->old_name);
-				return ret;
+				*oldname = func->old_name;
+				return -EADDRINUSE;
 			}
 		}
 	}
@@ -274,6 +266,22 @@ static int klp_check_stack(struct task_s
 	return 0;
 }
 
+static int klp_check_and_switch_task(struct task_struct *task, void *arg)
+{
+	int ret;
+
+	if (task_curr(task))
+		return -EBUSY;
+
+	ret = klp_check_stack(task, arg);
+	if (ret)
+		return ret;
+
+	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+	task->patch_state = klp_target_state;
+	return 0;
+}
+
 /*
  * Try to safely switch a task to the target patch state.  If it's currently
  * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
@@ -281,13 +289,8 @@ static int klp_check_stack(struct task_s
  */
 static bool klp_try_switch_task(struct task_struct *task)
 {
-	static char err_buf[STACK_ERR_BUF_SIZE];
-	struct rq *rq;
-	struct rq_flags flags;
+	const char *old_name;
 	int ret;
-	bool success = false;
-
-	err_buf[0] = '\0';
 
 	/* check if this task has already switched over */
 	if (task->patch_state == klp_target_state)
@@ -305,36 +308,31 @@ static bool klp_try_switch_task(struct t
 	 * functions.  If all goes well, switch the task to the target patch
 	 * state.
 	 */
-	rq = task_rq_lock(task, &flags);
+	ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+	switch (ret) {
+	case 0:		/* success */
+		break;
 
-	if (task_running(rq, task) && task != current) {
-		snprintf(err_buf, STACK_ERR_BUF_SIZE,
-			 "%s: %s:%d is running\n", __func__, task->comm,
-			 task->pid);
-		goto done;
+	case -EBUSY:	/* klp_check_and_switch_task() */
+		pr_debug("%s: %s:%d is running\n",
+			 __func__, task->comm, task->pid);
+		break;
+	case -EINVAL:	/* klp_check_and_switch_task() */
+		pr_debug("%s: %s:%d has an unreliable stack\n",
+			 __func__, task->comm, task->pid);
+		break;
+	case -EADDRINUSE: /* klp_check_and_switch_task() */
+		pr_debug("%s: %s:%d is sleeping on function %s\n",
+			 __func__, task->comm, task->pid, old_name);
+		break;
+
+	default:
+		pr_debug("%s: Unknown error code (%d) when trying to switch %s:%d\n",
+			 __func__, ret, task->comm, task->pid);
+		break;
 	}
 
-	ret = klp_check_stack(task, err_buf);
-	if (ret)
-		goto done;
-
-	success = true;
-
-	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
-	task->patch_state = klp_target_state;
-
-done:
-	task_rq_unlock(rq, task, &flags);
-
-	/*
-	 * Due to console deadlock issues, pr_debug() can't be used while
-	 * holding the task rq lock.  Instead we have to use a temporary buffer
-	 * and print the debug message after releasing the lock.
-	 */
-	if (err_buf[0] != '\0')
-		pr_debug("%s", err_buf);
-
-	return success;
+	return !ret;
 }
 
 /*




* [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (2 preceding siblings ...)
  2021-09-29 15:17 ` [PATCH v2 03/11] sched,livepatch: Use task_call_func() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-10-13 14:32   ` Qian Cai
       [not found]   ` <CGME20211022134630eucas1p2e79e2816587d182c580459d567c1f2a9@eucas1p2.samsung.com>
  2021-09-29 15:17 ` [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle() Peter Zijlstra
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Simplify and make wake_up_if_idle() more robust; also, don't iterate
the whole machine with preempt_disable() in its caller,
wake_up_all_idle_cpus().

This prepares for another wake_up_if_idle() user that needs a full
do_idle() cycle.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   14 +++++---------
 kernel/smp.c        |    6 +++---
 2 files changed, 8 insertions(+), 12 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3691,15 +3691,11 @@ void wake_up_if_idle(int cpu)
 	if (!is_idle_task(rcu_dereference(rq->curr)))
 		goto out;
 
-	if (set_nr_if_polling(rq->idle)) {
-		trace_sched_wake_idle_without_ipi(cpu);
-	} else {
-		rq_lock_irqsave(rq, &rf);
-		if (is_idle_task(rq->curr))
-			smp_send_reschedule(cpu);
-		/* Else CPU is not idle, do nothing here: */
-		rq_unlock_irqrestore(rq, &rf);
-	}
+	rq_lock_irqsave(rq, &rf);
+	if (is_idle_task(rq->curr))
+		resched_curr(rq);
+	/* Else CPU is not idle, do nothing here: */
+	rq_unlock_irqrestore(rq, &rf);
 
 out:
 	rcu_read_unlock();
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
 {
 	int cpu;
 
-	preempt_disable();
+	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		if (cpu == smp_processor_id())
+		if (cpu == raw_smp_processor_id())
 			continue;
 
 		wake_up_if_idle(cpu);
 	}
-	preempt_enable();
+	cpus_read_unlock();
 }
 EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
 




* [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (3 preceding siblings ...)
  2021-09-29 15:17 ` [PATCH v2 04/11] sched: Simplify wake_up_*idle*() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-10-05 12:00   ` Petr Mladek
                     ` (2 more replies)
  2021-09-29 15:17 ` [RFC][PATCH v2 06/11] context_tracking: Prefix user_{enter,exit}*() Peter Zijlstra
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Make sure to prod idle CPUs so they call klp_update_patch_state().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/livepatch/transition.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
 	for_each_possible_cpu(cpu) {
 		task = idle_task(cpu);
 		if (cpu_online(cpu)) {
-			if (!klp_try_switch_task(task))
+			if (!klp_try_switch_task(task)) {
 				complete = false;
+				/* Make idle task go through the main loop. */
+				wake_up_if_idle(cpu);
+			}
 		} else if (task->patch_state != klp_target_state) {
 			/* offline idle tasks can be switched immediately */
 			clear_tsk_thread_flag(task, TIF_PATCH_PENDING);




* [RFC][PATCH v2 06/11] context_tracking: Prefix user_{enter,exit}*()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (4 preceding siblings ...)
  2021-09-29 15:17 ` [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-09-29 15:17 ` [RFC][PATCH v2 07/11] context_tracking: Add an atomic sequence/state count Peter Zijlstra
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Put a ct_ prefix on a bunch of context_tracking functions for better
namespacing / greppability.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/Kconfig                            |    6 +++---
 arch/arm64/kernel/entry-common.c        |    4 ++--
 arch/mips/kernel/ptrace.c               |    6 +++---
 arch/mips/kernel/signal.c               |    4 ++--
 arch/powerpc/include/asm/interrupt.h    |    2 +-
 arch/powerpc/kernel/interrupt.c         |   10 +++++-----
 arch/sparc/kernel/ptrace_64.c           |    6 +++---
 arch/sparc/kernel/signal_64.c           |    4 ++--
 include/linux/context_tracking.h        |   16 ++++++++--------
 include/linux/context_tracking_state.h  |    2 +-
 include/trace/events/context_tracking.h |    8 ++++----
 kernel/context_tracking.c               |    6 +++---
 kernel/entry/common.c                   |    4 ++--
 kernel/livepatch/transition.c           |    2 +-
 kernel/sched/core.c                     |    4 ++--
 kernel/trace/ftrace.c                   |    4 ++--
 16 files changed, 44 insertions(+), 44 deletions(-)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -760,7 +760,7 @@ config HAVE_CONTEXT_TRACKING
 	help
 	  Provide kernel/user boundaries probes necessary for subsystems
 	  that need it, such as userspace RCU extended quiescent state.
-	  Syscalls need to be wrapped inside user_exit()-user_enter(), either
+	  Syscalls need to be wrapped inside ct_user_exit()-ct_user_enter(), either
 	  optimized behind static key or through the slow path using TIF_NOHZ
 	  flag. Exceptions handlers must be wrapped as well. Irqs are already
 	  protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal
@@ -774,7 +774,7 @@ config HAVE_CONTEXT_TRACKING_OFFSTACK
 	  preempt_schedule_irq() can't be called in a preemptible section
 	  while context tracking is CONTEXT_USER. This feature reflects a sane
 	  entry implementation where the following requirements are met on
-	  critical entry code, ie: before user_exit() or after user_enter():
+	  critical entry code, ie: before ct_user_exit() or after ct_user_enter():
 
 	  - Critical entry code isn't preemptible (or better yet:
 	    not interruptible).
@@ -787,7 +787,7 @@ config HAVE_TIF_NOHZ
 	bool
 	help
 	  Arch relies on TIF_NOHZ and syscall slow path to implement context
-	  tracking calls to user_enter()/user_exit().
+	  tracking calls to ct_user_enter()/ct_user_exit().
 
 config HAVE_VIRT_CPU_ACCOUNTING
 	bool
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -100,7 +100,7 @@ static __always_inline void __enter_from
 {
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	CT_WARN_ON(ct_state() != CONTEXT_USER);
-	user_exit_irqoff();
+	ct_user_exit_irqoff();
 	trace_hardirqs_off_finish();
 }
 
@@ -118,7 +118,7 @@ static __always_inline void __exit_to_us
 {
 	trace_hardirqs_on_prepare();
 	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	user_enter_irqoff();
+	ct_user_enter_irqoff();
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -1312,7 +1312,7 @@ long arch_ptrace(struct task_struct *chi
  */
 asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall)
 {
-	user_exit();
+	ct_user_exit();
 
 	current_thread_info()->syscall = syscall;
 
@@ -1368,7 +1368,7 @@ asmlinkage void syscall_trace_leave(stru
 	 * or do_notify_resume(), in which case we can be in RCU
 	 * user mode.
 	 */
-	user_exit();
+	ct_user_exit();
 
 	audit_syscall_exit(regs);
 
@@ -1378,5 +1378,5 @@ asmlinkage void syscall_trace_leave(stru
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, 0);
 
-	user_enter();
+	ct_user_enter();
 }
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -897,7 +897,7 @@ asmlinkage void do_notify_resume(struct
 {
 	local_irq_enable();
 
-	user_exit();
+	ct_user_exit();
 
 	if (thread_info_flags & _TIF_UPROBE)
 		uprobe_notify_resume(regs);
@@ -911,7 +911,7 @@ asmlinkage void do_notify_resume(struct
 		rseq_handle_notify_resume(NULL, regs);
 	}
 
-	user_enter();
+	ct_user_enter();
 }
 
 #if defined(CONFIG_SMP) && defined(CONFIG_MIPS_FP_SUPPORT)
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -154,7 +154,7 @@ static inline void interrupt_enter_prepa
 
 	if (user_mode(regs)) {
 		CT_WARN_ON(ct_state() != CONTEXT_USER);
-		user_exit_irqoff();
+		ct_user_exit_irqoff();
 
 		account_cpu_user_entry();
 		account_stolen_time();
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -91,7 +91,7 @@ notrace long system_call_exception(long
 	trace_hardirqs_off(); /* finish reconciling */
 
 	CT_WARN_ON(ct_state() == CONTEXT_KERNEL);
-	user_exit_irqoff();
+	ct_user_exit_irqoff();
 
 	BUG_ON(regs_is_unrecoverable(regs));
 	BUG_ON(!(regs->msr & MSR_PR));
@@ -388,9 +388,9 @@ interrupt_exit_user_prepare_main(unsigne
 
 	check_return_regs_valid(regs);
 
-	user_enter_irqoff();
+	ct_user_enter_irqoff();
 	if (!prep_irq_for_enabled_exit(true)) {
-		user_exit_irqoff();
+		ct_user_exit_irqoff();
 		local_irq_enable();
 		local_irq_disable();
 		goto again;
@@ -489,7 +489,7 @@ notrace unsigned long syscall_exit_resta
 #endif
 
 	trace_hardirqs_off();
-	user_exit_irqoff();
+	ct_user_exit_irqoff();
 	account_cpu_user_entry();
 
 	BUG_ON(!user_mode(regs));
@@ -638,7 +638,7 @@ notrace unsigned long interrupt_exit_use
 #endif
 
 	trace_hardirqs_off();
-	user_exit_irqoff();
+	ct_user_exit_irqoff();
 	account_cpu_user_entry();
 
 	BUG_ON(!user_mode(regs));
--- a/arch/sparc/kernel/ptrace_64.c
+++ b/arch/sparc/kernel/ptrace_64.c
@@ -1092,7 +1092,7 @@ asmlinkage int syscall_trace_enter(struc
 	secure_computing_strict(regs->u_regs[UREG_G1]);
 
 	if (test_thread_flag(TIF_NOHZ))
-		user_exit();
+		ct_user_exit();
 
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		ret = tracehook_report_syscall_entry(regs);
@@ -1110,7 +1110,7 @@ asmlinkage int syscall_trace_enter(struc
 asmlinkage void syscall_trace_leave(struct pt_regs *regs)
 {
 	if (test_thread_flag(TIF_NOHZ))
-		user_exit();
+		ct_user_exit();
 
 	audit_syscall_exit(regs);
 
@@ -1121,7 +1121,7 @@ asmlinkage void syscall_trace_leave(stru
 		tracehook_report_syscall_exit(regs, 0);
 
 	if (test_thread_flag(TIF_NOHZ))
-		user_enter();
+		ct_user_enter();
 }
 
 /**
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -546,14 +546,14 @@ static void do_signal(struct pt_regs *re
 
 void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long thread_info_flags)
 {
-	user_exit();
+	ct_user_exit();
 	if (thread_info_flags & _TIF_UPROBE)
 		uprobe_notify_resume(regs);
 	if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 		do_signal(regs, orig_i0);
 	if (thread_info_flags & _TIF_NOTIFY_RESUME)
 		tracehook_notify_resume(regs);
-	user_enter();
+	ct_user_enter();
 }
 
 /*
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -22,26 +22,26 @@ extern void context_tracking_exit(enum c
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
-static inline void user_enter(void)
+static inline void ct_user_enter(void)
 {
 	if (context_tracking_enabled())
 		context_tracking_enter(CONTEXT_USER);
 
 }
-static inline void user_exit(void)
+static inline void ct_user_exit(void)
 {
 	if (context_tracking_enabled())
 		context_tracking_exit(CONTEXT_USER);
 }
 
 /* Called with interrupts disabled.  */
-static __always_inline void user_enter_irqoff(void)
+static __always_inline void ct_user_enter_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_enter(CONTEXT_USER);
 
 }
-static __always_inline void user_exit_irqoff(void)
+static __always_inline void ct_user_exit_irqoff(void)
 {
 	if (context_tracking_enabled())
 		__context_tracking_exit(CONTEXT_USER);
@@ -98,10 +98,10 @@ static __always_inline enum ctx_state ct
 		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
 }
 #else
-static inline void user_enter(void) { }
-static inline void user_exit(void) { }
-static inline void user_enter_irqoff(void) { }
-static inline void user_exit_irqoff(void) { }
+static inline void ct_user_enter(void) { }
+static inline void ct_user_exit(void) { }
+static inline void ct_user_enter_irqoff(void) { }
+static inline void ct_user_exit_irqoff(void) { }
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -9,7 +9,7 @@ struct context_tracking {
 	/*
 	 * When active is false, probes are unset in order
 	 * to minimize overhead: TIF flags are cleared
-	 * and calls to user_enter/exit are ignored. This
+	 * and calls to ct_user_enter/exit are ignored. This
 	 * may be further optimized using static keys.
 	 */
 	bool active;
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -73,7 +73,7 @@ void noinstr __context_tracking_enter(en
 			 * At this stage, only low level arch entry code remains and
 			 * then we'll run in userspace. We can assume there won't be
 			 * any RCU read-side critical section until the next call to
-			 * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
+			 * ct_user_exit() or rcu_irq_enter(). Let's remove RCU's dependency
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
@@ -127,7 +127,7 @@ EXPORT_SYMBOL_GPL(context_tracking_enter
 
 void context_tracking_user_enter(void)
 {
-	user_enter();
+	ct_user_enter();
 }
 NOKPROBE_SYMBOL(context_tracking_user_enter);
 
@@ -184,7 +184,7 @@ EXPORT_SYMBOL_GPL(context_tracking_exit)
 
 void context_tracking_user_exit(void)
 {
-	user_exit();
+	ct_user_exit();
 }
 NOKPROBE_SYMBOL(context_tracking_user_exit);
 
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -19,7 +19,7 @@ static __always_inline void __enter_from
 	lockdep_hardirqs_off(CALLER_ADDR0);
 
 	CT_WARN_ON(ct_state() != CONTEXT_USER);
-	user_exit_irqoff();
+	ct_user_exit_irqoff();
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
@@ -127,7 +127,7 @@ static __always_inline void __exit_to_us
 	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	instrumentation_end();
 
-	user_enter_irqoff();
+	ct_user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -51,7 +51,7 @@ static void klp_sync(struct work_struct
 
 /*
  * We allow to patch also functions where RCU is not watching,
- * e.g. before user_exit(). We can not rely on the RCU infrastructure
+ * e.g. before ct_user_exit(). We can not rely on the RCU infrastructure
  * to do the synchronization. Instead hard force the sched synchronization.
  *
  * This approach allows to use RCU functions for manipulating func_stack
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6444,7 +6444,7 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedul
  * recursion and tracing preempt enabling caused by the tracing
  * infrastructure itself. But as tracing can happen in areas coming
  * from userspace or just about to enter userspace, a preempt enable
- * can occur before user_exit() is called. This will cause the scheduler
+ * can occur before ct_user_exit() is called. This will cause the scheduler
  * to be called when the system is still in usermode.
  *
  * To prevent this, the preempt_enable_notrace will use this function
@@ -6475,7 +6475,7 @@ asmlinkage __visible void __sched notrac
 		preempt_disable_notrace();
 		preempt_latency_start(1);
 		/*
-		 * Needs preempt disabled in case user_exit() is traced
+		 * Needs preempt disabled in case ct_user_exit() is traced
 		 * and the tracer calls preempt_enable_notrace() causing
 		 * an infinite recursion.
 		 */
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2996,7 +2996,7 @@ int ftrace_shutdown(struct ftrace_ops *o
 		 * We need to do a hard force of sched synchronization.
 		 * This is because we use preempt_disable() to do RCU, but
 		 * the function tracers can be called where RCU is not watching
-		 * (like before user_exit()). We can not rely on the RCU
+		 * (like before ct_user_exit()). We can not rely on the RCU
 		 * infrastructure to do the synchronization, thus we must do it
 		 * ourselves.
 		 */
@@ -5981,7 +5981,7 @@ ftrace_graph_release(struct inode *inode
 		 * We need to do a hard force of sched synchronization.
 		 * This is because we use preempt_disable() to do RCU, but
 		 * the function tracers can be called where RCU is not watching
-		 * (like before user_exit()). We can not rely on the RCU
+		 * (like before ct_user_exit()). We can not rely on the RCU
 		 * infrastructure to do the synchronization, thus we must do it
 		 * ourselves.
 		 */



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC][PATCH v2 07/11] context_tracking: Add an atomic sequence/state count
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (5 preceding siblings ...)
  2021-09-29 15:17 ` [RFC][PATCH v2 06/11] context_tracking: Prefix user_{enter,exit}*() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-09-29 15:17 ` [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking Peter Zijlstra
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Similar to the dynticks RCU count, add a sequence count that tracks
the USER/GUEST and NMI state. Unlike RCU, use a few more state bits.

It would be possible, like dynticks RCU, to fold the USER and NMI
bits; for now don't bother and keep them explicit (doing this wouldn't
be terribly difficult, it would require __context_tracking_nmi_{enter,exit}()
to conditionally update the state).

Additionally, use bit0 to indicate there's additional work to be done
on leaving the 'USER' state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/context_tracking.h       |   60 +++++++++++++++++++++
 include/linux/context_tracking_state.h |    3 +
 kernel/context_tracking.c              |   93 +++++++++++++++++++++++++++++++++
 kernel/entry/common.c                  |    2 
 4 files changed, 158 insertions(+)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -9,19 +9,47 @@
 
 #include <asm/ptrace.h>
 
+enum ct_work {
+	CT_WORK_n = 0,
+};
+
+/*
+ * context_tracking::seq
+ *
+ * bit0 - work
+ * bit1 - nmi
+ * bit2 - user
+ *
+ */
+enum ct_seq_state {
+	CT_SEQ_WORK = 0x01,
+	CT_SEQ_NMI  = 0x02,
+	CT_SEQ_USER = 0x04,
+	CT_SEQ      = 0x08,
+};
+
+static __always_inline bool __context_tracking_seq_in_user(unsigned int seq)
+{
+	return (seq & (CT_SEQ_USER | CT_SEQ_NMI)) == CT_SEQ_USER;
+}
 
 #ifdef CONFIG_CONTEXT_TRACKING
+
 extern void context_tracking_cpu_set(int cpu);
 
 /* Called with interrupts disabled.  */
 extern void __context_tracking_enter(enum ctx_state state);
 extern void __context_tracking_exit(enum ctx_state state);
+extern void __context_tracking_nmi_enter(void);
+extern void __context_tracking_nmi_exit(void);
 
 extern void context_tracking_enter(enum ctx_state state);
 extern void context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
+extern bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work);
+
 static inline void ct_user_enter(void)
 {
 	if (context_tracking_enabled())
@@ -47,6 +75,17 @@ static __always_inline void ct_user_exit
 		__context_tracking_exit(CONTEXT_USER);
 }
 
+static __always_inline void ct_nmi_enter_irqoff(void)
+{
+	if (context_tracking_enabled())
+		__context_tracking_nmi_enter();
+}
+static __always_inline void ct_nmi_exit_irqoff(void)
+{
+	if (context_tracking_enabled())
+		__context_tracking_nmi_exit();
+}
+
 static inline enum ctx_state exception_enter(void)
 {
 	enum ctx_state prev_ctx;
@@ -97,19 +136,40 @@ static __always_inline enum ctx_state ct
 	return context_tracking_enabled() ?
 		this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
 }
+
+static __always_inline unsigned int __context_tracking_cpu_seq(unsigned int cpu)
+{
+	return arch_atomic_read(per_cpu_ptr(&context_tracking.seq, cpu));
+}
+
 #else
 static inline void ct_user_enter(void) { }
 static inline void ct_user_exit(void) { }
 static inline void ct_user_enter_irqoff(void) { }
 static inline void ct_user_exit_irqoff(void) { }
+static inline void ct_nmi_enter_irqoff(void) { }
+static inline void ct_nmi_exit_irqoff(void) { }
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
 static inline bool context_tracking_guest_enter(void) { return false; }
 static inline void context_tracking_guest_exit(void) { }
 
+static inline bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+
+static __always_inline unsigned int __context_tracking_cpu_seq(unsigned int cpu)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
+static __always_inline bool context_tracking_cpu_in_user(unsigned int cpu)
+{
+	unsigned int seq = __context_tracking_cpu_seq(cpu);
+	return __context_tracking_seq_in_user(seq);
+}
+
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
 
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -4,6 +4,7 @@
 
 #include <linux/percpu.h>
 #include <linux/static_key.h>
+#include <linux/types.h>
 
 struct context_tracking {
 	/*
@@ -13,6 +14,8 @@ struct context_tracking {
 	 * may be further optimized using static keys.
 	 */
 	bool active;
+	atomic_t seq;
+	atomic_t work;
 	int recursion;
 	enum ctx_state {
 		CONTEXT_DISABLED = -1,	/* returned by ct_state() if unknown */
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -50,6 +50,85 @@ static __always_inline void context_trac
 	__this_cpu_dec(context_tracking.recursion);
 }
 
+/* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */
+static noinstr void ct_exit_user_work(struct context_tracking *ct)
+{
+	unsigned int work = arch_atomic_read(&ct->work);
+
+#if 0
+	if (work & CT_WORK_n) {
+		/* NMI happens here and must still do/finish CT_WORK_n */
+		do_work_n();
+
+		smp_mb__before_atomic();
+		arch_atomic_andnot(CT_WORK_n, &ct->work);
+	}
+#endif
+
+	smp_mb__before_atomic();
+	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
+}
+
+/* all CPU local */
+
+static __always_inline unsigned int ct_seq_nmi_enter(struct context_tracking *ct)
+{
+	unsigned int seq = arch_atomic_add_return(CT_SEQ_NMI, &ct->seq);
+	if (seq & CT_SEQ_WORK) /* NMI-enter is USER-exit */
+		ct_exit_user_work(ct);
+	return seq;
+}
+
+static __always_inline unsigned int ct_seq_nmi_exit(struct context_tracking *ct)
+{
+	arch_atomic_set(&ct->work, 0);
+	return arch_atomic_add_return(CT_SEQ - CT_SEQ_NMI, &ct->seq);
+}
+
+static __always_inline unsigned int ct_seq_user_enter(struct context_tracking *ct)
+{
+	arch_atomic_set(&ct->work, 0);
+	return arch_atomic_add_return(CT_SEQ_USER, &ct->seq);
+}
+
+static __always_inline unsigned int ct_seq_user_exit(struct context_tracking *ct)
+{
+	unsigned int seq = arch_atomic_add_return(CT_SEQ - CT_SEQ_USER, &ct->seq);
+	if (seq & CT_SEQ_WORK)
+		ct_exit_user_work(ct);
+	return seq;
+}
+
+/* remote */
+
+/*
+ * When this returns true: guarantees that the CPU will call @work.
+ */
+static bool ct_seq_set_user_work(struct context_tracking *ct, unsigned int work)
+{
+	unsigned int seq;
+	bool ret = false;
+
+	if (!context_tracking_enabled() || !ct->active)
+		return false;
+
+	preempt_disable();
+	seq = atomic_read(&ct->seq);
+	if (__context_tracking_seq_in_user(seq)) {
+		/* ctrl-dep */
+		atomic_or(work, &ct->work);
+		ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
+	}
+	preempt_enable();
+
+	return ret;
+}
+
+bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+	return ct_seq_set_user_work(per_cpu_ptr(&context_tracking, cpu), work);
+}
+
 /**
  * context_tracking_enter - Inform the context tracking that the CPU is going
  *                          enter user or guest space mode.
@@ -83,6 +162,7 @@ void noinstr __context_tracking_enter(en
 				instrumentation_end();
 			}
 			rcu_user_enter();
+			ct_seq_user_enter(raw_cpu_ptr(&context_tracking));
 		}
 		/*
 		 * Even if context tracking is disabled on this CPU, because it's outside
@@ -154,6 +234,7 @@ void noinstr __context_tracking_exit(enu
 			 * We are going to run code that may use RCU. Inform
 			 * RCU core about that (ie: we may need the tick again).
 			 */
+			ct_seq_user_exit(raw_cpu_ptr(&context_tracking));
 			rcu_user_exit();
 			if (state == CONTEXT_USER) {
 				instrumentation_begin();
@@ -188,6 +269,18 @@ void context_tracking_user_exit(void)
 }
 NOKPROBE_SYMBOL(context_tracking_user_exit);
 
+void noinstr __context_tracking_nmi_enter(void)
+{
+	if (__this_cpu_read(context_tracking.active))
+		ct_seq_nmi_enter(raw_cpu_ptr(&context_tracking));
+}
+
+void noinstr __context_tracking_nmi_exit(void)
+{
+	if (__this_cpu_read(context_tracking.active))
+		ct_seq_nmi_exit(raw_cpu_ptr(&context_tracking));
+}
+
 void __init context_tracking_cpu_set(int cpu)
 {
 	static __initdata bool initialized = false;
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -451,6 +451,7 @@ irqentry_state_t noinstr irqentry_nmi_en
 	__nmi_enter();
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	lockdep_hardirq_enter();
+	ct_nmi_enter_irqoff();
 	rcu_nmi_enter();
 
 	instrumentation_begin();
@@ -472,6 +473,7 @@ void noinstr irqentry_nmi_exit(struct pt
 	instrumentation_end();
 
 	rcu_nmi_exit();
+	ct_nmi_exit_irqoff();
 	lockdep_hardirq_exit();
 	if (irq_state.lockdep)
 		lockdep_hardirqs_on(CALLER_ADDR0);




* [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (6 preceding siblings ...)
  2021-09-29 15:17 ` [RFC][PATCH v2 07/11] context_tracking: Add an atomic sequence/state count Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-09-29 18:37   ` Paul E. McKenney
  2021-09-29 18:54   ` Peter Zijlstra
  2021-09-29 15:17 ` [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL Peter Zijlstra
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

XXX I'm pretty sure I broke task-trace-rcu.
XXX trace_rcu_*() now gets an unconditional 0

Other than that, it seems like a fairly straightforward replacement
of the RCU count with the context_tracking count.

Using context-tracking for this avoids having two (expensive) atomic
ops on the entry paths where one will do.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/context_tracking.h |    6 ++
 kernel/context_tracking.c        |   14 +++++
 kernel/rcu/tree.c                |  101 +++++----------------------------------
 kernel/rcu/tree.h                |    1 
 4 files changed, 33 insertions(+), 89 deletions(-)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -50,6 +50,9 @@ extern void context_tracking_user_exit(v
 
 extern bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work);
 
+extern void context_tracking_idle(void);
+extern void context_tracking_online(void);
+
 static inline void ct_user_enter(void)
 {
 	if (context_tracking_enabled())
@@ -162,6 +165,9 @@ static __always_inline unsigned int __co
 	return 0;
 }
 
+static inline void context_tracking_idle(void) { }
+static inline void context_tracking_online(void) { }
+
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
 static __always_inline bool context_tracking_cpu_in_user(unsigned int cpu)
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -281,6 +281,20 @@ void noinstr __context_tracking_nmi_exit
 		ct_seq_nmi_exit(raw_cpu_ptr(&context_tracking));
 }
 
+void context_tracking_online(void)
+{
+	struct context_tracking *ct = raw_cpu_ptr(&context_tracking);
+	unsigned int seq = atomic_read(&ct->seq);
+
+	if (__context_tracking_seq_in_user(seq))
+		atomic_add_return(CT_SEQ - CT_SEQ_USER, &ct->seq);
+}
+
+void context_tracking_idle(void)
+{
+	atomic_add_return(CT_SEQ, &raw_cpu_ptr(&context_tracking)->seq);
+}
+
 void __init context_tracking_cpu_set(int cpu)
 {
 	static __initdata bool initialized = false;
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -62,6 +62,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
 #include <linux/kasan.h>
+#include <linux/context_tracking.h>
 #include "../time/tick-internal.h"
 
 #include "tree.h"
@@ -77,7 +78,6 @@
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
 	.dynticks_nesting = 1,
 	.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
-	.dynticks = ATOMIC_INIT(1),
 #ifdef CONFIG_RCU_NOCB_CPU
 	.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
 #endif
@@ -252,56 +252,6 @@ void rcu_softirq_qs(void)
 }
 
 /*
- * Increment the current CPU's rcu_data structure's ->dynticks field
- * with ordering.  Return the new value.
- */
-static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
-{
-	return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
-}
-
-/*
- * Record entry into an extended quiescent state.  This is only to be
- * called when not already in an extended quiescent state, that is,
- * RCU is watching prior to the call to this function and is no longer
- * watching upon return.
- */
-static noinstr void rcu_dynticks_eqs_enter(void)
-{
-	int seq;
-
-	/*
-	 * CPUs seeing atomic_add_return() must see prior RCU read-side
-	 * critical sections, and we also must force ordering with the
-	 * next idle sojourn.
-	 */
-	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
-	seq = rcu_dynticks_inc(1);
-	// RCU is no longer watching.  Better be in extended quiescent state!
-	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
-}
-
-/*
- * Record exit from an extended quiescent state.  This is only to be
- * called from an extended quiescent state, that is, RCU is not watching
- * prior to the call to this function and is watching upon return.
- */
-static noinstr void rcu_dynticks_eqs_exit(void)
-{
-	int seq;
-
-	/*
-	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
-	 * and we also must force ordering with the next RCU read-side
-	 * critical section.
-	 */
-	seq = rcu_dynticks_inc(1);
-	// RCU is now watching.  Better not be in an extended quiescent state!
-	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
-	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
-}
-
-/*
  * Reset the current CPU's ->dynticks counter to indicate that the
  * newly onlined CPU is no longer in an extended quiescent state.
  * This will either leave the counter unchanged, or increment it
@@ -313,11 +263,7 @@ static noinstr void rcu_dynticks_eqs_exi
  */
 static void rcu_dynticks_eqs_online(void)
 {
-	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
-
-	if (atomic_read(&rdp->dynticks) & 0x1)
-		return;
-	rcu_dynticks_inc(1);
+	context_tracking_online();
 }
 
 /*
@@ -327,7 +273,7 @@ static void rcu_dynticks_eqs_online(void
  */
 static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
 {
-	return !(atomic_read(this_cpu_ptr(&rcu_data.dynticks)) & 0x1);
+	return context_tracking_cpu_in_user(smp_processor_id());
 }
 
 /*
@@ -337,7 +283,7 @@ static __always_inline bool rcu_dynticks
 static int rcu_dynticks_snap(struct rcu_data *rdp)
 {
 	smp_mb();  // Fundamental RCU ordering guarantee.
-	return atomic_read_acquire(&rdp->dynticks);
+	return __context_tracking_cpu_seq(rdp->cpu);
 }
 
 /*
@@ -346,7 +292,7 @@ static int rcu_dynticks_snap(struct rcu_
  */
 static bool rcu_dynticks_in_eqs(int snap)
 {
-	return !(snap & 0x1);
+	return __context_tracking_seq_in_user(snap);
 }
 
 /* Return true if the specified CPU is currently idle from an RCU viewpoint.  */
@@ -377,7 +323,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
 	int snap;
 
 	// If not quiescent, force back to earlier extended quiescent state.
-	snap = atomic_read(&rdp->dynticks) & ~0x1;
+	snap = __context_tracking_cpu_seq(rdp->cpu) & ~0x7;
 
 	smp_rmb(); // Order ->dynticks and *vp reads.
 	if (READ_ONCE(*vp))
@@ -385,7 +331,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
 	smp_rmb(); // Order *vp read and ->dynticks re-read.
 
 	// If still in the same extended quiescent state, we are good!
-	return snap == atomic_read(&rdp->dynticks);
+	return snap == __context_tracking_cpu_seq(rdp->cpu);
 }
 
 /*
@@ -401,12 +347,8 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
  */
 notrace void rcu_momentary_dyntick_idle(void)
 {
-	int seq;
-
 	raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
-	seq = rcu_dynticks_inc(2);
-	/* It is illegal to call this from idle state. */
-	WARN_ON_ONCE(!(seq & 0x1));
+	context_tracking_idle();
 	rcu_preempt_deferred_qs(current);
 }
 EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
@@ -622,18 +564,15 @@ static noinstr void rcu_eqs_enter(bool u
 
 	lockdep_assert_irqs_disabled();
 	instrumentation_begin();
-	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
+	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, 0);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	rcu_prepare_for_idle();
 	rcu_preempt_deferred_qs(current);
 
-	// instrumentation for the noinstr rcu_dynticks_eqs_enter()
-	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
 
 	instrumentation_end();
 	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
 	// RCU is watching here ...
-	rcu_dynticks_eqs_enter();
 	// ... but is no longer watching here.
 	rcu_dynticks_task_enter();
 }
@@ -756,8 +695,7 @@ noinstr void rcu_nmi_exit(void)
 	 * leave it in non-RCU-idle state.
 	 */
 	if (rdp->dynticks_nmi_nesting != 1) {
-		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
-				  atomic_read(&rdp->dynticks));
+		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, 0);
 		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
 			   rdp->dynticks_nmi_nesting - 2);
 		instrumentation_end();
@@ -765,18 +703,15 @@ noinstr void rcu_nmi_exit(void)
 	}
 
 	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
-	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
+	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, 0);
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
 
 	if (!in_nmi())
 		rcu_prepare_for_idle();
 
-	// instrumentation for the noinstr rcu_dynticks_eqs_enter()
-	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
 	instrumentation_end();
 
 	// RCU is watching here ...
-	rcu_dynticks_eqs_enter();
 	// ... but is no longer watching here.
 
 	if (!in_nmi())
@@ -865,15 +800,11 @@ static void noinstr rcu_eqs_exit(bool us
 	}
 	rcu_dynticks_task_exit();
 	// RCU is not watching here ...
-	rcu_dynticks_eqs_exit();
 	// ... but is watching here.
 	instrumentation_begin();
 
-	// instrumentation for the noinstr rcu_dynticks_eqs_exit()
-	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
-
 	rcu_cleanup_after_idle();
-	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
+	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, 0);
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	WRITE_ONCE(rdp->dynticks_nesting, 1);
 	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
@@ -1011,7 +942,6 @@ noinstr void rcu_nmi_enter(void)
 			rcu_dynticks_task_exit();
 
 		// RCU is not watching here ...
-		rcu_dynticks_eqs_exit();
 		// ... but is watching here.
 
 		if (!in_nmi()) {
@@ -1021,11 +951,6 @@ noinstr void rcu_nmi_enter(void)
 		}
 
 		instrumentation_begin();
-		// instrumentation for the noinstr rcu_dynticks_curr_cpu_in_eqs()
-		instrument_atomic_read(&rdp->dynticks, sizeof(rdp->dynticks));
-		// instrumentation for the noinstr rcu_dynticks_eqs_exit()
-		instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
-
 		incby = 1;
 	} else if (!in_nmi()) {
 		instrumentation_begin();
@@ -1036,7 +961,7 @@ noinstr void rcu_nmi_enter(void)
 
 	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
 			  rdp->dynticks_nmi_nesting,
-			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
+			  rdp->dynticks_nmi_nesting + incby, 0);
 	instrumentation_end();
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
 		   rdp->dynticks_nmi_nesting + incby);
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -184,7 +184,6 @@ struct rcu_data {
 	int dynticks_snap;		/* Per-GP tracking for dynticks. */
 	long dynticks_nesting;		/* Track process nesting level. */
 	long dynticks_nmi_nesting;	/* Track irq/NMI nesting level. */
-	atomic_t dynticks;		/* Even value for idle, else odd. */
 	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
 	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
 	bool rcu_forced_tick;		/* Forced tick to provide QS. */




* [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (7 preceding siblings ...)
  2021-09-29 15:17 ` [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-10-06  8:12   ` Petr Mladek
  2021-09-29 15:17 ` [RFC][PATCH v2 10/11] livepatch: Remove klp_synchronize_transition() Peter Zijlstra
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Using the new context_tracking infrastructure, avoid disturbing
userspace tasks when context tracking is enabled.

When context_tracking_set_cpu_work() returns true, we have the
guarantee that klp_update_patch_state() is called from noinstr code
before it runs normal kernel code. This covers
syscall/exceptions/interrupts and NMI entry.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/context_tracking.h |    2 +-
 include/linux/livepatch.h        |    2 ++
 kernel/context_tracking.c        |   11 +++++------
 kernel/livepatch/transition.c    |   29 ++++++++++++++++++++++++++---
 4 files changed, 34 insertions(+), 10 deletions(-)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,7 +10,7 @@
 #include <asm/ptrace.h>
 
 enum ct_work {
-	CT_WORK_n = 0,
+	CT_WORK_KLP = 1,
 };
 
 /*
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -201,6 +201,7 @@ void klp_module_going(struct module *mod
 
 void klp_copy_process(struct task_struct *child);
 void klp_update_patch_state(struct task_struct *task);
+void __klp_update_patch_state(struct task_struct *task);
 
 static inline bool klp_patch_pending(struct task_struct *task)
 {
@@ -242,6 +243,7 @@ static inline int klp_module_coming(stru
 static inline void klp_module_going(struct module *mod) {}
 static inline bool klp_patch_pending(struct task_struct *task) { return false; }
 static inline void klp_update_patch_state(struct task_struct *task) {}
+static inline void __klp_update_patch_state(struct task_struct *task) {}
 static inline void klp_copy_process(struct task_struct *child) {}
 
 static inline
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/livepatch.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -55,15 +56,13 @@ static noinstr void ct_exit_user_work(struct
 {
 	unsigned int work = arch_atomic_read(&ct->work);
 
-#if 0
-	if (work & CT_WORK_n) {
+	if (work & CT_WORK_KLP) {
 		/* NMI happens here and must still do/finish CT_WORK_n */
-		do_work_n();
+		__klp_update_patch_state(current);
 
 		smp_mb__before_atomic();
-		arch_atomic_andnot(CT_WORK_n, &ct->work);
+		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
 	}
-#endif
 
 	smp_mb__before_atomic();
 	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -10,6 +10,7 @@
 #include <linux/cpu.h>
 #include <linux/stacktrace.h>
 #include <linux/tracehook.h>
+#include <linux/context_tracking.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -153,6 +154,11 @@ void klp_cancel_transition(void)
 	klp_complete_transition();
 }
 
+noinstr void __klp_update_patch_state(struct task_struct *task)
+{
+	task->patch_state = READ_ONCE(klp_target_state);
+}
+
 /*
  * Switch the patched state of the task to the set of functions in the target
  * patch state.
@@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_
 	 *    of func->transition, if klp_ftrace_handler() is called later on
 	 *    the same CPU.  See __klp_disable_patch().
 	 */
-	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
+	if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) {
 		task->patch_state = READ_ONCE(klp_target_state);
+		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
+	}
 
 	preempt_enable_notrace();
 }
@@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str
 {
 	int ret;
 
-	if (task_curr(task))
+	if (task_curr(task)) {
+		/*
+		 * This only succeeds when the task is in NOHZ_FULL user
+		 * mode, the true return value guarantees any kernel entry
+		 * will call klp_update_patch_state().
+		 *
+		 * XXX: ideally we'd simply return 0 here and leave
+		 * TIF_PATCH_PENDING alone, to be fixed up by
+		 * klp_update_patch_state(), except livepatching goes wobbly
+		 * with 'pending' TIF bits on.
+		 */
+		if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP))
+			goto clear;
+
 		return -EBUSY;
+	}
 
 	ret = klp_check_stack(task, arg);
 	if (ret)
 		return ret;
 
-	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	task->patch_state = klp_target_state;
+clear:
+	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
 	return 0;
 }
 




* [RFC][PATCH v2 10/11] livepatch: Remove klp_synchronize_transition()
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (8 preceding siblings ...)
  2021-09-29 15:17 ` [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-10-06 12:30   ` Petr Mladek
  2021-09-29 15:17 ` [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL Peter Zijlstra
  2021-09-29 18:03 ` [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Paul E. McKenney
  11 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

The whole premise is broken: any trace usage outside of RCU is a BUG
and should be fixed. Notably, all code prior to (and a fair bit after)
ct_user_exit() is noinstr and explicitly doesn't allow any tracing.

Use regular RCU, which has the benefit of actually respecting
NOHZ_FULL.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/livepatch/transition.c |   32 +++-----------------------------
 1 file changed, 3 insertions(+), 29 deletions(-)

--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -42,28 +42,6 @@ static void klp_transition_work_fn(struc
 static DECLARE_DELAYED_WORK(klp_transition_work, klp_transition_work_fn);
 
 /*
- * This function is just a stub to implement a hard force
- * of synchronize_rcu(). This requires synchronizing
- * tasks even in userspace and idle.
- */
-static void klp_sync(struct work_struct *work)
-{
-}
-
-/*
- * We allow to patch also functions where RCU is not watching,
- * e.g. before ct_user_exit(). We can not rely on the RCU infrastructure
- * to do the synchronization. Instead hard force the sched synchronization.
- *
- * This approach allows to use RCU functions for manipulating func_stack
- * safely.
- */
-static void klp_synchronize_transition(void)
-{
-	schedule_on_each_cpu(klp_sync);
-}
-
-/*
  * The transition to the target patch state is complete.  Clean up the data
  * structures.
  */
@@ -96,7 +74,7 @@ static void klp_complete_transition(void
 		 * func->transition gets cleared, the handler may choose a
 		 * removed function.
 		 */
-		klp_synchronize_transition();
+		synchronize_rcu();
 	}
 
 	klp_for_each_object(klp_transition_patch, obj)
@@ -105,7 +83,7 @@ static void klp_complete_transition(void
 
 	/* Prevent klp_ftrace_handler() from seeing KLP_UNDEFINED state */
 	if (klp_target_state == KLP_PATCHED)
-		klp_synchronize_transition();
+		synchronize_rcu();
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task) {
@@ -168,10 +146,6 @@ noinstr void __klp_update_patch_state(st
  */
 void klp_update_patch_state(struct task_struct *task)
 {
-	/*
-	 * A variant of synchronize_rcu() is used to allow patching functions
-	 * where RCU is not watching, see klp_synchronize_transition().
-	 */
 	preempt_disable_notrace();
 
 	/*
@@ -626,7 +600,7 @@ void klp_reverse_transition(void)
 		clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);
 
 	/* Let any remaining calls to klp_update_patch_state() complete */
-	klp_synchronize_transition();
+	synchronize_rcu();
 
 	klp_start_transition();
 }
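For reference, the helper being deleted forced synchronization by scheduling a no-op work item on every CPU. A rough userspace model of that idiom, using one thread per "CPU" (the names and the thread-per-CPU mapping are illustrative, not the kernel mechanism):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

static atomic_int cpus_synced;

/* Stub body: merely running proves this "CPU" reached a scheduling point. */
static void *klp_sync_stub(void *arg)
{
	(void)arg;
	atomic_fetch_add(&cpus_synced, 1);
	return NULL;
}

/*
 * Model of schedule_on_each_cpu(klp_sync): queue the no-op on every CPU
 * and wait for all of them to complete, which acts as a heavyweight sync
 * barrier even for CPUs where RCU is not watching.
 */
static void klp_synchronize_transition_model(int ncpus)
{
	pthread_t th[64];

	assert(ncpus <= 64);
	for (int i = 0; i < ncpus; i++)
		pthread_create(&th[i], NULL, klp_sync_stub, NULL);
	for (int i = 0; i < ncpus; i++)
		pthread_join(th[i], NULL);
}
```

The point of the patch is that once patching works in RCU-watching contexts only, this heavyweight barrier collapses to a plain synchronize_rcu().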



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (9 preceding siblings ...)
  2021-09-29 15:17 ` [RFC][PATCH v2 10/11] livepatch: Remove klp_synchronize_transition() Peter Zijlstra
@ 2021-09-29 15:17 ` Peter Zijlstra
  2021-10-21 18:39   ` Marcelo Tosatti
  2021-09-29 18:03 ` [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Paul E. McKenney
  11 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 15:17 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, peterz, joe.lawrence, fweisbec, tglx, hca, svens,
	sumanthk, live-patching, paulmck, rostedt, x86

Use the new context_tracking infrastructure to avoid disturbing
userspace tasks when we rewrite kernel code.

XXX re-audit the entry code to make sure only the context_tracking
static_branch is before hitting this code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/sync_core.h |    2 ++
 arch/x86/kernel/alternative.c    |    8 +++++++-
 include/linux/context_tracking.h |    1 +
 kernel/context_tracking.c        |   12 ++++++++++++
 4 files changed, 22 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -87,6 +87,8 @@ static inline void sync_core(void)
 	 */
 	iret_to_self();
 }
+#define sync_core sync_core
+
 
 /*
  * Ensure that a core serializing instruction is issued before returning
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/context_tracking.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -924,9 +925,14 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_cond(int cpu, void *info)
+{
+	return !context_tracking_set_cpu_work(cpu, CT_WORK_SYNC);
+}
+
 void text_poke_sync(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	on_each_cpu_cond(do_sync_core_cond, do_sync_core, NULL, 1);
 }
 
 struct text_poke_loc {
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -11,6 +11,7 @@
 
 enum ct_work {
 	CT_WORK_KLP = 1,
+	CT_WORK_SYNC = 2,
 };
 
 /*
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -51,6 +51,10 @@ static __always_inline void context_trac
 	__this_cpu_dec(context_tracking.recursion);
 }
 
+#ifndef sync_core
+static inline void sync_core(void) { }
+#endif
+
 /* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */
 static noinstr void ct_exit_user_work(struct context_tracking *ct)
 {
@@ -64,6 +68,14 @@ static noinstr void ct_exit_user_work(struct
 		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
 	}
 
+	if (work & CT_WORK_SYNC) {
+		/* NMI happens here and must still do/finish CT_WORK_n */
+		sync_core();
+
+		smp_mb__before_atomic();
+		arch_atomic_andnot(CT_WORK_SYNC, &ct->work);
+	}
+
 	smp_mb__before_atomic();
 	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
 }
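The set-work/clear-work protocol above can be modeled in plain C11 atomics. Everything here (names, bit values) is illustrative and deliberately simplified relative to the kernel's actual encoding:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define M_SEQ_USER  0x1u	/* illustrative: CPU currently in user/EQS */
#define M_WORK_SYNC 0x2u	/* illustrative stand-in for CT_WORK_SYNC */

static atomic_uint ct_word;	/* one such word per CPU in the real scheme */

/*
 * Remote side (text_poke_sync): try to queue the work instead of
 * sending an IPI. Succeeds only while the target sits in user/EQS.
 */
static bool set_cpu_work(unsigned int work)
{
	unsigned int old = atomic_load(&ct_word);

	do {
		if (!(old & M_SEQ_USER))
			return false;	/* in kernel: caller must IPI */
	} while (!atomic_compare_exchange_weak(&ct_word, &old, old | work));
	return true;
}

/*
 * CPU-local side (ct_exit_user_work): on kernel entry, clear the work
 * bit and the user state; a true return means the deferred sync_core()
 * equivalent must run here.
 */
static bool exit_user_work(void)
{
	unsigned int old = atomic_fetch_and(&ct_word,
					    ~(M_WORK_SYNC | M_SEQ_USER));
	return old & M_WORK_SYNC;
}
```

The cmpxchg loop is what makes do_sync_core_cond() safe: the work bit can only be set while the target is provably outside the kernel, so skipping the IPI never loses a required serialization.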




* Re: [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL
  2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
                   ` (10 preceding siblings ...)
  2021-09-29 15:17 ` [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL Peter Zijlstra
@ 2021-09-29 18:03 ` Paul E. McKenney
  11 siblings, 0 replies; 51+ messages in thread
From: Paul E. McKenney @ 2021-09-29 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 05:17:23PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> One series in two parts; the first part (patches 1-5) should be in reasonable
> shape and ought to fix the original issues that got this whole thing started.
> 
> The second part (patches 6-11) are still marked RFC and do scary things to try
> and make NOHZ_FULL work better.
> 
> Please consider..
> 
> PS. Paul, I did break RCU a bit and I'm not sure about the best way to put
> it back together, please see patch 8 for details.

Hey, wait!!!  Breaking RCU is -my- job!!!  ;-)

							Thanx, Paul


* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 15:17 ` [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking Peter Zijlstra
@ 2021-09-29 18:37   ` Paul E. McKenney
  2021-09-29 19:09     ` Peter Zijlstra
                       ` (2 more replies)
  2021-09-29 18:54   ` Peter Zijlstra
  1 sibling, 3 replies; 51+ messages in thread
From: Paul E. McKenney @ 2021-09-29 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 05:17:31PM +0200, Peter Zijlstra wrote:
> XXX I'm pretty sure I broke task-trace-rcu.
> XXX trace_rcu_*() now gets an unconditional 0
> 
> Other than that, it seems like a fairly straight-forward replacement
> of the RCU count with the context_tracking count.
> 
> Using context-tracking for this avoids having two (expensive) atomic
> ops on the entry paths where one will do.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

A few questions and exclamations interspersed.

> ---
>  include/linux/context_tracking.h |    6 ++
>  kernel/context_tracking.c        |   14 +++++
>  kernel/rcu/tree.c                |  101 +++++----------------------------------
>  kernel/rcu/tree.h                |    1 
>  4 files changed, 33 insertions(+), 89 deletions(-)
> 
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -50,6 +50,9 @@ extern void context_tracking_user_exit(v
>  
>  extern bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work);
>  
> +extern void context_tracking_idle(void);
> +extern void context_tracking_online(void);
> +
>  static inline void ct_user_enter(void)
>  {
>  	if (context_tracking_enabled())
> @@ -162,6 +165,9 @@ static __always_inline unsigned int __co
>  	return 0;
>  }
>  
> +static inline void context_tracking_idle(void) { }
> +static inline void context_tracking_online(void) { }
> +
>  #endif /* !CONFIG_CONTEXT_TRACKING */
>  
>  static __always_inline bool context_tracking_cpu_in_user(unsigned int cpu)
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -281,6 +281,20 @@ void noinstr __context_tracking_nmi_exit
>  		ct_seq_nmi_exit(raw_cpu_ptr(&context_tracking));
>  }
>  
> +void context_tracking_online(void)
> +{
> +	struct context_tracking *ct = raw_cpu_ptr(&context_tracking);
> +	unsigned int seq = atomic_read(&ct->seq);
> +
> +	if (__context_tracking_seq_in_user(seq))
> +		atomic_add_return(CT_SEQ - CT_SEQ_USER, &ct->seq);

What if the CPU went offline marked idle instead of user?  (Yes, that
no longer happens, but checking my understanding.)

The CT_SEQ_WORK bit means neither idle nor nohz_full user, correct?

So let's see if I intuited the decoder ring, where "kernel" means that
portion of the kernel that is non-noinstr...

CT_SEQ_WORK	CT_SEQ_NMI	CT_SEQ_USER	Description?
0		0		0		Idle or non-nohz_full user
0		0		1		nohz_full user
0		1		0		NMI from 0,0,0
0		1		1		NMI from 0,0,1
1		0		0		Non-idle kernel
1		0		1		Cannot happen?
1		1		0		NMI from 1,0,1
1		1		1		NMI from cannot happen

And of course if a state cannot happen, Murphy says that you will take
an NMI in that state.

> +}
> +
> +void context_tracking_idle(void)
> +{
> +	atomic_add_return(CT_SEQ, &raw_cpu_ptr(&context_tracking)->seq);

This is presumably a momentary idle.

And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
Of course, RCU needs it unconditionally.  (There appear to be at least
parts of it that are unconditionally available, but I figured that I
should ask.  Especially given the !CONFIG_CONTEXT_TRACKING definition
of the __context_tracking_cpu_seq() function.)

> +}
> +
>  void __init context_tracking_cpu_set(int cpu)
>  {
>  	static __initdata bool initialized = false;
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -62,6 +62,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/mm.h>
>  #include <linux/kasan.h>
> +#include <linux/context_tracking.h>
>  #include "../time/tick-internal.h"
>  
>  #include "tree.h"
> @@ -77,7 +78,6 @@
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>  	.dynticks_nesting = 1,
>  	.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
> -	.dynticks = ATOMIC_INIT(1),
>  #ifdef CONFIG_RCU_NOCB_CPU
>  	.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
>  #endif
> @@ -252,56 +252,6 @@ void rcu_softirq_qs(void)
>  }
>  
>  /*
> - * Increment the current CPU's rcu_data structure's ->dynticks field
> - * with ordering.  Return the new value.
> - */
> -static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
> -{
> -	return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
> -}
> -
> -/*
> - * Record entry into an extended quiescent state.  This is only to be
> - * called when not already in an extended quiescent state, that is,
> - * RCU is watching prior to the call to this function and is no longer
> - * watching upon return.
> - */
> -static noinstr void rcu_dynticks_eqs_enter(void)
> -{
> -	int seq;
> -
> -	/*
> -	 * CPUs seeing atomic_add_return() must see prior RCU read-side
> -	 * critical sections, and we also must force ordering with the
> -	 * next idle sojourn.
> -	 */
> -	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
> -	seq = rcu_dynticks_inc(1);
> -	// RCU is no longer watching.  Better be in extended quiescent state!
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
> -}
> -
> -/*
> - * Record exit from an extended quiescent state.  This is only to be
> - * called from an extended quiescent state, that is, RCU is not watching
> - * prior to the call to this function and is watching upon return.
> - */
> -static noinstr void rcu_dynticks_eqs_exit(void)
> -{
> -	int seq;
> -
> -	/*
> -	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
> -	 * and we also must force ordering with the next RCU read-side
> -	 * critical section.
> -	 */
> -	seq = rcu_dynticks_inc(1);
> -	// RCU is now watching.  Better not be in an extended quiescent state!
> -	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
> -}
> -
> -/*
>   * Reset the current CPU's ->dynticks counter to indicate that the
>   * newly onlined CPU is no longer in an extended quiescent state.
>   * This will either leave the counter unchanged, or increment it
> @@ -313,11 +263,7 @@ static noinstr void rcu_dynticks_eqs_exi
>   */
>  static void rcu_dynticks_eqs_online(void)
>  {
> -	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> -
> -	if (atomic_read(&rdp->dynticks) & 0x1)
> -		return;
> -	rcu_dynticks_inc(1);
> +	context_tracking_online();
>  }
>  
>  /*
> @@ -327,7 +273,7 @@ static void rcu_dynticks_eqs_online(void
>   */
>  static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
>  {
> -	return !(atomic_read(this_cpu_ptr(&rcu_data.dynticks)) & 0x1);
> +	return context_tracking_cpu_in_user(smp_processor_id());

This needs to return true in any type of extended quiescent state,
not just nohz_full user execution.

>  }
>  
>  /*
> @@ -337,7 +283,7 @@ static __always_inline bool rcu_dynticks
>  static int rcu_dynticks_snap(struct rcu_data *rdp)
>  {
>  	smp_mb();  // Fundamental RCU ordering guarantee.
> -	return atomic_read_acquire(&rdp->dynticks);
> +	return __context_tracking_cpu_seq(rdp->cpu);

This will most definitely break RCU in !CONFIG_CONTEXT_TRACKING kernels.
It will always return zero.  Albeit fully ordered.  Might this be
why RCU Tasks Trace is unhappy?

Similar issues for later uses of __context_tracking_cpu_seq().

>  }
>  
>  /*
> @@ -346,7 +292,7 @@ static int rcu_dynticks_snap(struct rcu_
>   */
>  static bool rcu_dynticks_in_eqs(int snap)
>  {
> -	return !(snap & 0x1);
> +	return __context_tracking_seq_in_user(snap);
>  }
>  
>  /* Return true if the specified CPU is currently idle from an RCU viewpoint.  */
> @@ -377,7 +323,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
>  	int snap;
>  
>  	// If not quiescent, force back to earlier extended quiescent state.
> -	snap = atomic_read(&rdp->dynticks) & ~0x1;
> +	snap = __context_tracking_cpu_seq(rdp->cpu) & ~0x7;
>  
>  	smp_rmb(); // Order ->dynticks and *vp reads.
>  	if (READ_ONCE(*vp))
> @@ -385,7 +331,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
>  	smp_rmb(); // Order *vp read and ->dynticks re-read.
>  
>  	// If still in the same extended quiescent state, we are good!
> -	return snap == atomic_read(&rdp->dynticks);
> +	return snap == __context_tracking_cpu_seq(rdp->cpu);
>  }
>  
>  /*
> @@ -401,12 +347,8 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
>   */
>  notrace void rcu_momentary_dyntick_idle(void)
>  {
> -	int seq;
> -
>  	raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
> -	seq = rcu_dynticks_inc(2);
> -	/* It is illegal to call this from idle state. */
> -	WARN_ON_ONCE(!(seq & 0x1));
> +	context_tracking_idle();
>  	rcu_preempt_deferred_qs(current);
>  }
>  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> @@ -622,18 +564,15 @@ static noinstr void rcu_eqs_enter(bool u
>  
>  	lockdep_assert_irqs_disabled();
>  	instrumentation_begin();
> -	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
> +	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, 0);

Why not get the value from __context_tracking_cpu_seq()?

Ditto for similar changes later on.

>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	rcu_prepare_for_idle();
>  	rcu_preempt_deferred_qs(current);
>  
> -	// instrumentation for the noinstr rcu_dynticks_eqs_enter()
> -	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
>  
>  	instrumentation_end();
>  	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
>  	// RCU is watching here ...
> -	rcu_dynticks_eqs_enter();
>  	// ... but is no longer watching here.

OK, at the very least, these two comments are now nonsense.  ;-)

Is the intent to drive this from context tracking, so that these functions
only push the various required RCU state transitions?

Same for the similar changes below.

Anyway, unless I am missing something subtle (quite possible given
that I haven't looked closely at context tracking), you did succeed in
breaking RCU!  ;-)

>  	rcu_dynticks_task_enter();
>  }
> @@ -756,8 +695,7 @@ noinstr void rcu_nmi_exit(void)
>  	 * leave it in non-RCU-idle state.
>  	 */
>  	if (rdp->dynticks_nmi_nesting != 1) {
> -		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
> -				  atomic_read(&rdp->dynticks));
> +		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, 0);
>  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
>  			   rdp->dynticks_nmi_nesting - 2);
>  		instrumentation_end();
> @@ -765,18 +703,15 @@ noinstr void rcu_nmi_exit(void)
>  	}
>  
>  	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> -	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
> +	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, 0);
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
>  
>  	if (!in_nmi())
>  		rcu_prepare_for_idle();
>  
> -	// instrumentation for the noinstr rcu_dynticks_eqs_enter()
> -	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
>  	instrumentation_end();
>  
>  	// RCU is watching here ...
> -	rcu_dynticks_eqs_enter();
>  	// ... but is no longer watching here.
>  
>  	if (!in_nmi())
> @@ -865,15 +800,11 @@ static void noinstr rcu_eqs_exit(bool us
>  	}
>  	rcu_dynticks_task_exit();
>  	// RCU is not watching here ...
> -	rcu_dynticks_eqs_exit();
>  	// ... but is watching here.
>  	instrumentation_begin();
>  
> -	// instrumentation for the noinstr rcu_dynticks_eqs_exit()
> -	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
> -
>  	rcu_cleanup_after_idle();
> -	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
> +	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, 0);
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	WRITE_ONCE(rdp->dynticks_nesting, 1);
>  	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
> @@ -1011,7 +942,6 @@ noinstr void rcu_nmi_enter(void)
>  			rcu_dynticks_task_exit();
>  
>  		// RCU is not watching here ...
> -		rcu_dynticks_eqs_exit();
>  		// ... but is watching here.
>  
>  		if (!in_nmi()) {
> @@ -1021,11 +951,6 @@ noinstr void rcu_nmi_enter(void)
>  		}
>  
>  		instrumentation_begin();
> -		// instrumentation for the noinstr rcu_dynticks_curr_cpu_in_eqs()
> -		instrument_atomic_read(&rdp->dynticks, sizeof(rdp->dynticks));
> -		// instrumentation for the noinstr rcu_dynticks_eqs_exit()
> -		instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
> -
>  		incby = 1;
>  	} else if (!in_nmi()) {
>  		instrumentation_begin();
> @@ -1036,7 +961,7 @@ noinstr void rcu_nmi_enter(void)
>  
>  	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
>  			  rdp->dynticks_nmi_nesting,
> -			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
> +			  rdp->dynticks_nmi_nesting + incby, 0);
>  	instrumentation_end();
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
>  		   rdp->dynticks_nmi_nesting + incby);
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -184,7 +184,6 @@ struct rcu_data {
>  	int dynticks_snap;		/* Per-GP tracking for dynticks. */
>  	long dynticks_nesting;		/* Track process nesting level. */
>  	long dynticks_nmi_nesting;	/* Track irq/NMI nesting level. */
> -	atomic_t dynticks;		/* Even value for idle, else odd. */
>  	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
>  	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
>  	bool rcu_forced_tick;		/* Forced tick to provide QS. */
> 
> 
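The snapshot-and-recheck pattern used by rcu_dynticks_zero_in_eqs() in the hunk above can be sketched in userspace C11. The `~0x7` state mask follows the quoted hunk; the rest is illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint cpu_seq;	/* stand-in for the per-CPU seq counter */

/* Snapshot the counter with the low state/work bits masked off. */
static unsigned int snap_seq(void)
{
	return atomic_load(&cpu_seq) & ~0x7u;
}

/*
 * True if the CPU never left the extended quiescent state it was in
 * when the snapshot was taken: any EQS exit bumps the sequence part
 * (bits 3 and up), so the masked values would no longer match.
 */
static bool still_in_same_eqs(unsigned int snap)
{
	return snap == (atomic_load(&cpu_seq) & ~0x7u);
}
```

In the real code, memory barriers order the `*vp` read between the two counter reads; the model only shows the counter-comparison logic.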


* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 15:17 ` [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking Peter Zijlstra
  2021-09-29 18:37   ` Paul E. McKenney
@ 2021-09-29 18:54   ` Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 18:54 UTC (permalink / raw)
  To: gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Wed, Sep 29, 2021 at 05:17:31PM +0200, Peter Zijlstra wrote:
> XXX I'm pretty sure I broke task-trace-rcu.

> -static noinstr void rcu_dynticks_eqs_enter(void)
> -{
> -	int seq;
> -
> -	/*
> -	 * CPUs seeing atomic_add_return() must see prior RCU read-side
> -	 * critical sections, and we also must force ordering with the
> -	 * next idle sojourn.
> -	 */
> -	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
> -	seq = rcu_dynticks_inc(1);
> -	// RCU is no longer watching.  Better be in extended quiescent state!
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
> -}

> -static noinstr void rcu_dynticks_eqs_exit(void)
> -{
> -	int seq;
> -
> -	/*
> -	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
> -	 * and we also must force ordering with the next RCU read-side
> -	 * critical section.
> -	 */
> -	seq = rcu_dynticks_inc(1);
> -	// RCU is now watching.  Better not be in an extended quiescent state!
> -	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
> -}

So specifically rcu_dynticks_task_trace_{enter,exit}() are now orphaned.
After this patch, nothing calls them.

However, looking at this again, we've got:

  __context_tracking_enter()
    rcu_user_enter()
      rcu_eqs_enter()
        rcu_dynticks_eqs_enter()
	  rcu_dynticks_task_trace_enter()
	  rcu_dynticks_inc();
	rcu_dynticks_task_enter();

    ct_seq_user_enter()
      atomic_add_return()

and on the other end:

  __context_tracking_exit()
    ct_seq_user_exit()
      atomic_add_return()

    rcu_user_exit()
      rcu_esq_exit()
        rcu_dynticks_task_exit()
	rcu_dynticks_eqs_exit()
	  rcu_dynticks_inc()
	  rcu_dynticks_task_trace_exit()

And since we want to replace dynticks_inc() with ct_seq_*() the
rcu_dynticks_task_{enter,exit}() ought to be pulled before that..

Instead I orphaned rcu_dynticks_task_trace_{enter,exit}() which should
more or less stay where they are.

I seem to have confused the two :/



* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 18:37   ` Paul E. McKenney
@ 2021-09-29 19:09     ` Peter Zijlstra
  2021-09-29 19:11     ` Peter Zijlstra
  2021-09-29 19:13     ` Peter Zijlstra
  2 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 19:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 29, 2021 at 05:17:31PM +0200, Peter Zijlstra wrote:
> The CT_SEQ_WORK bit means neither idle nor nohz_full user, correct?

That bit is indeed independent, it can get set remotely by cmpxchg when
in user/eqs state and will be tested and (eventually) cleared when
leaving user/eqs state.

> So let's see if I intuited the decoder ring, where "kernel" means that
> portion of the kernel that is non-noinstr...
> 
> CT_SEQ_WORK	CT_SEQ_NMI	CT_SEQ_USER	Description?
> 0		0		0		Idle or non-nohz_full user
> 0		0		1		nohz_full user
> 0		1		0		NMI from 0,0,0
> 0		1		1		NMI from 0,0,1
> 1		0		0		Non-idle kernel
> 1		0		1		Cannot happen?
> 1		1		0		NMI from 1,0,1
> 1		1		1		NMI from cannot happen
> 
> And of course if a state cannot happen, Murphy says that you will take
> an NMI in that state.

Urgh, you have the bits the 'wrong' way around :-)

MSB     3210
|-------||||

Where [MSB-3] is the actual sequence number, [21] is the state and [0]
is the work-pending bit.

The table for [21] is like:

USER NMI
 0    0		kernel
 0    1		kernel took nmi
 1    0		user
 1    1		user took nmi

So effectively EQS is 10 only, the other 3 states are kernel.
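The layout described above can be captured as a toy decoder. The bit positions are taken from the description in this mail and are illustrative, not the kernel's definitions:

```c
#include <stdbool.h>

#define B_WORK 0x1u	/* bit 0: work pending */
#define B_NMI  0x2u	/* bit 1: state */
#define B_USER 0x4u	/* bit 2: state */
#define B_SEQ  0x8u	/* bit 3 and up: the actual sequence number */

/* Per the table: only USER=1, NMI=0 is an extended quiescent state. */
static bool seq_in_eqs(unsigned int seq)
{
	return (seq & (B_USER | B_NMI)) == B_USER;
}

/* A momentary-idle style bump: advance the sequence, keep the state. */
static unsigned int seq_bump(unsigned int seq)
{
	return seq + B_SEQ;
}
```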



* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 18:37   ` Paul E. McKenney
  2021-09-29 19:09     ` Peter Zijlstra
@ 2021-09-29 19:11     ` Peter Zijlstra
  2021-09-29 19:13     ` Peter Zijlstra
  2 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 19:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:

> > +void context_tracking_idle(void)
> > +{
> > +	atomic_add_return(CT_SEQ, &raw_cpu_ptr(&context_tracking)->seq);
> 
> This is presumably a momentary idle.

> >  notrace void rcu_momentary_dyntick_idle(void)
> >  {
> > -	int seq;
> > -
> >  	raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
> > -	seq = rcu_dynticks_inc(2);
> > -	/* It is illegal to call this from idle state. */
> > -	WARN_ON_ONCE(!(seq & 0x1));
> > +	context_tracking_idle();
> >  	rcu_preempt_deferred_qs(current);
> >  }

It's whatever that is. It increments the actual sequence count without
modifying the state.


* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 18:37   ` Paul E. McKenney
  2021-09-29 19:09     ` Peter Zijlstra
  2021-09-29 19:11     ` Peter Zijlstra
@ 2021-09-29 19:13     ` Peter Zijlstra
  2021-09-29 19:24       ` Peter Zijlstra
  2 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 19:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:

> And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
> Of course, RCU needs it unconditionally.  (There appear to be at least
> parts of it that are unconditionally available, but I figured that I
> should ask.  Especially given the !CONFIG_CONTEXT_TRACKING definition
> of the __context_tracking_cpu_seq() function.)

For !CONFIG_CONTEXT_TRACKING it goes *poof*.

Since the thing was called dynticks, I presumed it was actually dynticks
only, silly me (also, I didn't see any obvious !context_tracking usage
of it; I'll go audit it more carefully).


* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 19:13     ` Peter Zijlstra
@ 2021-09-29 19:24       ` Peter Zijlstra
  2021-09-29 19:45         ` Paul E. McKenney
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-09-29 19:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 09:13:26PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> 
> > And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
> > Of course, RCU needs it unconditionally.  (There appear to be at least
> > parts of it that are unconditionally available, but I figured that I
> > should ask.  Especially given the !CONFIG_CONTEXT_TRACKING definition
> > of the __context_tracking_cpu_seq() function.)
> 
> For !CONFIG_CONTEXT_TRACKING it goes *poof*.
> 
> Since the thing was called dynticks, I presumed it was actually dynticks
> only, silly me (also, I didn't see any obvious !context_tracking usage
> of it; I'll go audit it more carefully).

Oh argh, it does idle too... damn. And I don't suppose having 2 counters
is going to be nice :/

I'll go back to thinking about this.


* Re: [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking
  2021-09-29 19:24       ` Peter Zijlstra
@ 2021-09-29 19:45         ` Paul E. McKenney
  0 siblings, 0 replies; 51+ messages in thread
From: Paul E. McKenney @ 2021-09-29 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, rostedt, x86

On Wed, Sep 29, 2021 at 09:24:31PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 29, 2021 at 09:13:26PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> > 
> > > And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
> > > Of course, RCU needs it unconditionally.  (There appear to be at least
> > > parts of it that are unconditionally available, but I figured that I
> > > should ask.  Especially given the !CONFIG_CONTEXT_TRACKING definition
> > > of the __context_tracking_cpu_seq() function.)
> > 
> > For !CONFIG_CONTEXT_TRACKING it goes *poof*.
> > 
> > Since the thing was called dynticks, I presumed it was actually dynticks
> > only, silly me (also, I didn't see any obvious !context_tracking usage
> > of it; I'll go audit it more carefully).
> 
> Oh argh, it does idle too... damn. And I don't suppose having 2 counters
> is going to be nice :/
> 
> I'll go back to thinking about this.

Glad I could help?  For some definition of "help"?  ;-)

							Thanx, Paul


* Re: [PATCH v2 03/11] sched,livepatch: Use task_call_func()
  2021-09-29 15:17 ` [PATCH v2 03/11] sched,livepatch: Use task_call_func() Peter Zijlstra
@ 2021-10-05 11:40   ` Petr Mladek
  2021-10-05 14:03     ` Peter Zijlstra
  2021-10-06  8:59   ` Miroslav Benes
  1 sibling, 1 reply; 51+ messages in thread
From: Petr Mladek @ 2021-10-05 11:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed 2021-09-29 17:17:26, Peter Zijlstra wrote:
> Instead of frobbing around with scheduler internals, use the shiny new
> task_call_func() interface.
> 
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -274,6 +266,22 @@ static int klp_check_stack(struct task_s
>  	return 0;
>  }
>  
> +static int klp_check_and_switch_task(struct task_struct *task, void *arg)
> +{
> +	int ret;
> +
> +	if (task_curr(task))

This must be

	if (task_curr(task) && task != current)

, otherwise the task is not able to migrate itself. The condition was
lost when reshuffling the original code, see below.

JFYI, I missed it during review. I am actually surprised that the
process could check its own stack reliably. But it seems to work.

I found the problem when "busy target module" selftest failed.
It was not able to livepatch the workqueue worker that was
processing klp_transition_work_fn().


> +		return -EBUSY;
> +
> +	ret = klp_check_stack(task, arg);
> +	if (ret)
> +		return ret;
> +
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +	task->patch_state = klp_target_state;
> +	return 0;
> +}
> +
>  /*
>   * Try to safely switch a task to the target patch state.  If it's currently
>   * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
> @@ -305,36 +308,31 @@ static bool klp_try_switch_task(struct t
>  	 * functions.  If all goes well, switch the task to the target patch
>  	 * state.
>  	 */
> -	rq = task_rq_lock(task, &flags);
> +	ret = task_call_func(task, klp_check_and_switch_task, &old_name);

It looks correct. JFYI, this is why:

The logic seems to be exactly the same, except for the one fallout
mentioned above. So the only problem might be races.

The only important thing is that the task must not be running on any CPU
when klp_check_stack(task, arg) is called. In other words, the stack
must stay the same when being checked.

The original code prevented races by taking task_rq_lock().
And task_call_func() is slightly more relaxed but it looks safe enough:

  + it still takes rq lock when the task is in runnable state.
  + it always takes p->pi_lock that prevents moving the task
    into runnable state by try_to_wake_up().


> +	switch (ret) {
> +	case 0:		/* success */
> +		break;
>  
> -	if (task_running(rq, task) && task != current) {

This is the original code that checked (task != current).

> -		snprintf(err_buf, STACK_ERR_BUF_SIZE,
> -			 "%s: %s:%d is running\n", __func__, task->comm,
> -			 task->pid);
> -		goto done;

With the added (task != current) check:

Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-09-29 15:17 ` [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle() Peter Zijlstra
@ 2021-10-05 12:00   ` Petr Mladek
  2021-10-06  9:16   ` Miroslav Benes
  2021-10-13 19:37   ` Arnd Bergmann
  2 siblings, 0 replies; 51+ messages in thread
From: Petr Mladek @ 2021-10-05 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed 2021-09-29 17:17:28, Peter Zijlstra wrote:
> Make sure to prod idle CPUs so they call klp_update_patch_state().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>

Thanks a lot for updating this API for livepatching.

Best Regards,
Petr


* Re: [PATCH v2 03/11] sched,livepatch: Use task_call_func()
  2021-10-05 11:40   ` Petr Mladek
@ 2021-10-05 14:03     ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-05 14:03 UTC (permalink / raw)
  To: Petr Mladek
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Tue, Oct 05, 2021 at 01:40:24PM +0200, Petr Mladek wrote:
> On Wed 2021-09-29 17:17:26, Peter Zijlstra wrote:
> > Instead of frobbing around with scheduler internals, use the shiny new
> > task_call_func() interface.
> > 
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -274,6 +266,22 @@ static int klp_check_stack(struct task_s
> >  	return 0;
> >  }
> >  
> > +static int klp_check_and_switch_task(struct task_struct *task, void *arg)
> > +{
> > +	int ret;
> > +
> > +	if (task_curr(task))
> 
> This must be
> 
> 	if (task_curr(task) && task != current)
> 
> , otherwise the task is not able to migrate itself. The condition was
> lost when reshuffling the original code, see below.

Urgh, yeah, I misread that and figured task_curr() should already capture
current, but the extra clause excludes current :/

> JFYI, I have missed it during review. I am actually surprised that the
> process could check its own stack reliably. But it seems to work.

Ah, unwinding yourself is actually the only sane option ;-)

> > -	rq = task_rq_lock(task, &flags);
> > +	ret = task_call_func(task, klp_check_and_switch_task, &old_name);
> 
> It looks correct. JFYI, this is why:
> 
> The logic seems to be exactly the same, except for the one fallout
> mentioned above. So the only problem might be races.
> 
> The only important thing is that the task must not be running on any CPU
> when klp_check_stack(task, arg) is called. In other words, the stack
> must stay the same while being checked.
> 
> The original code prevented races by taking task_rq_lock().
> And task_call_func() is slightly more relaxed but it looks safe enough:
> 
>   + it still takes rq lock when the task is in runnable state.
>   + it always takes p->pi_lock that prevents moving the task
>     into runnable state by try_to_wake_up().

Correct, the new task_call_func() is trying hard to not take rq->lock,
but should be effectively identical to task_rq_lock().

> With the added (task != current) check:

Done

> Reviewed-by: Petr Mladek <pmladek@suse.com>
> Tested-by: Petr Mladek <pmladek@suse.com>

Thanks!


* Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL
  2021-09-29 15:17 ` [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL Peter Zijlstra
@ 2021-10-06  8:12   ` Petr Mladek
  2021-10-06  9:04     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Petr Mladek @ 2021-10-06  8:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed 2021-09-29 17:17:32, Peter Zijlstra wrote:
> Using the new context_tracking infrastructure, avoid disturbing
> userspace tasks when context tracking is enabled.
> 
> When context_tracking_set_cpu_work() returns true, we have the
> guarantee that klp_update_patch_state() is called from noinstr code
> before it runs normal kernel code. This covers
> syscall/exceptions/interrupts and NMI entry.

This patch touches the most tricky (lockless) parts of the livepatch code.
I always have to refresh my head about all the dependencies.

Sigh, I guess that the livepatch code looks overcomplicated to you.

The main problem is that we want to migrate tasks only when they
are not inside any livepatched function. It allows making semantic
changes, which is needed by some critical security fixes.


> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -55,15 +56,13 @@ static noinstr void ct_exit_user_work(struct
>  {
>  	unsigned int work = arch_atomic_read(&ct->work);
>  
> -#if 0
> -	if (work & CT_WORK_n) {
> +	if (work & CT_WORK_KLP) {
>  		/* NMI happens here and must still do/finish CT_WORK_n */
> -		do_work_n();
> +		__klp_update_patch_state(current);
>  
>  		smp_mb__before_atomic();
> -		arch_atomic_andnot(CT_WORK_n, &ct->work);
> +		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
>  	}
> -#endif
>  
>  	smp_mb__before_atomic();
>  	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -153,6 +154,11 @@ void klp_cancel_transition(void)
>  	klp_complete_transition();
>  }
>  
> +noinstr void __klp_update_patch_state(struct task_struct *task)
> +{
> +	task->patch_state = READ_ONCE(klp_target_state);
> +}
> +
>  /*
>   * Switch the patched state of the task to the set of functions in the target
>   * patch state.
> @@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_
>  	 *    of func->transition, if klp_ftrace_handler() is called later on
>  	 *    the same CPU.  See __klp_disable_patch().
>  	 */
> -	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> +	if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) {

This would require smp_rmb() here. It will make sure that we will
read the right @klp_target_state when TIF_PATCH_PENDING is set.

, where @klp_target_state is set in klp_init_transition()
  and TIF_PATCH_PENDING is set in klp_start_transition()

There are actually two related smp_wmp() barriers between these two
assignments in:

	1st in klp_init_transition()
	2nd in __klp_enable_patch()

One would be enough for klp_update_patch_state(). But we need
both for klp_ftrace_handler(), see the smp_rmb() there.
In particular, they synchronize:

   + ops->func_stack vs.
   + func->transition vs.
   + current->patch_state


>  		task->patch_state = READ_ONCE(klp_target_state);

Note that smp_wmb() is not needed here because
klp_complete_transition() calls klp_synchronize_transition()
aka synchronize_rcu() before clearing klp_target_state.
This is why the original code worked.


> +		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +	}
>  
>  	preempt_enable_notrace();
>  }
> @@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str
>  {
>  	int ret;
>  
> -	if (task_curr(task))
> +	if (task_curr(task)) {
> +		/*
> +		 * This only succeeds when the task is in NOHZ_FULL user
> +		 * mode, the true return value guarantees any kernel entry
> +		 * will call klp_update_patch_state().
> +		 *
> +		 * XXX: ideally we'd simply return 0 here and leave
> +		 * TIF_PATCH_PENDING alone, to be fixed up by
> +		 * klp_update_patch_state(), except livepatching goes wobbly
> +		 * with 'pending' TIF bits on.
> +		 */
> +		if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP))
> +			goto clear;

If I get it correctly, this will clear TIF_PATCH_PENDING immediately
but task->patch_state = READ_ONCE(klp_target_state) will be
done later by ct_exit_user_work().

This is a bit problematic:

  1. The global @klp_target_state is set to KLP_UNDEFINED when all
     processes have TIF_PATCH_PENDING cleared. This is actually
     still fine because func->transition is cleared as well.
     As a result, current->patch_state is ignored in klp_ftrace_handler.

  2. The real problem happens when another livepatch is enabled.
     The global @klp_target_state is set to new value and
     func->transition is set again. In this case, the delayed
     ct_exit_user_work() might assign wrong value that might
     really be used by klp_ftrace_handler().


IMHO, the original solution from v1 was better. We only needed to
be careful when updating task->patch_state and clearing
TIF_PATCH_PENDING to avoid the race.

The following might work:

static int klp_check_and_switch_task(struct task_struct *task, void *arg)
{
	int ret;

	/*
	 * Stack is reliable only when the task is not running on any CPU,
	 * except for the task running this code.
	 */
	if (task_curr(task) && task != current) {
		/*
		 * This only succeeds when the task is in NOHZ_FULL user
		 * mode. Such a task might be migrated immediately. We
		 * only need to be careful to set task->patch_state before
		 * clearing TIF_PATCH_PENDING so that the task migrates
		 * itself when entering the kernel in the meantime.
		 */
		if (is_ct_user(task)) {
			klp_update_patch_state(task);
			return 0;
		}

		return -EBUSY;
	}

	ret = klp_check_stack(task, arg);
	if (ret)
		return ret;

	/*
	 * The task is neither running on any CPU nor can it start
	 * running. As a result, the ordering is not important and
	 * no barrier is needed.
	 */
	task->patch_state = klp_target_state;
	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);

	return 0;
}

, where is_ct_user(task) would return true when task is running in
CONTEXT_USER. If I get the context_tracking API correctly then
it might be implemented the following way:


#ifdef CONFIG_CONTEXT_TRACKING

/*
 * XXX: The value is reliable depending on the context where this is called.
 * At least migrating between CPUs should get prevented.
 */
static __always_inline bool is_ct_user(struct task_struct *task)
{
	int seq;

	if (!context_tracking_enabled())
		return false;

	seq = __context_tracking_cpu_seq(task_cpu(task));
	return __context_tracking_seq_in_user(seq);
}

#else

static __always_inline bool is_ct_user(struct task_struct *task)
{
	return false;
}

#endif /* CONFIG_CONTEXT_TRACKING */

Best Regards,
Petr

>  		return -EBUSY;
> +	}
>  
>  	ret = klp_check_stack(task, arg);
>  	if (ret)
>  		return ret;
>  
> -	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	task->patch_state = klp_target_state;
> +clear:
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	return 0;
>  }


* Re: [PATCH v2 03/11] sched,livepatch: Use task_call_func()
  2021-09-29 15:17 ` [PATCH v2 03/11] sched,livepatch: Use task_call_func() Peter Zijlstra
  2021-10-05 11:40   ` Petr Mladek
@ 2021-10-06  8:59   ` Miroslav Benes
  1 sibling, 0 replies; 51+ messages in thread
From: Miroslav Benes @ 2021-10-06  8:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, pmladek, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed, 29 Sep 2021, Peter Zijlstra wrote:

> Instead of frobbing around with scheduler internals, use the shiny new
> task_call_func() interface.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

This looks really nice. With the added check for "task != current" 
that Petr pointed out

Acked-by: Miroslav Benes <mbenes@suse.cz>

M


* Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL
  2021-10-06  8:12   ` Petr Mladek
@ 2021-10-06  9:04     ` Peter Zijlstra
  2021-10-06 10:29       ` Petr Mladek
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-06  9:04 UTC (permalink / raw)
  To: Petr Mladek
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed, Oct 06, 2021 at 10:12:17AM +0200, Petr Mladek wrote:

> > @@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_
> >  	 *    of func->transition, if klp_ftrace_handler() is called later on
> >  	 *    the same CPU.  See __klp_disable_patch().
> >  	 */
> > -	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> > +	if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) {
> 
> This would require smp_rmb() here. It will make sure that we will
> read the right @klp_target_state when TIF_PATCH_PENDING is set.

Bah, yes.

> >  		task->patch_state = READ_ONCE(klp_target_state);
> 
> Note that smp_wmb() is not needed here because
> klp_complete_transition() calls klp_synchronize_transition()
> aka synchronize_rcu() before clearing klp_target_state.
> This is why the original code worked.
> 
> 
> > +		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> > +	}
> >  
> >  	preempt_enable_notrace();
> >  }
> > @@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str
> >  {
> >  	int ret;
> >  
> > -	if (task_curr(task))
> > +	if (task_curr(task)) {
> > +		/*
> > +		 * This only succeeds when the task is in NOHZ_FULL user
> > +		 * mode, the true return value guarantees any kernel entry
> > +		 * will call klp_update_patch_state().
> > +		 *
> > +		 * XXX: ideally we'd simply return 0 here and leave
> > +		 * TIF_PATCH_PENDING alone, to be fixed up by
> > +		 * klp_update_patch_state(), except livepatching goes wobbly
> > +		 * with 'pending' TIF bits on.
> > +		 */
> > +		if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP))
> > +			goto clear;
> 
> If I get it correctly, this will clear TIF_PATCH_PENDING immediately
> but task->patch_state = READ_ONCE(klp_target_state) will be
> done later by ct_exit_user_work().
> 
> This is a bit problematic:
> 
>   1. The global @klp_target_state is set to KLP_UNDEFINED when all
>      processes have TIF_PATCH_PENDING cleared. This is actually
>      still fine because func->transition is cleared as well.
>      As a result, current->patch_state is ignored in klp_ftrace_handler.
> 
>   2. The real problem happens when another livepatch is enabled.
>      The global @klp_target_state is set to new value and
>      func->transition is set again. In this case, the delayed
>      ct_exit_user_work() might assign wrong value that might
>      really be used by klp_ftrace_handler().

Urgghh.. totally missed that.

> IMHO, the original solution from v1 was better. We only needed to

It was also terribly broken in other 'fun' ways. See below.

> be careful when updating task->patch_state and clearing
> TIF_PATCH_PENDING to avoid the race.
> 
> The following might work:
> 
> static int klp_check_and_switch_task(struct task_struct *task, void *arg)
> {
> 	int ret;
> 
> 	/*
> 	 * Stack is reliable only when the task is not running on any CPU,
> 	 * except for the task running this code.
> 	 */
> 	if (task_curr(task) && task != current) {
> 		/*
> 		 * This only succeeds when the task is in NOHZ_FULL user
> 		 * mode. Such a task might be migrated immediately. We
> 		 * only need to be careful to set task->patch_state before
> 		 * clearing TIF_PATCH_PENDING so that the task migrates
> 		 * itself when entering the kernel in the meantime.
> 		 */
> 		if (is_ct_user(task)) {
> 			klp_update_patch_state(task);
> 			return 0;
> 		}
> 
> 		return -EBUSY;
> 	}
> 
> 	ret = klp_check_stack(task, arg);
> 	if (ret)
> 		return ret;
> 
> 	/*
> 	 * The task neither is running on any CPU and nor it can get
> 	 * running. As a result, the ordering is not important and
> 	 * barrier is not needed.
> 	 */
> 	task->patch_state = klp_target_state;
> 	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> 
> 	return 0;
> }
> 
> , where is_ct_user(task) would return true when task is running in
> CONTEXT_USER. If I get the context_tracking API correctly then
> it might be implemented the following way:

That's not sufficient, you need to tag the remote task with a ct_work
item to also runs klp_update_patch_state(), otherwise the remote CPU can
enter kernel space between checking is_ct_user() and doing
klp_update_patch_state():

	CPU0				CPU1

					<user>

	if (is_ct_user()) // true
					<kernel-entry>
					  // run some kernel code
	  klp_update_patch_state()
	  *WHOOPSIE*


So it needs to be something like:


	CPU0				CPU1

					<user>

	if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP))

					<kernel-entry>
	  klp_update_patch_state	  klp_update_patch_state()


So that CPU0 and CPU1 race to complete klp_update_patch_state() *before*
any regular (!noinstr) code gets run.

Which then means it needs to look something like:

noinstr void klp_update_patch_state(struct task_struct *task)
{
	struct thread_info *ti = task_thread_info(task);

	preempt_disable_notrace();
	if (arch_test_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags)) {
		/*
		 * Order loads of TIF_PATCH_PENDING vs klp_target_state.
		 * See klp_init_transition().
		 */
		smp_rmb();
		task->patch_state = __READ_ONCE(klp_target_state);
		/*
		 * Concurrent against self; must observe updated
		 * task->patch_state if !TIF_PATCH_PENDING.
		 */
		smp_mb__before_atomic();
		arch_clear_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags);
	} else {
		/*
		 * Concurrent against self, see smp_mb__before_atomic()
		 * above.
		 */
		smp_rmb();
	}
	preempt_enable_notrace();
}


* Re: [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-09-29 15:17 ` [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle() Peter Zijlstra
  2021-10-05 12:00   ` Petr Mladek
@ 2021-10-06  9:16   ` Miroslav Benes
  2021-10-07  9:18     ` Vasily Gorbik
  2021-10-13 19:37   ` Arnd Bergmann
  2 siblings, 1 reply; 51+ messages in thread
From: Miroslav Benes @ 2021-10-06  9:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, pmladek, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed, 29 Sep 2021, Peter Zijlstra wrote:

> Make sure to prod idle CPUs so they call klp_update_patch_state().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/livepatch/transition.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
>  	for_each_possible_cpu(cpu) {
>  		task = idle_task(cpu);
>  		if (cpu_online(cpu)) {
> -			if (!klp_try_switch_task(task))
> +			if (!klp_try_switch_task(task)) {
>  				complete = false;
> +				/* Make idle task go through the main loop. */
> +				wake_up_if_idle(cpu);
> +			}

Right, it should be enough.

Acked-by: Miroslav Benes <mbenes@suse.cz>

It would be nice to get Vasily's Tested-by tag on this one.

M


* Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL
  2021-10-06  9:04     ` Peter Zijlstra
@ 2021-10-06 10:29       ` Petr Mladek
  2021-10-06 11:41         ` Peter Zijlstra
  2021-10-06 11:48         ` Miroslav Benes
  0 siblings, 2 replies; 51+ messages in thread
From: Petr Mladek @ 2021-10-06 10:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed 2021-10-06 11:04:26, Peter Zijlstra wrote:
> On Wed, Oct 06, 2021 at 10:12:17AM +0200, Petr Mladek wrote:
> > IMHO, the original solution from v1 was better. We only needed to
> 
> It was also terribly broken in other 'fun' ways. See below.
> 
> > be careful when updating task->patch_state and clearing
> > TIF_PATCH_PENDING to avoid the race.
> > 
> > The following might work:
> > 
> > static int klp_check_and_switch_task(struct task_struct *task, void *arg)
> > {
> > 	int ret;
> > 
> > 	/*
> > 	 * Stack is reliable only when the task is not running on any CPU,
> > 	 * except for the task running this code.
> > 	 */
> > 	if (task_curr(task) && task != current) {
> > 		/*
> > 		 * This only succeeds when the task is in NOHZ_FULL user
> > 		 * mode. Such a task might be migrated immediately. We
> > 		 * only need to be careful to set task->patch_state before
> > 		 * clearing TIF_PATCH_PENDING so that the task migrates
> > 		 * itself when entering the kernel in the meantime.
> > 		 */
> > 		if (is_ct_user(task)) {
> > 			klp_update_patch_state(task);
> > 			return 0;
> > 		}
> > 
> > 		return -EBUSY;
> > 	}
> > 
> > 	ret = klp_check_stack(task, arg);
> > 	if (ret)
> > 		return ret;
> > 
> > 	/*
> > 	 * The task is neither running on any CPU nor can it start
> > 	 * running. As a result, the ordering is not important and
> > 	 * no barrier is needed.
> > 	 */
> > 	task->patch_state = klp_target_state;
> > 	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> > 
> > 	return 0;
> > }
> > 
> > , where is_ct_user(task) would return true when task is running in
> > CONTEXT_USER. If I get the context_tracking API correctly then
> > it might be implemented the following way:
> 
> That's not sufficient, you need to tag the remote task with a ct_work
> item to also runs klp_update_patch_state(), otherwise the remote CPU can
> enter kernel space between checking is_ct_user() and doing
> klp_update_patch_state():
> 
> 	CPU0				CPU1
> 
> 					<user>
> 
> 	if (is_ct_user()) // true
> 					<kernel-entry>
> 					  // run some kernel code
> 	  klp_update_patch_state()
> 	  *WHOOPSIE*
> 
> 
> So it needs to be something like:
> 
> 
> 	CPU0				CPU1
> 
> 					<user>
> 
> 	if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP))
> 
> 					<kernel-entry>
> 	  klp_update_patch_state	  klp_update_patch_state()
> 
> 
> So that CPU0 and CPU1 race to complete klp_update_patch_state() *before*
> any regular (!noinstr) code gets run.

Grr, you are right. I thought that we migrated the task already when
entering the kernel. But it seems that we do it only when leaving
the kernel in exit_to_user_mode_loop().


> Which then means it needs to look something like:
> 
> noinstr void klp_update_patch_state(struct task_struct *task)
> {
> 	struct thread_info *ti = task_thread_info(task);
> 
> 	preempt_disable_notrace();
> 	if (arch_test_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags)) {
> 		/*
> 		 * Order loads of TIF_PATCH_PENDING vs klp_target_state.
> 		 * See klp_init_transition().
> 		 */
> 		smp_rmb();
> 		task->patch_state = __READ_ONCE(klp_target_state);
> 		/*
> 		 * Concurrent against self; must observe updated
> 		 * task->patch_state if !TIF_PATCH_PENDING.
> 		 */
> 		smp_mb__before_atomic();

IMHO, smp_wmb() should be enough. We are here only when this
CPU set task->patch_state right above. So that CPU running
this code should see the correct task->patch_state.

The read barrier is needed only when @task is entering kernel and
does not see TIF_PATCH_PENDING. It is handled by smp_rmb() in
the "else" branch below.

It is possible that both CPUs see TIF_PATCH_PENDING and both
set task->patch_state. But it should not cause any harm
because they set the same value. Unless something really
crazy happens with the internal CPU busses and caches.


> 		arch_clear_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags);
> 	} else {
> 		/*
> 		 * Concurrent against self, see smp_mb__before_atomic()
> 		 * above.
> 		 */
> 		smp_rmb();

Yeah, this is the counter part against the above smp_wmb().

> 	}
> 	preempt_enable_notrace();
> }

Now, I am scared to increase my paranoia level and search for even more
possible races. I feel overwhelmed at the moment ;-)

Best Regards,
Petr


* Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL
  2021-10-06 10:29       ` Petr Mladek
@ 2021-10-06 11:41         ` Peter Zijlstra
  2021-10-06 11:48         ` Miroslav Benes
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-06 11:41 UTC (permalink / raw)
  To: Petr Mladek
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed, Oct 06, 2021 at 12:29:32PM +0200, Petr Mladek wrote:
> On Wed 2021-10-06 11:04:26, Peter Zijlstra wrote:

> > So it needs to be something like:
> > 
> > 
> > 	CPU0				CPU1
> > 
> > 					<user>
> > 
> > 	if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP))
> > 
> > 					<kernel-entry>
> > 	  klp_update_patch_state	  klp_update_patch_state()
> > 
> > 
> > So that CPU0 and CPU1 race to complete klp_update_patch_state() *before*
> > any regular (!noinstr) code gets run.
> 
> Grr, you are right. I thought that we migrated the task already when
> entering the kernel. But it seems that we do it only when leaving
> the kernel in exit_to_user_mode_loop().

Yep... :-)

> > Which then means it needs to look something like:
> > 
> > noinstr void klp_update_patch_state(struct task_struct *task)
> > {
> > 	struct thread_info *ti = task_thread_info(task);
> > 
> > 	preempt_disable_notrace();
> > 	if (arch_test_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags)) {
> > 		/*
> > 		 * Order loads of TIF_PATCH_PENDING vs klp_target_state.
> > 		 * See klp_init_transition().
> > 		 */
> > 		smp_rmb();
> > 		task->patch_state = __READ_ONCE(klp_target_state);
> > 		/*
> > 		 * Concurrent against self; must observe updated
> > 		 * task->patch_state if !TIF_PATCH_PENDING.
> > 		 */
> > 		smp_mb__before_atomic();
> 
> IMHO, smp_wmb() should be enough. We are here only when this
> CPU set task->patch_state right above. So that CPU running
> this code should see the correct task->patch_state.

Yes, I think smp_wmb() and smp_mb__before_atomic() are NOPs on all the
same architectures, so that might indeed be a better choice.

> The read barrier is needed only when @task is entering kernel and
> does not see TIF_PATCH_PENDING. It is handled by smp_rmb() in
> the "else" branch below.
> 
> It is possible that both CPUs see TIF_PATCH_PENDING and both
> set task->patch_state. But it should not cause any harm
> because they set the same value. Unless something really
> crazy happens with the internal CPU busses and caches.

Right, not our problem :-) Lots would be broken beyond repair in that
case.

> > 		arch_clear_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags);
> > 	} else {
> > 		/*
> > 		 * Concurrent against self, see smp_mb__before_atomic()
> > 		 * above.
> > 		 */
> > 		smp_rmb();
> 
> Yeah, this is the counter part against the above smp_wmb().
> 
> > 	}
> > 	preempt_enable_notrace();
> > }
> 
> Now, I am scared to increase my paranoia level and search for even more
> possible races. I feel overwhelmed at the moment ;-)

:-)

Anyway, I still need to figure out how to extract this context tracking
stuff from RCU and not make a giant mess of things, so until that
time....


* Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL
  2021-10-06 10:29       ` Petr Mladek
  2021-10-06 11:41         ` Peter Zijlstra
@ 2021-10-06 11:48         ` Miroslav Benes
  1 sibling, 0 replies; 51+ messages in thread
From: Miroslav Benes @ 2021-10-06 11:48 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, gor, jpoimboe, jikos, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

> > That's not sufficient, you need to tag the remote task with a ct_work
> > item to also runs klp_update_patch_state(), otherwise the remote CPU can
> > enter kernel space between checking is_ct_user() and doing
> > klp_update_patch_state():
> > 
> > 	CPU0				CPU1
> > 
> > 					<user>
> > 
> > 	if (is_ct_user()) // true
> > 					<kernel-entry>
> > 					  // run some kernel code
> > 	  klp_update_patch_state()
> > 	  *WHOOPSIE*
> > 
> > 
> > So it needs to be something like:
> > 
> > 
> > 	CPU0				CPU1
> > 
> > 					<user>
> > 
> > 	if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP))
> > 
> > 					<kernel-entry>
> > 	  klp_update_patch_state	  klp_update_patch_state()
> > 
> > 
> > So that CPU0 and CPU1 race to complete klp_update_patch_state() *before*
> > any regular (!noinstr) code gets run.
> 
> Grr, you are right. I thought that we migrated the task already when
> entering the kernel. But it seems that we do it only when leaving
> the kernel in exit_to_user_mode_loop().

That is correct. You are probably confused by the old kGraft
implementation, which also added the TIF to _TIF_WORK_SYSCALL_ENTRY so
that syscall_trace_enter() processed it too. But the upstream solution has
always switched tasks only on their exit to user mode, because it was 
deemed sufficient.

Btw, just for fun, the old kGraft in one of its first incarnations also
had the following horrible hack to "solve" exactly this problem.

+/*
+ * Tasks which are running in userspace after the patching has been started
+ * can immediately be marked as migrated to the new universe.
+ *
+ * If this function returns non-zero (i.e. also when error happens), the task
+ * needs to be migrated using kgraft lazy mechanism.
+ */
+static inline bool kgr_needs_lazy_migration(struct task_struct *p)
+{
+       unsigned long s[3];
+       struct stack_trace t = {
+               .nr_entries = 0,
+               .skip = 0,
+               .max_entries = 3,
+               .entries = s,
+       };
+
+       save_stack_trace_tsk(p, &t);
+
+       return t.nr_entries > 2;
+}

Miroslav


* Re: [RFC][PATCH v2 10/11] livepatch: Remove klp_synchronize_transition()
  2021-09-29 15:17 ` [RFC][PATCH v2 10/11] livepatch: Remove klp_synchronize_transition() Peter Zijlstra
@ 2021-10-06 12:30   ` Petr Mladek
  0 siblings, 0 replies; 51+ messages in thread
From: Petr Mladek @ 2021-10-06 12:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed 2021-09-29 17:17:33, Peter Zijlstra wrote:
> The whole premise is broken, any trace usage outside of RCU is a BUG
> and should be fixed. Notable all code prior (and a fair bit after)
> ct_user_exit() is noinstr and explicitly doesn't allow any tracing.

I see. Is the situation the same with idle threads? I see that some
functions called inside rcu_idle_enter()/exit() are still visible
to ftrace. For example, arch_cpu_idle() or tick_check_broadcast_expired().

That said, I am not sure if anyone would ever want to livepatch
this code. And there is still a workaround by livepatching
the callers.

Also, it should be easy to catch potential problems if we check
rcu_is_watching() in klp_ftrace_handler(). The most affected
scheduler and idle code paths will likely be exercised
even during a simple boot test.


> Use regular RCU, which has the benefit of actually respecting
> NOHZ_FULL.

Good to know.

After all, the patch looks good to me. I would just add something like:

--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -69,6 +69,12 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 	if (WARN_ON_ONCE(!func))
 		goto unlock;
 
+	if (!rcu_is_watching()) {
+		printk_deferred_once(KERN_WARNING
+				     "warning: called \"%s\" without RCU watching. The function is not guarded by the consistency model. Also beware when removing the livepatch module.\n",
+				     func->old_name);
+	}
+
 	/*
 	 * In the enable path, enforce the order of the ops->func_stack and
 	 * func->transition reads.  The corresponding write barrier is in

But we could do this in a separate patch later.
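
[Editorial sketch, not part of the original mail: the printk_deferred_once()
call in the diff above fires at most once per boot. Its warn-once shape can
be modelled in plain userspace C; `rcu_watching`, `warn_count`, and the
handler name below are stand-ins for illustration, not kernel API.]

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for rcu_is_watching(): whether RCU tracks the current CPU. */
static bool rcu_watching = false;

/* Counts emitted warnings so the once-only behaviour is observable. */
static int warn_count;

/* Userspace analogue of the guard suggested for klp_ftrace_handler(). */
static void klp_handler_sketch(const char *old_name)
{
	static bool warned;	/* the "_once" latch */

	if (!rcu_watching && !warned) {
		warned = true;
		warn_count++;
		fprintf(stderr,
			"warning: called \"%s\" without RCU watching\n",
			old_name);
	}
}
```

Calling the handler repeatedly emits the warning only on the first
unsafe call, which is what makes the check cheap enough to leave in.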

Feel free to use:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-10-06  9:16   ` Miroslav Benes
@ 2021-10-07  9:18     ` Vasily Gorbik
  2021-10-07 10:02       ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Vasily Gorbik @ 2021-10-07  9:18 UTC (permalink / raw)
  To: Peter Zijlstra, Miroslav Benes
  Cc: jpoimboe, jikos, pmladek, mingo, linux-kernel, joe.lawrence,
	fweisbec, tglx, hca, svens, sumanthk, live-patching, paulmck,
	rostedt, x86

On Wed, Oct 06, 2021 at 11:16:21AM +0200, Miroslav Benes wrote:
> On Wed, 29 Sep 2021, Peter Zijlstra wrote:
> 
> > Make sure to prod idle CPUs so they call klp_update_patch_state().
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/livepatch/transition.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
> >  	for_each_possible_cpu(cpu) {
> >  		task = idle_task(cpu);
> >  		if (cpu_online(cpu)) {
> > -			if (!klp_try_switch_task(task))
> > +			if (!klp_try_switch_task(task)) {
> >  				complete = false;
> > +				/* Make idle task go through the main loop. */
> > +				wake_up_if_idle(cpu);
> > +			}
> 
> Right, it should be enough.
> 
> Acked-by: Miroslav Benes <mbenes@suse.cz>
> 
> It would be nice to get Vasily's Tested-by tag on this one.

I gave the patches a spin on s390 with the livepatch kselftest as well as with
https://github.com/lpechacek/qa_test_klp.git

BTW, commit 43c79fbad385 ("klp_tc_17: Avoid running the test on
s390x") is no longer required, since s390 has implemented
HAVE_KPROBES_ON_FTRACE since v5.6, so I just reverted the test disablement.

Patches 1-6 work nicely, for them

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390

Thanks a lot!

Starting with patch 8 is where I start seeing this with my config:

Oct 07 10:46:00 kernel: Freeing unused kernel image (initmem) memory: 6524K
Oct 07 10:46:00 kernel: INFO: task swapper/0:1 blocked for more than 122 seconds.
Oct 07 10:46:00 kernel:       Not tainted 5.15.0-rc4-69810-ga714851e1aad-dirty #74
Oct 07 10:46:00 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 07 10:46:00 kernel: task:swapper/0       state:D stack:10648 pid:    1 ppid:     0 flags:0x00000000
Oct 07 10:46:00 kernel: Call Trace:
Oct 07 10:46:00 kernel:  [<0000000000e164b6>] __schedule+0x36e/0x8b0
Oct 07 10:46:00 kernel:  [<0000000000e16a4e>] schedule+0x56/0x128
Oct 07 10:46:00 kernel:  [<0000000000e1e426>] schedule_timeout+0x106/0x160
Oct 07 10:46:00 kernel:  [<0000000000e18316>] wait_for_completion+0xc6/0x118
Oct 07 10:46:00 kernel:  [<000000000020a15c>] rcu_barrier.part.0+0x17c/0x2c0
Oct 07 10:46:00 kernel:  [<0000000000e0fcc0>] kernel_init+0x60/0x168
Oct 07 10:46:00 kernel:  [<000000000010390c>] __ret_from_fork+0x3c/0x58
Oct 07 10:46:00 kernel:  [<0000000000e2094a>] ret_from_fork+0xa/0x30
Oct 07 10:46:00 kernel: 1 lock held by swapper/0/1:
Oct 07 10:46:00 kernel:  #0: 0000000001469600 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x42/0x80
Oct 07 10:46:00 kernel:
                        Showing all locks held in the system:
Oct 07 10:46:00 kernel: 1 lock held by swapper/0/1:
Oct 07 10:46:00 kernel:  #0: 0000000001469600 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x42/0x80
Oct 07 10:46:00 kernel: 2 locks held by kworker/u680:0/8:
Oct 07 10:46:00 kernel:  #0: 000000008013cd48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x222/0x738
Oct 07 10:46:00 kernel:  #1: 0000038000043dc8 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x222/0x738
Oct 07 10:46:00 kernel: 1 lock held by khungtaskd/413:
Oct 07 10:46:00 kernel:  #0: 000000000145c980 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x50
Oct 07 10:46:00 kernel:
Oct 07 10:46:00 kernel: =============================================

So, will keep an eye on the rest of these patches and re-test in future, thanks!

* Re: [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-10-07  9:18     ` Vasily Gorbik
@ 2021-10-07 10:02       ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-07 10:02 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Miroslav Benes, jpoimboe, jikos, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Thu, Oct 07, 2021 at 11:18:13AM +0200, Vasily Gorbik wrote:

> Patches 1-6 work nicely, for them
> 
> Acked-by: Vasily Gorbik <gor@linux.ibm.com>
> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
> 
> Thanks a lot!

Thanks, will add tags.

> Starting with patch 8 is where I start seeing this with my config:

Yeah, I properly wrecked things there.. still trying to work out how to
fix it :-)

* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-09-29 15:17 ` [PATCH v2 04/11] sched: Simplify wake_up_*idle*() Peter Zijlstra
@ 2021-10-13 14:32   ` Qian Cai
  2021-10-19  3:47     ` Qian Cai
       [not found]   ` <CGME20211022134630eucas1p2e79e2816587d182c580459d567c1f2a9@eucas1p2.samsung.com>
  1 sibling, 1 reply; 51+ messages in thread
From: Qian Cai @ 2021-10-13 14:32 UTC (permalink / raw)
  To: Peter Zijlstra, gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86



On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
>  {
>  	int cpu;
>  
> -	preempt_disable();
> +	cpus_read_lock();
>  	for_each_online_cpu(cpu) {
> -		if (cpu == smp_processor_id())
> +		if (cpu == raw_smp_processor_id())
>  			continue;
>  
>  		wake_up_if_idle(cpu);
>  	}
> -	preempt_enable();
> +	cpus_read_unlock();

Peter, it looks like this thing introduced a deadlock during CPU online/offline.

[  630.145166][  T129] WARNING: possible recursive locking detected
[  630.151164][  T129] 5.15.0-rc5-next-20211013+ #145 Not tainted
[  630.156988][  T129] --------------------------------------------
[  630.162984][  T129] cpuhp/21/129 is trying to acquire lock:
[  630.168547][  T129] ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: wake_up_all_idle_cpus+0x40/0xe8
wake_up_all_idle_cpus at /usr/src/linux-next/kernel/smp.c:1174
[  630.178040][  T129]
[  630.178040][){++++}-{0:0}, at help us debug this:
[  630.202292][  T129]  Possible unsafe locking scenario:
[  630.202292][  T129]
[  630.209590][  T129]        CPU0
[  630.212720][  T129]        ----
[  630.215851][  T129]   lock(cpu_hotplug_lock);
[  630.220202][  T129]   lock(cpu_hotplug_lock);
[  630.224553][  T129]
[  630.224553][  T129]  *** DEADLOCK ***
[  630.224553][  T129]
[  630.232545][  T129]  May be due to missing lock nesting notation
[  630.232545][  T129]
[  630.240711][  T129] 3 locks held by cpuhp/21/129:
[  630.245406][  T129]  #0: ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
[  630.254976][  T129]  #1: ffff800011f46780 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
[  630.264372][  T129]  #2: ffff8000191fb9c8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x24/0x38
[  630.274031][  T129]
[  630.274031][  T129] stack backtrace:
[  630.279767][  T129] CPU: 21 PID: 129 Comm: cpuhp/21 Not tainted 5.15.0-rc5-next-20211013+ #145
[  630.288371][  T129] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
[  630.296886][  T129] Call trace:
[  630.300017][  T129]  dump_backtrace+0x0/0x3b8
[  630.304369][  T129]  show_stack+0x20/0x30
[  630.308371][  T129]  dump_stack_lvl+0x8c/0xb8
[  630.312722][  T129]  dump_stack+0x1c/0x38
[  630.316723][  T129]  validate_chain+0x1d84/0x1da0
[  630.321421][  T129]  __lock_acquire+0xab0/0x2040
[  630.326033][  T129]  lock_acquire+0x32c/0xb08
[  630.330390][  T129]  cpus_read_lock+0x94/0x308
[  630.334827][  T129]  wake_up_all_idle_cpus+0x40/0xe8
[  630.339784][  T129]  cpuidle_uninstall_idle_handler+0x3c/0x50
[  630.345524][  T129]  cpuidle_pause_and_lock+0x28/0x38
[  630.350569][  T129]  acpi_processor_hotplug+0xc0/0x170
[  630.355701][  T129]  acpi_soft_cpu_online+0x124/0x250
[  630.360745][  T129]  cpuhp_invoke_callback+0x51c/0x2ab8
[  630.365963][  T129]  cpuhp_thread_fun+0x204/0x588
[  630.370659][  T129]  smpboot_thread_fn+0x3f0/0xc40
[  630.375444][  T129]  kthread+0x3d8/0x488
[  630.379360][  T129]  ret_from_fork+0x10/0x20
[  863.525716][  T191] INFO: task cpuhp/21:129 blocked for more than 122 seconds.
[  863.532954][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
[  863.539361][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  863.547927][  T191] task:cpuhp/21        state:D stack:59104 pid:  129 ppid:     2 flags:0x00000008
[  863.557029][  T191] Call trace:
[  863.560171][  T191]  __switch_to+0x184/0x400
[  863.564448][  T191]  __schedule+0x74c/0x1940
[  863.568753][  T191]  schedule+0x110/0x318
[  863.572764][  T191]  percpu_rwsem_wait+0x1b8/0x348
[  863.577592][  T191]  __percpu_down_read+0xb0/0x148
[  863.582386][  T191]  cpus_read_lock+0x2b0/0x308
[  863.586961][  T191]  wake_up_all_idle_cpus+0x40/0xe8
[  863.591931][  T191]  cpuidle_uninstall_idle_handler+0x3c/0x50
[  863.597716][  T191]  cpuidle_pause_and_lock+0x28/0x38
[  863.602771][  T191]  acpi_processor_hotplug+0xc0/0x170
[  863.607946][  T191]  acpi_soft_cpu_online+0x124/0x250
[  863.613001][  T191]  cpuhp_invoke_callback+0x51c/0x2ab8
[  863.618261][  T191]  cpuhp_thread_fun+0x204/0x588
[  863.622967][  T191]  smpboot_thread_fn+0x3f0/0xc40
[  863.627787][  T191]  kthread+0x3d8/0x488
[  863.631712][  T191]  ret_from_fork+0x10/0x20
[  863.636020][  T191] INFO: task kworker/0:2:189 blocked for more than 122 seconds.
[  863.643500][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
[  863.649882][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  863.658425][  T191] task:kworker/0:2     state:D stack:58368 pid:  189 ppid:     2 flags:0x00000008
[  863.667516][  T191] Workqueue: events vmstat_shepherd
[  863.672573][  T191] Call trace:
[  863.675731][  T191]  __switch_to+0x184/0x400
[  863.680001][  T191]  __schedule+0x74c/0x1940
[  863.684268][  T191]  schedule+0x110/0x318
[  863.688295][  T191]  percpu_rwsem_wait+0x1b8/0x348
[  863.693085][  T191]  __percpu_down_read+0xb0/0x148
[  863.697892][  T191]  cpus_read_lock+0x2b0/0x308
[  863.702421][  T191]  vmstat_shepherd+0x5c/0x1a8
[  863.706977][  T191]  process_one_work+0x808/0x19d0
[  863.711767][  T191]  worker_thread+0x334/0xae0
[  863.716227][  T191]  kthread+0x3d8/0x488
[  863.720149][  T191]  ret_from_fork+0x10/0x20
[  863.724487][  T191] INFO: task lsbug:4642 blocked for more than 123 seconds.
[  863.731565][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
[  863.737938][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  863.746490][  T191] task:lsbug           state:D stack:55536 pid: 4642 ppid:  4638 flags:0x00000008
[  863.755549][  T191] Call trace:
[  863.758712][  T191]  __switch_to+0x184/0x400
[  863.762984][  T191]  __schedule+0x74c/0x1940
[  863.767286][  T191]  schedule+0x110/0x318
[  863.771294][  T191]  schedule_timeout+0x188/0x238
[  863.776016][  T191]  wait_for_completion+0x174/0x290
[  863.780979][  T191]  __cpuhp_kick_ap+0x158/0x1a8
[  863.785592][  T191]  cpuhp_kick_ap+0x1f0/0x828
[  863.790053][  T191]  bringup_cpu+0x180/0x1e0
[  863.794320][  T191]  cpuhp_invoke_callback+0x51c/0x2ab8
[  863.799561][  T191]  cpuhp_invoke_callback_range+0xa4/0x108
[  863.805130][  T191]  cpu_up+0x528/0xd78
[  863.808982][  T191]  cpu_device_up+0x4c/0x68
[  863.813249][  T191]  cpu_subsys_online+0xc0/0x1f8
[  863.817972][  T191]  device_online+0x10c/0x180
[  863.822413][  T191]  online_store+0x10c/0x118
[  863.826791][  T191]  dev_attr_store+0x44/0x78
[  863.831148][  T191]  sysfs_kf_write+0xe8/0x138
[  863.835590][  T191]  kernfs_fop_write_iter+0x26c/0x3d0
[  863.840745][  T191]  new_sync_write+0x2bc/0x4f8
[  863.845275][  T191]  vfs_write+0x714/0xcd8
[  863.849387][  T191]  ksys_write+0xf8/0x1e0
[  863.853481][  T191]  __arm64_sys_write+0x74/0xa8
[  863.858113][  T191]  invoke_syscall.constprop.0+0xdc/0x1d8
[  863.863597][  T191]  do_el0_svc+0xe4/0x298
[  863.867710][  T191]  el0_svc+0x64/0x130
[  863.871545][  T191]  el0t_64_sync_handler+0xb0/0xb8
[  863.876437][  T191]  el0t_64_sync+0x180/0x184
[  863.880798][  T191] INFO: task mount:4682 blocked for more than 123 seconds.
[  863.887881][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
[  863.894232][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  863.902776][  T191] task:mount           state:D stack:55856 pid: 4682 ppid:  1101 flags:0x00000000
[  863.911865][  T191] Call trace:
[  863.915003][  T191]  __switch_to+0x184/0x400
[  863.919296][  T191]  __schedule+0x74c/0x1940
[  863.923564][  T191]  schedule+0x110/0x318
[  863.927590][  T191]  percpu_rwsem_wait+0x1b8/0x348
[  863.932380][  T191]  __percpu_down_read+0xb0/0x148
[  863.937187][  T191]  cpus_read_lock+0x2b0/0x308
[  863.941715][  T191]  alloc_workqueue+0x730/0xd48
[  863.946357][  T191]  loop_configure+0x2d4/0x1180 [loop]
[  863.951592][  T191]  lo_ioctl+0x5dc/0x1228 [loop]
[  863.956321][  T191]  blkdev_ioctl+0x258/0x820
[  863.960678][  T191]  __arm64_sys_ioctl+0x114/0x180
[  863.965468][  T191]  invoke_syscall.constprop.0+0xdc/0x1d8
[  863.970974][  T191]  do_el0_svc+0xe4/0x298
[  863.975069][  T191]  el0_svc+0x64/0x130
[  863.978922][  T191]  el0t_64_sync_handler+0xb0/0xb8
[  863.983798][  T191]  el0t_64_sync+0x180/0x184
[  863.988172][  T191] INFO: lockdep is turned off.

* Re: [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-09-29 15:17 ` [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle() Peter Zijlstra
  2021-10-05 12:00   ` Petr Mladek
  2021-10-06  9:16   ` Miroslav Benes
@ 2021-10-13 19:37   ` Arnd Bergmann
  2021-10-14 10:42     ` Peter Zijlstra
  2 siblings, 1 reply; 51+ messages in thread
From: Arnd Bergmann @ 2021-10-13 19:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vasily Gorbik, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
	Petr Mladek, Ingo Molnar, Linux Kernel Mailing List,
	joe.lawrence, Frédéric Weisbecker, Thomas Gleixner,
	Heiko Carstens, svens, sumanthk, live-patching, Paul E. McKenney,
	Steven Rostedt, the arch/x86 maintainers

On Wed, Sep 29, 2021 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Make sure to prod idle CPUs so they call klp_update_patch_state().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/livepatch/transition.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
>         for_each_possible_cpu(cpu) {
>                 task = idle_task(cpu);
>                 if (cpu_online(cpu)) {
> -                       if (!klp_try_switch_task(task))
> +                       if (!klp_try_switch_task(task)) {
>                                 complete = false;
> +                               /* Make idle task go through the main loop. */
> +                               wake_up_if_idle(cpu);
> +                       }

This caused a build regression on non-SMP kernels:

x86_64-linux-ld: kernel/livepatch/transition.o: in function
`klp_try_complete_transition':
transition.c:(.text+0x106e): undefined reference to `wake_up_if_idle'

Maybe add an IS_ENABLED(CONFIG_SMP) check to one of the if() conditions?

        Arnd

* Re: [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle()
  2021-10-13 19:37   ` Arnd Bergmann
@ 2021-10-14 10:42     ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-14 10:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Vasily Gorbik, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
	Petr Mladek, Ingo Molnar, Linux Kernel Mailing List,
	joe.lawrence, Frédéric Weisbecker, Thomas Gleixner,
	Heiko Carstens, svens, sumanthk, live-patching, Paul E. McKenney,
	Steven Rostedt, the arch/x86 maintainers

On Wed, Oct 13, 2021 at 09:37:01PM +0200, Arnd Bergmann wrote:
> On Wed, Sep 29, 2021 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Make sure to prod idle CPUs so they call klp_update_patch_state().
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/livepatch/transition.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
> >         for_each_possible_cpu(cpu) {
> >                 task = idle_task(cpu);
> >                 if (cpu_online(cpu)) {
> > -                       if (!klp_try_switch_task(task))
> > +                       if (!klp_try_switch_task(task)) {
> >                                 complete = false;
> > +                               /* Make idle task go through the main loop. */
> > +                               wake_up_if_idle(cpu);
> > +                       }
> 
> This caused a build regression on non-SMP kernels:

:-(

> x86_64-linux-ld: kernel/livepatch/transition.o: in function
> `klp_try_complete_transition':
> transition.c:(.text+0x106e): undefined reference to `wake_up_if_idle'
> 
> Maybe add an IS_ENABLED(CONFIG_SMP) check to one of the if() conditions?

I'll just add a stub for that function, I think. Let me rebase the thing
before I push more crap on top..
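
[Editorial sketch, not part of the original mail: the "add a stub" idiom
Peter mentions usually means providing an empty inline version of the
function for !CONFIG_SMP builds, since there is no other CPU to wake on UP.
Everything below is a userspace mock with made-up names, not the actual
kernel patch.]

```c
#define DEMO_CONFIG_SMP 0	/* pretend this is a UP (!SMP) build */

static int wakeups;		/* observable side effect for the demo */

#if DEMO_CONFIG_SMP
/* SMP build: a real out-of-line implementation would live elsewhere. */
void wake_up_if_idle(int cpu);
#else
/* UP build: nothing to wake, so the call compiles down to a no-op. */
static inline void wake_up_if_idle(int cpu)
{
	(void)cpu;
}
#endif
```

With the stub in place, callers like klp_try_complete_transition() need no
IS_ENABLED() check at all; the call simply vanishes on UP kernels.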

* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-13 14:32   ` Qian Cai
@ 2021-10-19  3:47     ` Qian Cai
  2021-10-19  8:56       ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Qian Cai @ 2021-10-19  3:47 UTC (permalink / raw)
  To: Peter Zijlstra, gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

Peter, any thoughts? I did confirm that reverting the commit fixed the issue.

On 10/13/2021 10:32 AM, Qian Cai wrote:
> 
> 
> On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
>>  {
>>  	int cpu;
>>  
>> -	preempt_disable();
>> +	cpus_read_lock();
>>  	for_each_online_cpu(cpu) {
>> -		if (cpu == smp_processor_id())
>> +		if (cpu == raw_smp_processor_id())
>>  			continue;
>>  
>>  		wake_up_if_idle(cpu);
>>  	}
>> -	preempt_enable();
>> +	cpus_read_unlock();
> 
> Peter, it looks like this thing introduced a deadlock during CPU online/offline.
> 
> [  630.145166][  T129] WARNING: possible recursive locking detected
> [  630.151164][  T129] 5.15.0-rc5-next-20211013+ #145 Not tainted
> [  630.156988][  T129] --------------------------------------------
> [  630.162984][  T129] cpuhp/21/129 is trying to acquire lock:
> [  630.168547][  T129] ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: wake_up_all_idle_cpus+0x40/0xe8
> wake_up_all_idle_cpus at /usr/src/linux-next/kernel/smp.c:1174
> [  630.178040][  T129]
> [  630.178040][){++++}-{0:0}, at help us debug this:
> [  630.202292][  T129]  Possible unsafe locking scenario:
> [  630.202292][  T129]
> [  630.209590][  T129]        CPU0
> [  630.212720][  T129]        ----
> [  630.215851][  T129]   lock(cpu_hotplug_lock);
> [  630.220202][  T129]   lock(cpu_hotplug_lock);
> [  630.224553][  T129]
> [  630.224553][  T129]  *** DEADLOCK ***
> [  630.224553][  T129]
> [  630.232545][  T129]  May be due to missing lock nesting notation
> [  630.232545][  T129]
> [  630.240711][  T129] 3 locks held by cpuhp/21/129:
> [  630.245406][  T129]  #0: ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
> [  630.254976][  T129]  #1: ffff800011f46780 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
> [  630.264372][  T129]  #2: ffff8000191fb9c8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x24/0x38
> [  630.274031][  T129]
> [  630.274031][  T129] stack backtrace:
> [  630.279767][  T129] CPU: 21 PID: 129 Comm: cpuhp/21 Not tainted 5.15.0-rc5-next-20211013+ #145
> [  630.288371][  T129] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> [  630.296886][  T129] Call trace:
> [  630.300017][  T129]  dump_backtrace+0x0/0x3b8
> [  630.304369][  T129]  show_stack+0x20/0x30
> [  630.308371][  T129]  dump_stack_lvl+0x8c/0xb8
> [  630.312722][  T129]  dump_stack+0x1c/0x38
> [  630.316723][  T129]  validate_chain+0x1d84/0x1da0
> [  630.321421][  T129]  __lock_acquire+0xab0/0x2040
> [  630.326033][  T129]  lock_acquire+0x32c/0xb08
> [  630.330390][  T129]  cpus_read_lock+0x94/0x308
> [  630.334827][  T129]  wake_up_all_idle_cpus+0x40/0xe8
> [  630.339784][  T129]  cpuidle_uninstall_idle_handler+0x3c/0x50
> [  630.345524][  T129]  cpuidle_pause_and_lock+0x28/0x38
> [  630.350569][  T129]  acpi_processor_hotplug+0xc0/0x170
> [  630.355701][  T129]  acpi_soft_cpu_online+0x124/0x250
> [  630.360745][  T129]  cpuhp_invoke_callback+0x51c/0x2ab8
> [  630.365963][  T129]  cpuhp_thread_fun+0x204/0x588
> [  630.370659][  T129]  smpboot_thread_fn+0x3f0/0xc40
> [  630.375444][  T129]  kthread+0x3d8/0x488
> [  630.379360][  T129]  ret_from_fork+0x10/0x20
> [  863.525716][  T191] INFO: task cpuhp/21:129 blocked for more than 122 seconds.
> [  863.532954][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
> [  863.539361][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  863.547927][  T191] task:cpuhp/21        state:D stack:59104 pid:  129 ppid:     2 flags:0x00000008
> [  863.557029][  T191] Call trace:
> [  863.560171][  T191]  __switch_to+0x184/0x400
> [  863.564448][  T191]  __schedule+0x74c/0x1940
> [  863.568753][  T191]  schedule+0x110/0x318
> [  863.572764][  T191]  percpu_rwsem_wait+0x1b8/0x348
> [  863.577592][  T191]  __percpu_down_read+0xb0/0x148
> [  863.582386][  T191]  cpus_read_lock+0x2b0/0x308
> [  863.586961][  T191]  wake_up_all_idle_cpus+0x40/0xe8
> [  863.591931][  T191]  cpuidle_uninstall_idle_handler+0x3c/0x50
> [  863.597716][  T191]  cpuidle_pause_and_lock+0x28/0x38
> [  863.602771][  T191]  acpi_processor_hotplug+0xc0/0x170
> [  863.607946][  T191]  acpi_soft_cpu_online+0x124/0x250
> [  863.613001][  T191]  cpuhp_invoke_callback+0x51c/0x2ab8
> [  863.618261][  T191]  cpuhp_thread_fun+0x204/0x588
> [  863.622967][  T191]  smpboot_thread_fn+0x3f0/0xc40
> [  863.627787][  T191]  kthread+0x3d8/0x488
> [  863.631712][  T191]  ret_from_fork+0x10/0x20
> [  863.636020][  T191] INFO: task kworker/0:2:189 blocked for more than 122 seconds.
> [  863.643500][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
> [  863.649882][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  863.658425][  T191] task:kworker/0:2     state:D stack:58368 pid:  189 ppid:     2 flags:0x00000008
> [  863.667516][  T191] Workqueue: events vmstat_shepherd
> [  863.672573][  T191] Call trace:
> [  863.675731][  T191]  __switch_to+0x184/0x400
> [  863.680001][  T191]  __schedule+0x74c/0x1940
> [  863.684268][  T191]  schedule+0x110/0x318
> [  863.688295][  T191]  percpu_rwsem_wait+0x1b8/0x348
> [  863.693085][  T191]  __percpu_down_read+0xb0/0x148
> [  863.697892][  T191]  cpus_read_lock+0x2b0/0x308
> [  863.702421][  T191]  vmstat_shepherd+0x5c/0x1a8
> [  863.706977][  T191]  process_one_work+0x808/0x19d0
> [  863.711767][  T191]  worker_thread+0x334/0xae0
> [  863.716227][  T191]  kthread+0x3d8/0x488
> [  863.720149][  T191]  ret_from_fork+0x10/0x20
> [  863.724487][  T191] INFO: task lsbug:4642 blocked for more than 123 seconds.
> [  863.731565][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
> [  863.737938][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  863.746490][  T191] task:lsbug           state:D stack:55536 pid: 4642 ppid:  4638 flags:0x00000008
> [  863.755549][  T191] Call trace:
> [  863.758712][  T191]  __switch_to+0x184/0x400
> [  863.762984][  T191]  __schedule+0x74c/0x1940
> [  863.767286][  T191]  schedule+0x110/0x318
> [  863.771294][  T191]  schedule_timeout+0x188/0x238
> [  863.776016][  T191]  wait_for_completion+0x174/0x290
> [  863.780979][  T191]  __cpuhp_kick_ap+0x158/0x1a8
> [  863.785592][  T191]  cpuhp_kick_ap+0x1f0/0x828
> [  863.790053][  T191]  bringup_cpu+0x180/0x1e0
> [  863.794320][  T191]  cpuhp_invoke_callback+0x51c/0x2ab8
> [  863.799561][  T191]  cpuhp_invoke_callback_range+0xa4/0x108
> [  863.805130][  T191]  cpu_up+0x528/0xd78
> [  863.808982][  T191]  cpu_device_up+0x4c/0x68
> [  863.813249][  T191]  cpu_subsys_online+0xc0/0x1f8
> [  863.817972][  T191]  device_online+0x10c/0x180
> [  863.822413][  T191]  online_store+0x10c/0x118
> [  863.826791][  T191]  dev_attr_store+0x44/0x78
> [  863.831148][  T191]  sysfs_kf_write+0xe8/0x138
> [  863.835590][  T191]  kernfs_fop_write_iter+0x26c/0x3d0
> [  863.840745][  T191]  new_sync_write+0x2bc/0x4f8
> [  863.845275][  T191]  vfs_write+0x714/0xcd8
> [  863.849387][  T191]  ksys_write+0xf8/0x1e0
> [  863.853481][  T191]  __arm64_sys_write+0x74/0xa8
> [  863.858113][  T191]  invoke_syscall.constprop.0+0xdc/0x1d8
> [  863.863597][  T191]  do_el0_svc+0xe4/0x298
> [  863.867710][  T191]  el0_svc+0x64/0x130
> [  863.871545][  T191]  el0t_64_sync_handler+0xb0/0xb8
> [  863.876437][  T191]  el0t_64_sync+0x180/0x184
> [  863.880798][  T191] INFO: task mount:4682 blocked for more than 123 seconds.
> [  863.887881][  T191]       Not tainted 5.15.0-rc5-next-20211013+ #145
> [  863.894232][  T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  863.902776][  T191] task:mount           state:D stack:55856 pid: 4682 ppid:  1101 flags:0x00000000
> [  863.911865][  T191] Call trace:
> [  863.915003][  T191]  __switch_to+0x184/0x400
> [  863.919296][  T191]  __schedule+0x74c/0x1940
> [  863.923564][  T191]  schedule+0x110/0x318
> [  863.927590][  T191]  percpu_rwsem_wait+0x1b8/0x348
> [  863.932380][  T191]  __percpu_down_read+0xb0/0x148
> [  863.937187][  T191]  cpus_read_lock+0x2b0/0x308
> [  863.941715][  T191]  alloc_workqueue+0x730/0xd48
> [  863.946357][  T191]  loop_configure+0x2d4/0x1180 [loop]
> [  863.951592][  T191]  lo_ioctl+0x5dc/0x1228 [loop]
> [  863.956321][  T191]  blkdev_ioctl+0x258/0x820
> [  863.960678][  T191]  __arm64_sys_ioctl+0x114/0x180
> [  863.965468][  T191]  invoke_syscall.constprop.0+0xdc/0x1d8
> [  863.970974][  T191]  do_el0_svc+0xe4/0x298
> [  863.975069][  T191]  el0_svc+0x64/0x130
> [  863.978922][  T191]  el0t_64_sync_handler+0xb0/0xb8
> [  863.983798][  T191]  el0t_64_sync+0x180/0x184
> [  863.988172][  T191] INFO: lockdep is turned off.
> 

* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-19  3:47     ` Qian Cai
@ 2021-10-19  8:56       ` Peter Zijlstra
  2021-10-19  9:10         ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-19  8:56 UTC (permalink / raw)
  To: Qian Cai
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Mon, Oct 18, 2021 at 11:47:32PM -0400, Qian Cai wrote:
> Peter, any thoughts? I did confirm that reverting the commit fixed the issue.
> 
> On 10/13/2021 10:32 AM, Qian Cai wrote:
> > 
> > 
> > On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
> >> --- a/kernel/smp.c
> >> +++ b/kernel/smp.c
> >> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
> >>  {
> >>  	int cpu;
> >>  
> >> -	preempt_disable();
> >> +	cpus_read_lock();
> >>  	for_each_online_cpu(cpu) {
> >> -		if (cpu == smp_processor_id())
> >> +		if (cpu == raw_smp_processor_id())
> >>  			continue;
> >>  
> >>  		wake_up_if_idle(cpu);
> >>  	}
> >> -	preempt_enable();
> >> +	cpus_read_unlock();

Right, so yesterday I thought: YW2KGrvvv/vSA+97@hirez.programming.kicks-ass.net
but today I might have another idea, lemme go prod at this a bit more.

* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-19  8:56       ` Peter Zijlstra
@ 2021-10-19  9:10         ` Peter Zijlstra
  2021-10-19 15:32           ` Qian Cai
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-19  9:10 UTC (permalink / raw)
  To: Qian Cai
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Tue, Oct 19, 2021 at 10:56:48AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 18, 2021 at 11:47:32PM -0400, Qian Cai wrote:
> > Peter, any thoughts? I did confirm that reverting the commit fixed the issue.
> > 
> > On 10/13/2021 10:32 AM, Qian Cai wrote:
> > > 
> > > 
> > > On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
> > >> --- a/kernel/smp.c
> > >> +++ b/kernel/smp.c
> > >> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
> > >>  {
> > >>  	int cpu;
> > >>  
> > >> -	preempt_disable();
> > >> +	cpus_read_lock();
> > >>  	for_each_online_cpu(cpu) {
> > >> -		if (cpu == smp_processor_id())
> > >> +		if (cpu == raw_smp_processor_id())
> > >>  			continue;
> > >>  
> > >>  		wake_up_if_idle(cpu);
> > >>  	}
> > >> -	preempt_enable();
> > >> +	cpus_read_unlock();
> 
> Right, so yesterday I thought: YW2KGrvvv/vSA+97@hirez.programming.kicks-ass.net
> but today I might have another idea, lemme go prod at this a bit more.

Here, hows this then?

---
diff --git a/kernel/smp.c b/kernel/smp.c
index ad0b68a3a3d3..61ddc7a3bafa 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
 {
 	int cpu;
 
-	cpus_read_lock();
-	for_each_online_cpu(cpu) {
-		if (cpu == raw_smp_processor_id())
-			continue;
-
-		wake_up_if_idle(cpu);
+	for_each_cpu(cpu) {
+		preempt_disable();
+		if (cpu != smp_processor_id() && cpu_online(cpu))
+			wake_up_if_idle(cpu);
+		preempt_enable();
 	}
-	cpus_read_unlock();
 }
 EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
 

* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-19  9:10         ` Peter Zijlstra
@ 2021-10-19 15:32           ` Qian Cai
  2021-10-19 15:50             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Qian Cai @ 2021-10-19 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86



On 10/19/2021 5:10 AM, Peter Zijlstra wrote:
> Here, hows this then?
> 
> ---
> diff --git a/kernel/smp.c b/kernel/smp.c
> index ad0b68a3a3d3..61ddc7a3bafa 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
>  {
>  	int cpu;
>  
> -	cpus_read_lock();
> -	for_each_online_cpu(cpu) {
> -		if (cpu == raw_smp_processor_id())
> -			continue;
> -
> -		wake_up_if_idle(cpu);
> +	for_each_cpu(cpu) {
> +		preempt_disable();
> +		if (cpu != smp_processor_id() && cpu_online(cpu))
> +			wake_up_if_idle(cpu);
> +		preempt_enable();
>  	}
> -	cpus_read_unlock();
>  }
>  EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);

This does not compile.

kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given


* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-19 15:32           ` Qian Cai
@ 2021-10-19 15:50             ` Peter Zijlstra
  2021-10-19 19:22               ` Qian Cai
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-19 15:50 UTC (permalink / raw)
  To: Qian Cai
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Tue, Oct 19, 2021 at 11:32:15AM -0400, Qian Cai wrote:
> 
> 
> On 10/19/2021 5:10 AM, Peter Zijlstra wrote:
> > Here, how's this then?
> > 
> > ---
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index ad0b68a3a3d3..61ddc7a3bafa 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
> >  {
> >  	int cpu;
> >  
> > -	cpus_read_lock();
> > -	for_each_online_cpu(cpu) {
> > -		if (cpu == raw_smp_processor_id())
> > -			continue;
> > -
> > -		wake_up_if_idle(cpu);
> > +	for_each_cpu(cpu) {
> > +		preempt_disable();
> > +		if (cpu != smp_processor_id() && cpu_online(cpu))
> > +			wake_up_if_idle(cpu);
> > +		preempt_enable();
> >  	}
> > -	cpus_read_unlock();
> >  }
> >  EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
> 
> This does not compile.
> 
> kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given

Bah, for_each_possible_cpu(), lemme update the patch, I'm sure the
robots will scream bloody murder on that too.


* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-19 15:50             ` Peter Zijlstra
@ 2021-10-19 19:22               ` Qian Cai
  2021-10-19 20:27                 ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Qian Cai @ 2021-10-19 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86



On 10/19/21 11:50 AM, Peter Zijlstra wrote:
>>> diff --git a/kernel/smp.c b/kernel/smp.c
>>> index ad0b68a3a3d3..61ddc7a3bafa 100644
>>> --- a/kernel/smp.c
>>> +++ b/kernel/smp.c
>>> @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
>>>  {
>>>  	int cpu;
>>>  
>>> -	cpus_read_lock();
>>> -	for_each_online_cpu(cpu) {
>>> -		if (cpu == raw_smp_processor_id())
>>> -			continue;
>>> -
>>> -		wake_up_if_idle(cpu);
>>> +	for_each_cpu(cpu) {
>>> +		preempt_disable();
>>> +		if (cpu != smp_processor_id() && cpu_online(cpu))
>>> +			wake_up_if_idle(cpu);
>>> +		preempt_enable();
>>>  	}
>>> -	cpus_read_unlock();
>>>  }
>>>  EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
>>
>> This does not compile.
>>
>> kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given
> 
> Bah, for_each_possible_cpu(), lemme update the patch, I'm sure the
> robots will scream bloody murder on that too.

This survived the regression tests here.


* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
  2021-10-19 19:22               ` Qian Cai
@ 2021-10-19 20:27                 ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-19 20:27 UTC (permalink / raw)
  To: Qian Cai
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Tue, Oct 19, 2021 at 03:22:43PM -0400, Qian Cai wrote:
> 
> 
> On 10/19/21 11:50 AM, Peter Zijlstra wrote:
> >>> diff --git a/kernel/smp.c b/kernel/smp.c
> >>> index ad0b68a3a3d3..61ddc7a3bafa 100644
> >>> --- a/kernel/smp.c
> >>> +++ b/kernel/smp.c
> >>> @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
> >>>  {
> >>>  	int cpu;
> >>>  
> >>> -	cpus_read_lock();
> >>> -	for_each_online_cpu(cpu) {
> >>> -		if (cpu == raw_smp_processor_id())
> >>> -			continue;
> >>> -
> >>> -		wake_up_if_idle(cpu);
> >>> +	for_each_cpu(cpu) {
> >>> +		preempt_disable();
> >>> +		if (cpu != smp_processor_id() && cpu_online(cpu))
> >>> +			wake_up_if_idle(cpu);
> >>> +		preempt_enable();
> >>>  	}
> >>> -	cpus_read_unlock();
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
> >>
> >> This does not compile.
> >>
> >> kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given
> > 
> > Bah, for_each_possible_cpu(), lemme update the patch, I'm sure the
> > robots will scream bloody murder on that too.
> 
> This survived the regression tests here.

Thanks!


* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-09-29 15:17 ` [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL Peter Zijlstra
@ 2021-10-21 18:39   ` Marcelo Tosatti
  2021-10-21 18:40     ` Marcelo Tosatti
  2021-10-21 19:25     ` Peter Zijlstra
  0 siblings, 2 replies; 51+ messages in thread
From: Marcelo Tosatti @ 2021-10-21 18:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

Peter,

static __always_inline void arch_exit_to_user_mode(void)
{
        mds_user_clear_cpu_buffers();
}

/**
 * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
 *
 * Clear CPU buffers if the corresponding static key is enabled
 */
static __always_inline void mds_user_clear_cpu_buffers(void)
{
        if (static_branch_likely(&mds_user_clear))
                mds_clear_cpu_buffers();
}

We were discussing how to perform objtool-style validation
that no code runs after the check for

> +             /* NMI happens here and must still do/finish CT_WORK_n */
> +             sync_core();

But after the discussion with you, it seems doing the TLB checking 
and (also sync_core) checking very late/very early on exit/entry 
makes things easier to review.

Can then use a single atomic variable with USER/KERNEL state and cmpxchg
loops.

On Wed, Sep 29, 2021 at 05:17:34PM +0200, Peter Zijlstra wrote:
> Use the new context_tracking infrastructure to avoid disturbing
> userspace tasks when we rewrite kernel code.
> 
> XXX re-audit the entry code to make sure only the context_tracking
> static_branch is before hitting this code.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/sync_core.h |    2 ++
>  arch/x86/kernel/alternative.c    |    8 +++++++-
>  include/linux/context_tracking.h |    1 +
>  kernel/context_tracking.c        |   12 ++++++++++++
>  4 files changed, 22 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/sync_core.h
> +++ b/arch/x86/include/asm/sync_core.h
> @@ -87,6 +87,8 @@ static inline void sync_core(void)
>  	 */
>  	iret_to_self();
>  }
> +#define sync_core sync_core
> +
>  
>  /*
>   * Ensure that a core serializing instruction is issued before returning
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -18,6 +18,7 @@
>  #include <linux/mmu_context.h>
>  #include <linux/bsearch.h>
>  #include <linux/sync_core.h>
> +#include <linux/context_tracking.h>
>  #include <asm/text-patching.h>
>  #include <asm/alternative.h>
>  #include <asm/sections.h>
> @@ -924,9 +925,14 @@ static void do_sync_core(void *info)
>  	sync_core();
>  }
>  
> +static bool do_sync_core_cond(int cpu, void *info)
> +{
> +	return !context_tracking_set_cpu_work(cpu, CT_WORK_SYNC);
> +}
> +
>  void text_poke_sync(void)
>  {
> -	on_each_cpu(do_sync_core, NULL, 1);
> +	on_each_cpu_cond(do_sync_core_cond, do_sync_core, NULL, 1);
>  }
>  
>  struct text_poke_loc {
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -11,6 +11,7 @@
>  
>  enum ct_work {
>  	CT_WORK_KLP = 1,
> +	CT_WORK_SYNC = 2,
>  };
>  
>  /*
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -51,6 +51,10 @@ static __always_inline void context_trac
>  	__this_cpu_dec(context_tracking.recursion);
>  }
>  
> +#ifndef sync_core
> +static inline void sync_core(void) { }
> +#endif
> +
>  /* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */
>  static noinstr void ct_exit_user_work(struct context_tracking *ct)
>  {
> @@ -64,6 +68,14 @@ static noinstr void ct_exit_user_work(struct
>  		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
>  	}
>  
> +	if (work & CT_WORK_SYNC) {
> +		/* NMI happens here and must still do/finish CT_WORK_n */
> +		sync_core();
> +
> +		smp_mb__before_atomic();
> +		arch_atomic_andnot(CT_WORK_SYNC, &ct->work);
> +	}
> +
>  	smp_mb__before_atomic();
>  	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
>  }
> 
> 
> 



* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-10-21 18:39   ` Marcelo Tosatti
@ 2021-10-21 18:40     ` Marcelo Tosatti
  2021-10-21 19:25     ` Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: Marcelo Tosatti @ 2021-10-21 18:40 UTC (permalink / raw)
  To: Peter Zijlstra, Nicolas Saenz Julienne
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

+CC Nicolas.

On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> Peter,
> 
> static __always_inline void arch_exit_to_user_mode(void)
> {
>         mds_user_clear_cpu_buffers();
> }
> 
> /**
>  * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
>  *
>  * Clear CPU buffers if the corresponding static key is enabled
>  */
> static __always_inline void mds_user_clear_cpu_buffers(void)
> {
>         if (static_branch_likely(&mds_user_clear))
>                 mds_clear_cpu_buffers();
> }
> 
> We were discussing how to perform objtool-style validation
> that no code runs after the check for
> 
> > +             /* NMI happens here and must still do/finish CT_WORK_n */
> > +             sync_core();
> 
> But after the discussion with you, it seems doing the TLB checking 
> and (also sync_core) checking very late/very early on exit/entry 
> makes things easier to review.
> 
> Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> loops.
> 
> On Wed, Sep 29, 2021 at 05:17:34PM +0200, Peter Zijlstra wrote:
> > Use the new context_tracking infrastructure to avoid disturbing
> > userspace tasks when we rewrite kernel code.
> > 
> > XXX re-audit the entry code to make sure only the context_tracking
> > static_branch is before hitting this code.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  arch/x86/include/asm/sync_core.h |    2 ++
> >  arch/x86/kernel/alternative.c    |    8 +++++++-
> >  include/linux/context_tracking.h |    1 +
> >  kernel/context_tracking.c        |   12 ++++++++++++
> >  4 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > --- a/arch/x86/include/asm/sync_core.h
> > +++ b/arch/x86/include/asm/sync_core.h
> > @@ -87,6 +87,8 @@ static inline void sync_core(void)
> >  	 */
> >  	iret_to_self();
> >  }
> > +#define sync_core sync_core
> > +
> >  
> >  /*
> >   * Ensure that a core serializing instruction is issued before returning
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/mmu_context.h>
> >  #include <linux/bsearch.h>
> >  #include <linux/sync_core.h>
> > +#include <linux/context_tracking.h>
> >  #include <asm/text-patching.h>
> >  #include <asm/alternative.h>
> >  #include <asm/sections.h>
> > @@ -924,9 +925,14 @@ static void do_sync_core(void *info)
> >  	sync_core();
> >  }
> >  
> > +static bool do_sync_core_cond(int cpu, void *info)
> > +{
> > +	return !context_tracking_set_cpu_work(cpu, CT_WORK_SYNC);
> > +}
> > +
> >  void text_poke_sync(void)
> >  {
> > -	on_each_cpu(do_sync_core, NULL, 1);
> > +	on_each_cpu_cond(do_sync_core_cond, do_sync_core, NULL, 1);
> >  }
> >  
> >  struct text_poke_loc {
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -11,6 +11,7 @@
> >  
> >  enum ct_work {
> >  	CT_WORK_KLP = 1,
> > +	CT_WORK_SYNC = 2,
> >  };
> >  
> >  /*
> > --- a/kernel/context_tracking.c
> > +++ b/kernel/context_tracking.c
> > @@ -51,6 +51,10 @@ static __always_inline void context_trac
> >  	__this_cpu_dec(context_tracking.recursion);
> >  }
> >  
> > +#ifndef sync_core
> > +static inline void sync_core(void) { }
> > +#endif
> > +
> >  /* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */
> >  static noinstr void ct_exit_user_work(struct context_tracking *ct)
> >  {
> > @@ -64,6 +68,14 @@ static noinstr void ct_exit_user_work(struct
> >  		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
> >  	}
> >  
> > +	if (work & CT_WORK_SYNC) {
> > +		/* NMI happens here and must still do/finish CT_WORK_n */
> > +		sync_core();
> > +
> > +		smp_mb__before_atomic();
> > +		arch_atomic_andnot(CT_WORK_SYNC, &ct->work);
> > +	}
> > +
> >  	smp_mb__before_atomic();
> >  	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
> >  }
> > 
> > 
> > 



* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-10-21 18:39   ` Marcelo Tosatti
  2021-10-21 18:40     ` Marcelo Tosatti
@ 2021-10-21 19:25     ` Peter Zijlstra
  2021-10-21 19:57       ` Marcelo Tosatti
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-21 19:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> Peter,
> 
> static __always_inline void arch_exit_to_user_mode(void)
> {
>         mds_user_clear_cpu_buffers();
> }
> 
> /**
>  * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
>  *
>  * Clear CPU buffers if the corresponding static key is enabled
>  */
> static __always_inline void mds_user_clear_cpu_buffers(void)
> {
>         if (static_branch_likely(&mds_user_clear))
>                 mds_clear_cpu_buffers();
> }
> 
> We were discussing how to perform objtool-style validation
> that no code runs after the check for

I'm not sure what the point of the above is... Were you trying to ask
for validation that nothing runs after mds_user_clear_cpu_buffers()?

That isn't strictly true today, there's lockdep code after it. I can't
recall why that order is as it is though.

Pretty much everything in noinstr is magical, we just have to think
harder there (and possibly start writing more comments there).

> > +             /* NMI happens here and must still do/finish CT_WORK_n */
> > +             sync_core();
> 
> But after the discussion with you, it seems doing the TLB checking 
> and (also sync_core) checking very late/very early on exit/entry 
> makes things easier to review.

I don't know about late, it must happen *very* early in entry. The
sync_core() must happen before any self-modifying code gets called
(static_branch, static_call, etc..) with possible exception of the
context_tracking static_branch.

The TLBi must also happen super early, possibly while still on the
entry stack (since the task stack is vmap'ed). We currently don't run C
code on the entry stack, that needs quite a bit of careful work to make
happen.

> Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> loops.

We're not going to add an atomic to context tracking. There is one, we
just got to extract/share it with RCU.


* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-10-21 19:25     ` Peter Zijlstra
@ 2021-10-21 19:57       ` Marcelo Tosatti
  2021-10-21 20:18         ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Marcelo Tosatti @ 2021-10-21 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Thu, Oct 21, 2021 at 09:25:43PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> > Peter,
> > 
> > static __always_inline void arch_exit_to_user_mode(void)
> > {
> >         mds_user_clear_cpu_buffers();
> > }
> > 
> > /**
> >  * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
> >  *
> >  * Clear CPU buffers if the corresponding static key is enabled
> >  */
> > static __always_inline void mds_user_clear_cpu_buffers(void)
> > {
> >         if (static_branch_likely(&mds_user_clear))
> >                 mds_clear_cpu_buffers();
> > }
> > 
> > We were discussing how to perform objtool-style validation
> > that no code runs after the check for
> 
> I'm not sure what the point of the above is... Were you trying to ask
> for validation that nothing runs after mds_user_clear_cpu_buffers()?
> 
> That isn't strictly true today, there's lockdep code after it. I can't
> recall why that order is as it is though.
> 
> Pretty much everything in noinstr is magical, we just have to think
> harder there (and possibly start writing more comments there).

mds_user_clear_cpu_buffers() happens after sync_core(), in your patchset,
if I am not mistaken.

> > > +             /* NMI happens here and must still do/finish CT_WORK_n */
> > > +             sync_core();
> > 
> > But after the discussion with you, it seems doing the TLB checking 
> > and (also sync_core) checking very late/very early on exit/entry 
> > makes things easier to review.
> 
> I don't know about late, it must happen *very* early in entry. The
> sync_core() must happen before any self-modifying code gets called
> (static_branch, static_call, etc..) with possible exception of the
> context_tracking static_branch.
> 
> The TLBi must also happen super early, possibly while still on the
> entry stack (since the task stack is vmap'ed).

But will it ever be freed/remapped from other CPUs while the task
is running?

> We currently don't run C
> code on the entry stack, that needs quite a bit of careful work to make
> happen.

I was thinking of coding this in ASM after (as early as possible) the write
that switches to kernel CR3:

 Kernel entry:
 -------------

       cpu = smp_processor_id();

       if (isolation_enabled(cpu)) {
               reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
               if (reqs & CPU_REQ_FLUSH_TLB)
			flush_tlb_all();
               if (reqs & CPU_REQ_SYNC_CORE)
			sync_core();
       }                           

Exit to userspace (as close to write to CR3 with user pagetable
pointer):
 -----------------

       cpu = smp_processor_id();

       if (isolation_enabled(cpu)) {
               atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
       }

Do you think that is a bad idea (in ASM, not C)?
And request side can be in C:

 Request side:
 -------------

       int targetcpu;

       do {
               struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu);

               old_state = pcpudata->user_kernel_state;

               /* in kernel mode ? */
               if (!(old_state & IN_USER_MODE)) {
                       smp_call_function_single(request_fn, targetcpu, 1);
                       break;
               }                                                                                                                         
               new_state = remote_state | CPU_REQ_FLUSH_TLB; // (or CPU_REQ_INV_ICACHE)
       } while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state);   

(need logic to protect against atomic_cmpxchg() always failing, but that
shouldn't be difficult).

> > Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> > loops.
> 
> We're not going to add an atomic to context tracking. There is one, we
> just got to extract/share it with RCU.

Again, to avoid kernel TLB flushes you'd have to ensure:

kernel entry:
	instrA addr1,addr2,addr3
	instrB addr2,addr3,addr4  <--- that no address here has TLBs
				       modified and flushed
	instrC addr5,addr6,addr7
        reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
        if (reqs & CPU_REQ_FLUSH_TLB)
        	flush_tlb_all();

kernel exit:

        atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
	instrA addr1,addr2,addr3
	instrB addr2,addr3,addr4  <--- that no address here has TLBs
				       modified and flushed

This could be conditional on "task isolated mode" enabled (would be 
better if it didnt, though).

			



* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-10-21 19:57       ` Marcelo Tosatti
@ 2021-10-21 20:18         ` Peter Zijlstra
  2021-10-26 18:19           ` Marcelo Tosatti
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-21 20:18 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Thu, Oct 21, 2021 at 04:57:09PM -0300, Marcelo Tosatti wrote:
> > Pretty much everything in noinstr is magical, we just have to think
> > harder there (and possibly start writing more comments there).
> 
> mds_user_clear_cpu_buffers() happens after sync_core(), in your patchset,
> if I am not mistaken.

Of course it does, mds_user_clear_cpu_buffers() is on exit, the
sync_core() is on entry.

> > > > +             /* NMI happens here and must still do/finish CT_WORK_n */
> > > > +             sync_core();
> > > 
> > > But after the discussion with you, it seems doing the TLB checking 
> > > and (also sync_core) checking very late/very early on exit/entry 
> > > makes things easier to review.
> > 
> > I don't know about late, it must happen *very* early in entry. The
> > sync_core() must happen before any self-modifying code gets called
> > (static_branch, static_call, etc..) with possible exception of the
> > context_tracking static_branch.
> > 
> > The TLBi must also happen super early, possibly while still on the
> > entry stack (since the task stack is vmap'ed).
> 
> But will it ever be freed/remapped from other CPUs while the task
> is running?

Probably not, still something we need to be really careful with.
> 
> > We currently don't run C
> > code on the entry stack, that needs quite a bit of careful work to make
> > happen.
> 
> I was thinking of coding this in ASM after (as early as possible) the write
> that switches to kernel CR3:

No, we're not going to add new features to ASM. You'll just have to wait
until all that gets lifted to C.

>  Kernel entry:
>  -------------
> 
>        cpu = smp_processor_id();
> 
>        if (isolation_enabled(cpu)) {
>                reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
>                if (reqs & CPU_REQ_FLUSH_TLB)
> 			flush_tlb_all();
>                if (reqs & CPU_REQ_SYNC_CORE)
> 			sync_core();
>        }                           
> 
> Exit to userspace (as close to write to CR3 with user pagetable
> pointer):
>  -----------------
> 
>        cpu = smp_processor_id();
> 
>        if (isolation_enabled(cpu)) {
>                atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
>        }
> 
> Do you think that is a bad idea (in ASM, not C)?

Those atomics are a bad idea and not going to happen.

> > We're not going to add an atomic to context tracking. There is one, we
> > just got to extract/share it with RCU.
> 
> Again, to avoid kernel TLB flushes you'd have to ensure:

I know how it works, but we're not going to add a second atomic to
entry/exit. RCU has one in there, that's going to be it. Again, we just
got to extract/share.


* Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()
       [not found]   ` <CGME20211022134630eucas1p2e79e2816587d182c580459d567c1f2a9@eucas1p2.samsung.com>
@ 2021-10-22 13:46     ` Marek Szyprowski
  0 siblings, 0 replies; 51+ messages in thread
From: Marek Szyprowski @ 2021-10-22 13:46 UTC (permalink / raw)
  To: Peter Zijlstra, gor, jpoimboe, jikos, mbenes, pmladek, mingo
  Cc: linux-kernel, joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

Hi

On 29.09.2021 17:17, Peter Zijlstra wrote:
> Simplify and make wake_up_if_idle() more robust, also don't iterate
> the whole machine with preempt_disable() in its caller:
> wake_up_all_idle_cpus().
>
> This prepares for another wake_up_if_idle() user that needs a full
> do_idle() cycle.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

This patch landed recently in linux-next as commit 8850cb663b5c ("sched: 
Simplify wake_up_*idle*()"). It causes the following warning on the 
arm64 virt machine under qemu during the system suspend/resume cycle:

--->8---

  printk: Suspending console(s) (use no_console_suspend to debug)

  ============================================
  WARNING: possible recursive locking detected
  5.15.0-rc6-next-20211022 #10905 Not tainted
  --------------------------------------------
  rtcwake/1326 is trying to acquire lock:
  ffffd4e9192e8130 (cpu_hotplug_lock){++++}-{0:0}, at: 
wake_up_all_idle_cpus+0x24/0x98

  but task is already holding lock:
  ffffd4e9192e8130 (cpu_hotplug_lock){++++}-{0:0}, at: 
suspend_devices_and_enter+0x740/0x9f0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(cpu_hotplug_lock);
    lock(cpu_hotplug_lock);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  5 locks held by rtcwake/1326:
   #0: ffff54ad86a78438 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0x64/0xf0
   #1: ffff54ad84094a88 (&of->mutex){+.+.}-{3:3}, at: 
kernfs_fop_write_iter+0xf4/0x1a8
   #2: ffff54ad83b17a88 (kn->active#43){.+.+}-{0:0}, at: 
kernfs_fop_write_iter+0xfc/0x1a8
   #3: ffffd4e9192efab0 (system_transition_mutex){+.+.}-{3:3}, at: 
pm_suspend+0x214/0x3d0
   #4: ffffd4e9192e8130 (cpu_hotplug_lock){++++}-{0:0}, at: 
suspend_devices_and_enter+0x740/0x9f0

  stack backtrace:
  CPU: 0 PID: 1326 Comm: rtcwake Not tainted 5.15.0-rc6-next-20211022 #10905
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x0/0x1d0
   show_stack+0x14/0x20
   dump_stack_lvl+0x88/0xb0
   dump_stack+0x14/0x2c
   __lock_acquire+0x171c/0x17b8
   lock_acquire+0x234/0x378
   cpus_read_lock+0x5c/0x150
   wake_up_all_idle_cpus+0x24/0x98
   suspend_devices_and_enter+0x748/0x9f0
   pm_suspend+0x2b0/0x3d0
   state_store+0x84/0x108
   kobj_attr_store+0x14/0x28
   sysfs_kf_write+0x60/0x70
   kernfs_fop_write_iter+0x124/0x1a8
   new_sync_write+0xe8/0x1b0
   vfs_write+0x1d0/0x408
   ksys_write+0x64/0xf0
   __arm64_sys_write+0x14/0x20
   invoke_syscall+0x40/0xf8
   el0_svc_common.constprop.3+0x8c/0x120
   do_el0_svc_compat+0x18/0x48
   el0_svc_compat+0x48/0x100
   el0t_32_sync_handler+0xec/0x140
   el0t_32_sync+0x170/0x174
  OOM killer enabled.
  Restarting tasks ... done.
  PM: suspend exit

--->8---

Let me know if there is anything I can do to help debug and fix this issue.

> ---
>   kernel/sched/core.c |   14 +++++---------
>   kernel/smp.c        |    6 +++---
>   2 files changed, 8 insertions(+), 12 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3691,15 +3691,11 @@ void wake_up_if_idle(int cpu)
>   	if (!is_idle_task(rcu_dereference(rq->curr)))
>   		goto out;
>   
> -	if (set_nr_if_polling(rq->idle)) {
> -		trace_sched_wake_idle_without_ipi(cpu);
> -	} else {
> -		rq_lock_irqsave(rq, &rf);
> -		if (is_idle_task(rq->curr))
> -			smp_send_reschedule(cpu);
> -		/* Else CPU is not idle, do nothing here: */
> -		rq_unlock_irqrestore(rq, &rf);
> -	}
> +	rq_lock_irqsave(rq, &rf);
> +	if (is_idle_task(rq->curr))
> +		resched_curr(rq);
> +	/* Else CPU is not idle, do nothing here: */
> +	rq_unlock_irqrestore(rq, &rf);
>   
>   out:
>   	rcu_read_unlock();
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
>   {
>   	int cpu;
>   
> -	preempt_disable();
> +	cpus_read_lock();
>   	for_each_online_cpu(cpu) {
> -		if (cpu == smp_processor_id())
> +		if (cpu == raw_smp_processor_id())
>   			continue;
>   
>   		wake_up_if_idle(cpu);
>   	}
> -	preempt_enable();
> +	cpus_read_unlock();
>   }
>   EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
>   
>
>
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-10-21 20:18         ` Peter Zijlstra
@ 2021-10-26 18:19           ` Marcelo Tosatti
  2021-10-26 19:38             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Marcelo Tosatti @ 2021-10-26 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Thu, Oct 21, 2021 at 10:18:59PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 21, 2021 at 04:57:09PM -0300, Marcelo Tosatti wrote:
> > > Pretty much everything in noinstr is magical, we just have to think
> > > harder there (and possibly start writing more comments there).
> > 
> > mds_user_clear_cpu_buffers() happens after sync_core(), in your patchset,
> > if I am not mistaken.
> 
> Of course it does, mds_user_clear_cpu_buffers() is on exit, the
> sync_core() is on entry.

                                                                  static_key enable/disable

__exit_to_user_mode ->                                            context_tracking_set_cpu_work(cpu, work)
   user_enter_irqoff ->                                                  preempt_disable();
   __context_tracking_enter(CONTEXT_USER);                               seq = atomic_read(&ct->seq);
      ct_seq_user_enter(raw_cpu_ptr(&context_tracking));                 if (__context_tracking_seq_in_user(seq)) {
      {                                                                          /* ctrl-dep */
        arch_atomic_set(&ct->work, 0);                                           atomic_or(work, &ct->work);
        return arch_atomic_add_return(CT_SEQ_USER, &ct->seq);                    ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
                                                                         }
      }                                                                  preempt_enable();
   arch_exit_to_user_mode()
   mds_user_clear_cpu_buffers();  <--- sync_core work queued,
                                       but not executed.
                                       i-cache potentially stale?

ct_seq_user_enter should happen _after_ all possible static_key users?

(or recheck that there is no pending work after any possible
rewritable code/static_key user).

> 
> > > > > +             /* NMI happens here and must still do/finish CT_WORK_n */
> > > > > +             sync_core();
> > > > 
> > > > But after the discussion with you, it seems doing the TLB checking 
> > > > and (also sync_core) checking very late/very early on exit/entry 
> > > > makes things easier to review.
> > > 
> > > I don't know about late, it must happen *very* early in entry. The
> > > sync_core() must happen before any self-modifying code gets called
> > > (static_branch, static_call, etc..) with possible exception of the
> > > context_tracking static_branch.
> > > 
> > > The TLBi must also happen super early, possibly while still on the
> > > entry stack (since the task stack is vmap'ed).
> > 
> > But will it ever be freed/remapped from other CPUs while the task
> > is running?
> 
> Probably not, still something we need to be really careful with.
> > 
> > > We currently don't run C
> > > code on the entry stack; that needs quite a bit of careful work to make
> > > happen.
> > 
> > Was thinking of coding in ASM after (as early as possible) the write to 
> > switch to kernel CR3:
> 
> No, we're not going to add new features in ASM. You'll just have to wait
> until all that gets lifted to C.
> 
> >  Kernel entry:
> >  -------------
> > 
> >        cpu = smp_processor_id();
> > 
> >        if (isolation_enabled(cpu)) {
> >                reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
> >                if (reqs & CPU_REQ_FLUSH_TLB)
> > 			flush_tlb_all();
> >                if (reqs & CPU_REQ_SYNC_CORE)
> > 			sync_core();
> >        }                           
> > 
> > Exit to userspace (as close to write to CR3 with user pagetable
> > pointer):
> >  -----------------
> > 
> >        cpu = smp_processor_id();
> > 
> >        if (isolation_enabled(cpu)) {
> >                atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
> >        }
> > 
> > You think that is a bad idea (in ASM, not C) ? 
> 
> Those atomics are a bad idea and not going to happen.
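The entry/exit protocol proposed above can be modeled in userspace C11 atomics to make the objection concrete: it needs an unconditional atomic_xchg() on every kernel entry and an atomic RMW on every exit, on top of the atomic context tracking already carries. A sketch only; the names follow the quoted pseudocode, and the remote-request side is an assumption filled in for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IN_KERNEL_MODE     0
#define IN_USER_MODE       (1 << 0)
#define CPU_REQ_FLUSH_TLB  (1 << 1)	/* remote wants flush_tlb_all() */
#define CPU_REQ_SYNC_CORE  (1 << 2)	/* remote wants sync_core() */

static atomic_int user_kernel_state;

/* Kernel entry: atomically close the window and collect queued
 * requests; the caller then runs flush_tlb_all()/sync_core() as
 * indicated by the returned bits. */
static int isolation_kernel_enter(void)
{
	return atomic_exchange(&user_kernel_state, IN_KERNEL_MODE);
}

/* Exit to userspace, as close as possible to the CR3 write. */
static void isolation_user_enter(void)
{
	atomic_fetch_or(&user_kernel_state, IN_USER_MODE);
}

/* Remote CPU: queue a request iff the target is in userspace; on false
 * the caller must fall back to an IPI. */
static bool isolation_request(int req)
{
	int state = atomic_load(&user_kernel_state);

	do {
		if (!(state & IN_USER_MODE))
			return false;
	} while (!atomic_compare_exchange_weak(&user_kernel_state,
					       &state, state | req));
	return true;
}
```

The per-entry atomic_exchange() is precisely the cost being rejected here in favour of reusing the one atomic that RCU's dynticks counter already imposes on the entry path.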
> 
> > > We're not going to add an atomic to context tracking. There is one, we
> > > just got to extract/share it with RCU.
> > 
> > Again, to avoid kernel TLB flushes you'd have to ensure:
> 
> I know how it works, but we're not going to add a second atomic to
> entry/exit. RCU has one in there, that's going to be it. Again, we just
> got to extract/share.



* Re: [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL
  2021-10-26 18:19           ` Marcelo Tosatti
@ 2021-10-26 19:38             ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2021-10-26 19:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: gor, jpoimboe, jikos, mbenes, pmladek, mingo, linux-kernel,
	joe.lawrence, fweisbec, tglx, hca, svens, sumanthk,
	live-patching, paulmck, rostedt, x86

On Tue, Oct 26, 2021 at 03:19:11PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 21, 2021 at 10:18:59PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 21, 2021 at 04:57:09PM -0300, Marcelo Tosatti wrote:
> > > > Pretty much everything in noinstr is magical, we just have to think
> > > > harder there (and possibly start writing more comments there).
> > > 
> > > mds_user_clear_cpu_buffers happens after sync_core, in your patchset, 
> > > if I am not mistaken.
> > 
> > Of course it does, mds_user_clear_cpu_buffers() is on exit, the
> > sync_core() is on entry.
> 
>                                                                   static_key enable/disable
> 
> __exit_to_user_mode ->                                            context_tracking_set_cpu_work(cpu, work)
>    user_enter_irqoff ->                                                  preempt_disable();
>    __context_tracking_enter(CONTEXT_USER);                               seq = atomic_read(&ct->seq);
>       ct_seq_user_enter(raw_cpu_ptr(&context_tracking));                 if (__context_tracking_seq_in_user(seq)) {
>       {                                                                          /* ctrl-dep */
>         arch_atomic_set(&ct->work, 0);                                           atomic_or(work, &ct->work);
>         return arch_atomic_add_return(CT_SEQ_USER, &ct->seq);                    ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>                                                                          }
>       }                                                                  preempt_enable();
>    arch_exit_to_user_mode()
>    mds_user_clear_cpu_buffers();  <--- sync_core work queued,
>                                        but not executed.
>                                        i-cache potentially stale?
> 
> ct_seq_user_enter should happen _after_ all possible static_key users?

Right, so this one is actually okay, because that branch is *never*
changed after boot.

I'm not quite sure why it isn't an alternative(). At some point I
proposed static_call_lock() [1] and the corollary is static_branch_lock(),
which I suppose could be employed here. But I'm not sure that actually
helps much with auditing all that.


[1] https://lkml.kernel.org/r/20210904105529.GA5106@worktop.programming.kicks-ass.net
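For reference, the static_branch_lock() idea from [1] could be modeled roughly as below. This is a hypothetical sketch of the concept only; no such API exists in mainline, and a real implementation would live in the jump-label code rather than in a plain struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of static_branch_lock(): once a key is locked,
 * further flips are refused, so code could rely on the branch
 * direction never being patched again (and hence never needing a
 * text_poke_sync() against it). */
struct static_key_model {
	bool enabled;
	bool locked;
};

static bool static_branch_set(struct static_key_model *key, bool on)
{
	if (key->locked)
		return false;	/* a kernel version would WARN here */
	key->enabled = on;
	return true;
}

static void static_branch_lock(struct static_key_model *key)
{
	key->locked = true;
}
```

As noted above, such locking would document the "never changed after boot" property but would not by itself make auditing the entry paths much easier.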


end of thread, other threads:[~2021-10-26 19:41 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-29 15:17 [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Peter Zijlstra
2021-09-29 15:17 ` [PATCH v2 01/11] sched: Improve try_invoke_on_locked_down_task() Peter Zijlstra
2021-09-29 15:17 ` [PATCH v2 02/11] sched,rcu: Rework try_invoke_on_locked_down_task() Peter Zijlstra
2021-09-29 15:17 ` [PATCH v2 03/11] sched,livepatch: Use task_call_func() Peter Zijlstra
2021-10-05 11:40   ` Petr Mladek
2021-10-05 14:03     ` Peter Zijlstra
2021-10-06  8:59   ` Miroslav Benes
2021-09-29 15:17 ` [PATCH v2 04/11] sched: Simplify wake_up_*idle*() Peter Zijlstra
2021-10-13 14:32   ` Qian Cai
2021-10-19  3:47     ` Qian Cai
2021-10-19  8:56       ` Peter Zijlstra
2021-10-19  9:10         ` Peter Zijlstra
2021-10-19 15:32           ` Qian Cai
2021-10-19 15:50             ` Peter Zijlstra
2021-10-19 19:22               ` Qian Cai
2021-10-19 20:27                 ` Peter Zijlstra
     [not found]   ` <CGME20211022134630eucas1p2e79e2816587d182c580459d567c1f2a9@eucas1p2.samsung.com>
2021-10-22 13:46     ` Marek Szyprowski
2021-09-29 15:17 ` [PATCH v2 05/11] sched,livepatch: Use wake_up_if_idle() Peter Zijlstra
2021-10-05 12:00   ` Petr Mladek
2021-10-06  9:16   ` Miroslav Benes
2021-10-07  9:18     ` Vasily Gorbik
2021-10-07 10:02       ` Peter Zijlstra
2021-10-13 19:37   ` Arnd Bergmann
2021-10-14 10:42     ` Peter Zijlstra
2021-09-29 15:17 ` [RFC][PATCH v2 06/11] context_tracking: Prefix user_{enter,exit}*() Peter Zijlstra
2021-09-29 15:17 ` [RFC][PATCH v2 07/11] context_tracking: Add an atomic sequence/state count Peter Zijlstra
2021-09-29 15:17 ` [RFC][PATCH v2 08/11] context_tracking,rcu: Replace RCU dynticks counter with context_tracking Peter Zijlstra
2021-09-29 18:37   ` Paul E. McKenney
2021-09-29 19:09     ` Peter Zijlstra
2021-09-29 19:11     ` Peter Zijlstra
2021-09-29 19:13     ` Peter Zijlstra
2021-09-29 19:24       ` Peter Zijlstra
2021-09-29 19:45         ` Paul E. McKenney
2021-09-29 18:54   ` Peter Zijlstra
2021-09-29 15:17 ` [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ_FULL Peter Zijlstra
2021-10-06  8:12   ` Petr Mladek
2021-10-06  9:04     ` Peter Zijlstra
2021-10-06 10:29       ` Petr Mladek
2021-10-06 11:41         ` Peter Zijlstra
2021-10-06 11:48         ` Miroslav Benes
2021-09-29 15:17 ` [RFC][PATCH v2 10/11] livepatch: Remove klp_synchronize_transition() Peter Zijlstra
2021-10-06 12:30   ` Petr Mladek
2021-09-29 15:17 ` [RFC][PATCH v2 11/11] context_tracking,x86: Fix text_poke_sync() vs NOHZ_FULL Peter Zijlstra
2021-10-21 18:39   ` Marcelo Tosatti
2021-10-21 18:40     ` Marcelo Tosatti
2021-10-21 19:25     ` Peter Zijlstra
2021-10-21 19:57       ` Marcelo Tosatti
2021-10-21 20:18         ` Peter Zijlstra
2021-10-26 18:19           ` Marcelo Tosatti
2021-10-26 19:38             ` Peter Zijlstra
2021-09-29 18:03 ` [PATCH v2 00/11] sched,rcu,context_tracking,livepatch: Improve livepatch task transitions for idle and NOHZ_FULL Paul E. McKenney
