Hi,

One series in two parts; the first part (patches 1-5) should be in reasonable shape and ought to fix the original issues that got this whole thing started. The second part (patches 6-11) is still marked RFC and does scary things to try and make NOHZ_FULL work better.

Please consider..

PS. Paul, I did break RCU a bit and I'm not sure about the best way to put it back together; please see patch 8 for details.
Clarify and tighten try_invoke_on_locked_down_task(). Basically the function calls @func under task_rq_lock(), except it avoids taking rq->lock when possible. This makes calling @func unconditional (the function will get renamed in a later patch to remove the try). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 63 ++++++++++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 24 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4111,41 +4111,56 @@ try_to_wake_up(struct task_struct *p, un * @func: Function to invoke. * @arg: Argument to function. * - * If the specified task can be quickly locked into a definite state - * (either sleeping or on a given runqueue), arrange to keep it in that - * state while invoking @func(@arg). This function can use ->on_rq and - * task_curr() to work out what the state is, if required. Given that - * @func can be invoked with a runqueue lock held, it had better be quite - * lightweight. + * Fix the task in its current state by avoiding wakeups and/or rq operations + * and call @func(@arg) on it. This function can use ->on_rq and task_curr() + * to work out what the state is, if required. Given that @func can be invoked + * with a runqueue lock held, it had better be quite lightweight. * * Returns: - * @false if the task slipped out from under the locks. - * @true if the task was locked onto a runqueue or is sleeping. - * However, @func can override this by returning @false. + * Whatever @func returns */ bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg) { + struct rq *rq = NULL; + unsigned int state; struct rq_flags rf; bool ret = false; - struct rq *rq; raw_spin_lock_irqsave(&p->pi_lock, rf.flags); - if (p->on_rq) { + + state = READ_ONCE(p->__state); + + /* + * Ensure we load p->on_rq after p->__state, otherwise it would be + * possible to, falsely, observe p->on_rq == 0. 
+ * + * See try_to_wake_up() for a longer comment. + */ + smp_rmb(); + + /* + * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when + * the task is blocked. Make sure to check @state since ttwu() can drop + * locks at the end, see ttwu_queue_wakelist(). + */ + if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq) rq = __task_rq_lock(p, &rf); - if (task_rq(p) == rq) - ret = func(p, arg); + + /* + * At this point the task is pinned; either: + * - blocked and we're holding off wakeups (pi->lock) + * - woken, and we're holding off enqueue (rq->lock) + * - queued, and we're holding off schedule (rq->lock) + * - running, and we're holding off de-schedule (rq->lock) + * + * The called function (@func) can use: task_curr(), p->on_rq and + * p->__state to differentiate between these states. + */ + ret = func(p, arg); + + if (rq) rq_unlock(rq, &rf); - } else { - switch (READ_ONCE(p->__state)) { - case TASK_RUNNING: - case TASK_WAKING: - break; - default: - smp_rmb(); // See smp_rmb() comment in try_to_wake_up(). - if (!p->on_rq) - ret = func(p, arg); - } - } + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags); return ret; }
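For reference, the resulting locking flow can be modelled in plain C. The sketch below is a hypothetical user-space analogue, not the kernel code: the real locks are elided (this model is single-threaded), the state values are simplified stand-ins for p->__state, and the function carries the name it only gets in the next patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for p->__state values. */
enum task_state { T_RUNNING, T_WAKING, T_BLOCKED };

struct task {
	enum task_state state;	/* models p->__state */
	int on_rq;		/* models p->on_rq   */
};

typedef int (*task_call_f)(struct task *t, void *arg);

/*
 * Analogue of the reworked function: @func now runs unconditionally;
 * only whether rq->lock would be taken varies with the task's state.
 */
static int task_call_func(struct task *p, task_call_f func, void *arg)
{
	bool need_rq_lock;
	int ret;

	/* raw_spin_lock_irqsave(&p->pi_lock, ...) would go here */

	/*
	 * Blocked tasks are pinned by pi_lock alone; runnable or
	 * waking tasks additionally need rq->lock held.
	 */
	need_rq_lock = (p->state == T_RUNNING ||
			p->state == T_WAKING || p->on_rq);
	(void)need_rq_lock;	/* if (need_rq_lock) __task_rq_lock(...) */

	ret = func(p, arg);	/* task is pinned in one of the four states */

	/* unlock rq->lock (if taken) and pi_lock */
	return ret;
}

/* Example callback: report the observed state. */
static int get_state(struct task *t, void *arg)
{
	(void)arg;
	return (int)t->state;
}
```

Note how, unlike the old version, the callback's return value is passed through untouched in every case; there is no path on which the caller silently gets `false` without @func having run.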
Give try_invoke_on_locked_down_task() a saner name and have it return an int so that the caller might distinguish between different reasons of failure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> --- include/linux/wait.h | 3 ++- kernel/rcu/tasks.h | 12 ++++++------ kernel/rcu/tree_stall.h | 8 ++++---- kernel/sched/core.c | 6 +++--- 4 files changed, 15 insertions(+), 14 deletions(-) --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -1160,6 +1160,7 @@ int autoremove_wake_function(struct wait (wait)->flags = 0; \ } while (0) -bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg); +typedef int (*task_call_f)(struct task_struct *p, void *arg); +extern int task_call_func(struct task_struct *p, task_call_f func, void *arg); #endif /* _LINUX_WAIT_H */ --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -928,7 +928,7 @@ static void trc_read_check_handler(void } /* Callback function for scheduler to check locked-down task. */ -static bool trc_inspect_reader(struct task_struct *t, void *arg) +static int trc_inspect_reader(struct task_struct *t, void *arg) { int cpu = task_cpu(t); bool in_qs = false; @@ -939,7 +939,7 @@ static bool trc_inspect_reader(struct ta // If no chance of heavyweight readers, do it the hard way. if (!ofl && !IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) - return false; + return -EINVAL; // If heavyweight readers are enabled on the remote task, // we can inspect its state despite its currently running. @@ -947,7 +947,7 @@ static bool trc_inspect_reader(struct ta n_heavy_reader_attempts++; if (!ofl && // Check for "running" idle tasks on offline CPUs. !rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting)) - return false; // No quiescent state, do it the hard way. + return -EINVAL; // No quiescent state, do it the hard way. 
n_heavy_reader_updates++; if (ofl) n_heavy_reader_ofl_updates++; @@ -962,7 +962,7 @@ static bool trc_inspect_reader(struct ta t->trc_reader_checked = true; if (in_qs) - return true; // Already in quiescent state, done!!! + return 0; // Already in quiescent state, done!!! // The task is in a read-side critical section, so set up its // state so that it will awaken the grace-period kthread upon exit @@ -970,7 +970,7 @@ static bool trc_inspect_reader(struct ta atomic_inc(&trc_n_readers_need_end); // One more to wait on. WARN_ON_ONCE(READ_ONCE(t->trc_reader_special.b.need_qs)); WRITE_ONCE(t->trc_reader_special.b.need_qs, true); - return true; + return 0; } /* Attempt to extract the state for the specified task. */ @@ -992,7 +992,7 @@ static void trc_wait_for_one_reader(stru // Attempt to nail down the task for inspection. get_task_struct(t); - if (try_invoke_on_locked_down_task(t, trc_inspect_reader, NULL)) { + if (!task_call_func(t, trc_inspect_reader, NULL)) { put_task_struct(t); return; } --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -240,16 +240,16 @@ struct rcu_stall_chk_rdr { * Report out the state of a not-running task that is stalling the * current RCU grace period. */ -static bool check_slow_task(struct task_struct *t, void *arg) +static int check_slow_task(struct task_struct *t, void *arg) { struct rcu_stall_chk_rdr *rscrp = arg; if (task_curr(t)) - return false; // It is running, so decline to inspect it. + return -EBUSY; // It is running, so decline to inspect it. 
rscrp->nesting = t->rcu_read_lock_nesting; rscrp->rs = t->rcu_read_unlock_special; rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry); - return true; + return 0; } /* @@ -283,7 +283,7 @@ static int rcu_print_task_stall(struct r raw_spin_unlock_irqrestore_rcu_node(rnp, flags); while (i) { t = ts[--i]; - if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr)) + if (task_call_func(t, check_slow_task, &rscr)) pr_cont(" P%d", t->pid); else pr_cont(" P%d/%d:%c%c%c%c", --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4106,7 +4106,7 @@ try_to_wake_up(struct task_struct *p, un } /** - * try_invoke_on_locked_down_task - Invoke a function on task in fixed state + * task_call_func - Invoke a function on task in fixed state * @p: Process for which the function is to be invoked, can be @current. * @func: Function to invoke. * @arg: Argument to function. @@ -4119,12 +4119,12 @@ try_to_wake_up(struct task_struct *p, un * Returns: * Whatever @func returns */ -bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg) +int task_call_func(struct task_struct *p, task_call_f func, void *arg) { struct rq *rq = NULL; unsigned int state; struct rq_flags rf; - bool ret = false; + int ret; raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
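The bool-to-int conversion follows the usual kernel convention: 0 on success, a negative errno value telling the caller why it failed. That is why the call sites above flip polarity, e.g. `if (!task_call_func(t, trc_inspect_reader, NULL))` now means success. A minimal stand-alone sketch of the convention (hypothetical names, not kernel code):

```c
#include <assert.h>
#include <errno.h>

/*
 * Old style: a bool callback, so "false" conflated "task is running"
 * with every other reason to decline.  New style: distinct errno
 * values let the caller tell the failure modes apart.
 */
static int check_task(int task_is_running, int stack_reliable)
{
	if (task_is_running)
		return -EBUSY;	/* decline: it is on a CPU */
	if (!stack_reliable)
		return -EINVAL;	/* decline: cannot inspect the stack */
	return 0;		/* inspected successfully */
}
```

With this shape, a caller that only cares about success keeps a one-line test (`if (!ret)`), while a caller that wants diagnostics can switch on the specific code.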
Instead of frobbing around with scheduler internals, use the shiny new task_call_func() interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/livepatch/transition.c | 90 ++++++++++++++++++++---------------------- 1 file changed, 44 insertions(+), 46 deletions(-) --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -13,7 +13,6 @@ #include "core.h" #include "patch.h" #include "transition.h" -#include "../sched/sched.h" #define MAX_STACK_ENTRIES 100 #define STACK_ERR_BUF_SIZE 128 @@ -240,7 +239,7 @@ static int klp_check_stack_func(struct k * Determine whether it's safe to transition the task to the target patch state * by looking for any to-be-patched or to-be-unpatched functions on its stack. */ -static int klp_check_stack(struct task_struct *task, char *err_buf) +static int klp_check_stack(struct task_struct *task, const char **oldname) { static unsigned long entries[MAX_STACK_ENTRIES]; struct klp_object *obj; @@ -248,12 +247,8 @@ static int klp_check_stack(struct task_s int ret, nr_entries; ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries)); - if (ret < 0) { - snprintf(err_buf, STACK_ERR_BUF_SIZE, - "%s: %s:%d has an unreliable stack\n", - __func__, task->comm, task->pid); - return ret; - } + if (ret < 0) + return -EINVAL; nr_entries = ret; klp_for_each_object(klp_transition_patch, obj) { @@ -262,11 +257,8 @@ static int klp_check_stack(struct task_s klp_for_each_func(obj, func) { ret = klp_check_stack_func(func, entries, nr_entries); if (ret) { - snprintf(err_buf, STACK_ERR_BUF_SIZE, - "%s: %s:%d is sleeping on function %s\n", - __func__, task->comm, task->pid, - func->old_name); - return ret; + *oldname = func->old_name; + return -EADDRINUSE; } } } @@ -274,6 +266,22 @@ static int klp_check_stack(struct task_s return 0; } +static int klp_check_and_switch_task(struct task_struct *task, void *arg) +{ + int ret; + + if (task_curr(task)) + return -EBUSY; + + ret = klp_check_stack(task, arg); + if 
(ret) + return ret; + + clear_tsk_thread_flag(task, TIF_PATCH_PENDING); + task->patch_state = klp_target_state; + return 0; +} + /* * Try to safely switch a task to the target patch state. If it's currently * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or @@ -281,13 +289,8 @@ static int klp_check_stack(struct task_s */ static bool klp_try_switch_task(struct task_struct *task) { - static char err_buf[STACK_ERR_BUF_SIZE]; - struct rq *rq; - struct rq_flags flags; + const char *old_name; int ret; - bool success = false; - - err_buf[0] = '\0'; /* check if this task has already switched over */ if (task->patch_state == klp_target_state) @@ -305,36 +308,31 @@ static bool klp_try_switch_task(struct t * functions. If all goes well, switch the task to the target patch * state. */ - rq = task_rq_lock(task, &flags); + ret = task_call_func(task, klp_check_and_switch_task, &old_name); + switch (ret) { + case 0: /* success */ + break; - if (task_running(rq, task) && task != current) { - snprintf(err_buf, STACK_ERR_BUF_SIZE, - "%s: %s:%d is running\n", __func__, task->comm, - task->pid); - goto done; + case -EBUSY: /* klp_check_and_switch_task() */ + pr_debug("%s: %s:%d is running\n", + __func__, task->comm, task->pid); + break; + case -EINVAL: /* klp_check_and_switch_task() */ + pr_debug("%s: %s:%d has an unreliable stack\n", + __func__, task->comm, task->pid); + break; + case -EADDRINUSE: /* klp_check_and_switch_task() */ + pr_debug("%s: %s:%d is sleeping on function %s\n", + __func__, task->comm, task->pid, old_name); + break; + + default: + pr_debug("%s: Unknown error code (%d) when trying to switch %s:%d\n", + __func__, ret, task->comm, task->pid); + break; } - ret = klp_check_stack(task, err_buf); - if (ret) - goto done; - - success = true; - - clear_tsk_thread_flag(task, TIF_PATCH_PENDING); - task->patch_state = klp_target_state; - -done: - task_rq_unlock(rq, task, &flags); - - /* - * Due to console deadlock issues, pr_debug() can't be used 
while - * holding the task rq lock. Instead we have to use a temporary buffer - * and print the debug message after releasing the lock. - */ - if (err_buf[0] != '\0') - pr_debug("%s", err_buf); - - return success; + return !ret; } /*
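The switch above maps each error code back to the message the old err_buf carried; since task_call_func() has returned before any printing happens, no rq lock is held and pr_debug() is safe, so the temporary buffer can go. A sketch of the same mapping as a plain helper (the function name here is made up for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Mirrors the switch in klp_try_switch_task(); NULL means success. */
static const char *klp_switch_failure(int ret)
{
	switch (ret) {
	case 0:
		return NULL;		/* task switched successfully */
	case -EBUSY:
		return "is running";
	case -EINVAL:
		return "has an unreliable stack";
	case -EADDRINUSE:
		return "is sleeping on a to-be-patched function";
	default:
		return "unknown error";
	}
}
```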
Simplify and make wake_up_if_idle() more robust; also don't iterate the whole machine with preempt_disable() in its caller: wake_up_all_idle_cpus(). This prepares for another wake_up_if_idle() user that needs a full do_idle() cycle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 14 +++++--------- kernel/smp.c | 6 +++--- 2 files changed, 8 insertions(+), 12 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3691,15 +3691,11 @@ void wake_up_if_idle(int cpu) if (!is_idle_task(rcu_dereference(rq->curr))) goto out; - if (set_nr_if_polling(rq->idle)) { - trace_sched_wake_idle_without_ipi(cpu); - } else { - rq_lock_irqsave(rq, &rf); - if (is_idle_task(rq->curr)) - smp_send_reschedule(cpu); - /* Else CPU is not idle, do nothing here: */ - rq_unlock_irqrestore(rq, &rf); - } + rq_lock_irqsave(rq, &rf); + if (is_idle_task(rq->curr)) + resched_curr(rq); + /* Else CPU is not idle, do nothing here: */ + rq_unlock_irqrestore(rq, &rf); out: rcu_read_unlock(); --- a/kernel/smp.c +++ b/kernel/smp.c @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void) { int cpu; - preempt_disable(); + cpus_read_lock(); for_each_online_cpu(cpu) { - if (cpu == smp_processor_id()) + if (cpu == raw_smp_processor_id()) continue; wake_up_if_idle(cpu); } - preempt_enable(); + cpus_read_unlock(); } EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
Make sure to prod idle CPUs so they call klp_update_patch_state(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/livepatch/transition.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -413,8 +413,11 @@ void klp_try_complete_transition(void) for_each_possible_cpu(cpu) { task = idle_task(cpu); if (cpu_online(cpu)) { - if (!klp_try_switch_task(task)) + if (!klp_try_switch_task(task)) { complete = false; + /* Make idle task go through the main loop. */ + wake_up_if_idle(cpu); + } } else if (task->patch_state != klp_target_state) { /* offline idle tasks can be switched immediately */ clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
Put a ct_ prefix on a bunch of context_tracking functions for better namespacing / greppability. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/Kconfig | 6 +++--- arch/arm64/kernel/entry-common.c | 4 ++-- arch/mips/kernel/ptrace.c | 6 +++--- arch/mips/kernel/signal.c | 4 ++-- arch/powerpc/include/asm/interrupt.h | 2 +- arch/powerpc/kernel/interrupt.c | 10 +++++----- arch/sparc/kernel/ptrace_64.c | 6 +++--- arch/sparc/kernel/signal_64.c | 4 ++-- include/linux/context_tracking.h | 16 ++++++++-------- include/linux/context_tracking_state.h | 2 +- include/trace/events/context_tracking.h | 8 ++++---- kernel/context_tracking.c | 6 +++--- kernel/entry/common.c | 4 ++-- kernel/livepatch/transition.c | 2 +- kernel/sched/core.c | 4 ++-- kernel/trace/ftrace.c | 4 ++-- 16 files changed, 44 insertions(+), 44 deletions(-) --- a/arch/Kconfig +++ b/arch/Kconfig @@ -760,7 +760,7 @@ config HAVE_CONTEXT_TRACKING help Provide kernel/user boundaries probes necessary for subsystems that need it, such as userspace RCU extended quiescent state. - Syscalls need to be wrapped inside user_exit()-user_enter(), either + Syscalls need to be wrapped inside ct_user_exit()-ct_user_enter(), either optimized behind static key or through the slow path using TIF_NOHZ flag. Exceptions handlers must be wrapped as well. Irqs are already protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal @@ -774,7 +774,7 @@ config HAVE_CONTEXT_TRACKING_OFFSTACK preempt_schedule_irq() can't be called in a preemptible section while context tracking is CONTEXT_USER. This feature reflects a sane entry implementation where the following requirements are met on - critical entry code, ie: before user_exit() or after user_enter(): + critical entry code, ie: before ct_user_exit() or after ct_user_enter(): - Critical entry code isn't preemptible (or better yet: not interruptible). 
@@ -787,7 +787,7 @@ config HAVE_TIF_NOHZ bool help Arch relies on TIF_NOHZ and syscall slow path to implement context - tracking calls to user_enter()/user_exit(). + tracking calls to ct_user_enter()/ct_user_exit(). config HAVE_VIRT_CPU_ACCOUNTING bool --- a/arch/arm64/kernel/entry-common.c +++ b/arch/arm64/kernel/entry-common.c @@ -100,7 +100,7 @@ static __always_inline void __enter_from { lockdep_hardirqs_off(CALLER_ADDR0); CT_WARN_ON(ct_state() != CONTEXT_USER); - user_exit_irqoff(); + ct_user_exit_irqoff(); trace_hardirqs_off_finish(); } @@ -118,7 +118,7 @@ static __always_inline void __exit_to_us { trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(CALLER_ADDR0); - user_enter_irqoff(); + ct_user_enter_irqoff(); lockdep_hardirqs_on(CALLER_ADDR0); } --- a/arch/mips/kernel/ptrace.c +++ b/arch/mips/kernel/ptrace.c @@ -1312,7 +1312,7 @@ long arch_ptrace(struct task_struct *chi */ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall) { - user_exit(); + ct_user_exit(); current_thread_info()->syscall = syscall; @@ -1368,7 +1368,7 @@ asmlinkage void syscall_trace_leave(stru * or do_notify_resume(), in which case we can be in RCU * user mode. 
*/ - user_exit(); + ct_user_exit(); audit_syscall_exit(regs); @@ -1378,5 +1378,5 @@ asmlinkage void syscall_trace_leave(stru if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall_exit(regs, 0); - user_enter(); + ct_user_enter(); } --- a/arch/mips/kernel/signal.c +++ b/arch/mips/kernel/signal.c @@ -897,7 +897,7 @@ asmlinkage void do_notify_resume(struct { local_irq_enable(); - user_exit(); + ct_user_exit(); if (thread_info_flags & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -911,7 +911,7 @@ asmlinkage void do_notify_resume(struct rseq_handle_notify_resume(NULL, regs); } - user_enter(); + ct_user_enter(); } #if defined(CONFIG_SMP) && defined(CONFIG_MIPS_FP_SUPPORT) --- a/arch/powerpc/include/asm/interrupt.h +++ b/arch/powerpc/include/asm/interrupt.h @@ -154,7 +154,7 @@ static inline void interrupt_enter_prepa if (user_mode(regs)) { CT_WARN_ON(ct_state() != CONTEXT_USER); - user_exit_irqoff(); + ct_user_exit_irqoff(); account_cpu_user_entry(); account_stolen_time(); --- a/arch/powerpc/kernel/interrupt.c +++ b/arch/powerpc/kernel/interrupt.c @@ -91,7 +91,7 @@ notrace long system_call_exception(long trace_hardirqs_off(); /* finish reconciling */ CT_WARN_ON(ct_state() == CONTEXT_KERNEL); - user_exit_irqoff(); + ct_user_exit_irqoff(); BUG_ON(regs_is_unrecoverable(regs)); BUG_ON(!(regs->msr & MSR_PR)); @@ -388,9 +388,9 @@ interrupt_exit_user_prepare_main(unsigne check_return_regs_valid(regs); - user_enter_irqoff(); + ct_user_enter_irqoff(); if (!prep_irq_for_enabled_exit(true)) { - user_exit_irqoff(); + ct_user_exit_irqoff(); local_irq_enable(); local_irq_disable(); goto again; @@ -489,7 +489,7 @@ notrace unsigned long syscall_exit_resta #endif trace_hardirqs_off(); - user_exit_irqoff(); + ct_user_exit_irqoff(); account_cpu_user_entry(); BUG_ON(!user_mode(regs)); @@ -638,7 +638,7 @@ notrace unsigned long interrupt_exit_use #endif trace_hardirqs_off(); - user_exit_irqoff(); + ct_user_exit_irqoff(); account_cpu_user_entry(); BUG_ON(!user_mode(regs)); --- 
a/arch/sparc/kernel/ptrace_64.c +++ b/arch/sparc/kernel/ptrace_64.c @@ -1092,7 +1092,7 @@ asmlinkage int syscall_trace_enter(struc secure_computing_strict(regs->u_regs[UREG_G1]); if (test_thread_flag(TIF_NOHZ)) - user_exit(); + ct_user_exit(); if (test_thread_flag(TIF_SYSCALL_TRACE)) ret = tracehook_report_syscall_entry(regs); @@ -1110,7 +1110,7 @@ asmlinkage int syscall_trace_enter(struc asmlinkage void syscall_trace_leave(struct pt_regs *regs) { if (test_thread_flag(TIF_NOHZ)) - user_exit(); + ct_user_exit(); audit_syscall_exit(regs); @@ -1121,7 +1121,7 @@ asmlinkage void syscall_trace_leave(stru tracehook_report_syscall_exit(regs, 0); if (test_thread_flag(TIF_NOHZ)) - user_enter(); + ct_user_enter(); } /** --- a/arch/sparc/kernel/signal_64.c +++ b/arch/sparc/kernel/signal_64.c @@ -546,14 +546,14 @@ static void do_signal(struct pt_regs *re void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long thread_info_flags) { - user_exit(); + ct_user_exit(); if (thread_info_flags & _TIF_UPROBE) uprobe_notify_resume(regs); if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) do_signal(regs, orig_i0); if (thread_info_flags & _TIF_NOTIFY_RESUME) tracehook_notify_resume(regs); - user_enter(); + ct_user_enter(); } /* --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -22,26 +22,26 @@ extern void context_tracking_exit(enum c extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); -static inline void user_enter(void) +static inline void ct_user_enter(void) { if (context_tracking_enabled()) context_tracking_enter(CONTEXT_USER); } -static inline void user_exit(void) +static inline void ct_user_exit(void) { if (context_tracking_enabled()) context_tracking_exit(CONTEXT_USER); } /* Called with interrupts disabled. 
*/ -static __always_inline void user_enter_irqoff(void) +static __always_inline void ct_user_enter_irqoff(void) { if (context_tracking_enabled()) __context_tracking_enter(CONTEXT_USER); } -static __always_inline void user_exit_irqoff(void) +static __always_inline void ct_user_exit_irqoff(void) { if (context_tracking_enabled()) __context_tracking_exit(CONTEXT_USER); @@ -98,10 +98,10 @@ static __always_inline enum ctx_state ct this_cpu_read(context_tracking.state) : CONTEXT_DISABLED; } #else -static inline void user_enter(void) { } -static inline void user_exit(void) { } -static inline void user_enter_irqoff(void) { } -static inline void user_exit_irqoff(void) { } +static inline void ct_user_enter(void) { } +static inline void ct_user_exit(void) { } +static inline void ct_user_enter_irqoff(void) { } +static inline void ct_user_exit_irqoff(void) { } static inline enum ctx_state exception_enter(void) { return 0; } static inline void exception_exit(enum ctx_state prev_ctx) { } static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; } --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -9,7 +9,7 @@ struct context_tracking { /* * When active is false, probes are unset in order * to minimize overhead: TIF flags are cleared - * and calls to user_enter/exit are ignored. This + * and calls to ct_user_enter/exit are ignored. This * may be further optimized using static keys. */ bool active; --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -73,7 +73,7 @@ void noinstr __context_tracking_enter(en * At this stage, only low level arch entry code remains and * then we'll run in userspace. We can assume there won't be * any RCU read-side critical section until the next call to - * user_exit() or rcu_irq_enter(). Let's remove RCU's dependency + * ct_user_exit() or rcu_irq_enter(). Let's remove RCU's dependency * on the tick. 
*/ if (state == CONTEXT_USER) { @@ -127,7 +127,7 @@ EXPORT_SYMBOL_GPL(context_tracking_enter void context_tracking_user_enter(void) { - user_enter(); + ct_user_enter(); } NOKPROBE_SYMBOL(context_tracking_user_enter); @@ -184,7 +184,7 @@ EXPORT_SYMBOL_GPL(context_tracking_exit) void context_tracking_user_exit(void) { - user_exit(); + ct_user_exit(); } NOKPROBE_SYMBOL(context_tracking_user_exit); --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -19,7 +19,7 @@ static __always_inline void __enter_from lockdep_hardirqs_off(CALLER_ADDR0); CT_WARN_ON(ct_state() != CONTEXT_USER); - user_exit_irqoff(); + ct_user_exit_irqoff(); instrumentation_begin(); trace_hardirqs_off_finish(); @@ -127,7 +127,7 @@ static __always_inline void __exit_to_us lockdep_hardirqs_on_prepare(CALLER_ADDR0); instrumentation_end(); - user_enter_irqoff(); + ct_user_enter_irqoff(); arch_exit_to_user_mode(); lockdep_hardirqs_on(CALLER_ADDR0); } --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -51,7 +51,7 @@ static void klp_sync(struct work_struct /* * We allow to patch also functions where RCU is not watching, - * e.g. before user_exit(). We can not rely on the RCU infrastructure + * e.g. before ct_user_exit(). We can not rely on the RCU infrastructure * to do the synchronization. Instead hard force the sched synchronization. * * This approach allows to use RCU functions for manipulating func_stack --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6444,7 +6444,7 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedul * recursion and tracing preempt enabling caused by the tracing * infrastructure itself. But as tracing can happen in areas coming * from userspace or just about to enter userspace, a preempt enable - * can occur before user_exit() is called. This will cause the scheduler + * can occur before ct_user_exit() is called. This will cause the scheduler * to be called when the system is still in usermode. 
* * To prevent this, the preempt_enable_notrace will use this function @@ -6475,7 +6475,7 @@ asmlinkage __visible void __sched notrac preempt_disable_notrace(); preempt_latency_start(1); /* - * Needs preempt disabled in case user_exit() is traced + * Needs preempt disabled in case ct_user_exit() is traced * and the tracer calls preempt_enable_notrace() causing * an infinite recursion. */ --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2996,7 +2996,7 @@ int ftrace_shutdown(struct ftrace_ops *o * We need to do a hard force of sched synchronization. * This is because we use preempt_disable() to do RCU, but * the function tracers can be called where RCU is not watching - * (like before user_exit()). We can not rely on the RCU + * (like before ct_user_exit()). We can not rely on the RCU * infrastructure to do the synchronization, thus we must do it * ourselves. */ @@ -5981,7 +5981,7 @@ ftrace_graph_release(struct inode *inode * We need to do a hard force of sched synchronization. * This is because we use preempt_disable() to do RCU, but * the function tracers can be called where RCU is not watching - * (like before user_exit()). We can not rely on the RCU + * (like before ct_user_exit()). We can not rely on the RCU * infrastructure to do the synchronization, thus we must do it * ourselves. */
Similar to dynticks RCU, add a sequence count that tracks USER/GUEST and NMI state. Unlike RCU, use a few more state bits. It would be possible to, like dynticks RCU, fold the USER and NMI bits; for now don't bother and keep them explicit (doing this wouldn't be terribly difficult; it would require __context_tracking_nmi_{enter,exit}() to conditionally update the state). Additionally, use bit0 to indicate there's additional work to be done on leaving the 'USER' state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/context_tracking.h | 60 +++++++++++++++++++++ include/linux/context_tracking_state.h | 3 + kernel/context_tracking.c | 93 +++++++++++++++++++++++++++++++++ kernel/entry/common.c | 2 4 files changed, 158 insertions(+) --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -9,19 +9,47 @@ #include <asm/ptrace.h> +enum ct_work { + CT_WORK_n = 0, +}; + +/* + * context_tracking::seq + * + * bit0 - work + * bit1 - nmi + * bit2 - user + * + */ +enum ct_seq_state { + CT_SEQ_WORK = 0x01, + CT_SEQ_NMI = 0x02, + CT_SEQ_USER = 0x04, + CT_SEQ = 0x08, +}; + +static __always_inline bool __context_tracking_seq_in_user(unsigned int seq) +{ + return (seq & (CT_SEQ_USER | CT_SEQ_NMI)) == CT_SEQ_USER; +} #ifdef CONFIG_CONTEXT_TRACKING + extern void context_tracking_cpu_set(int cpu); /* Called with interrupts disabled. 
*/ extern void __context_tracking_enter(enum ctx_state state); extern void __context_tracking_exit(enum ctx_state state); +extern void __context_tracking_nmi_enter(void); +extern void __context_tracking_nmi_exit(void); extern void context_tracking_enter(enum ctx_state state); extern void context_tracking_exit(enum ctx_state state); extern void context_tracking_user_enter(void); extern void context_tracking_user_exit(void); +extern bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work); + static inline void ct_user_enter(void) { if (context_tracking_enabled()) @@ -47,6 +75,17 @@ static __always_inline void ct_user_exit __context_tracking_exit(CONTEXT_USER); } +static __always_inline void ct_nmi_enter_irqoff(void) +{ + if (context_tracking_enabled()) + __context_tracking_nmi_enter(); +} +static __always_inline void ct_nmi_exit_irqoff(void) +{ + if (context_tracking_enabled()) + __context_tracking_nmi_exit(); +} + static inline enum ctx_state exception_enter(void) { enum ctx_state prev_ctx; @@ -97,19 +136,40 @@ static __always_inline enum ctx_state ct return context_tracking_enabled() ? 
this_cpu_read(context_tracking.state) : CONTEXT_DISABLED; } + +static __always_inline unsigned int __context_tracking_cpu_seq(unsigned int cpu) +{ + return arch_atomic_read(per_cpu_ptr(&context_tracking.seq, cpu)); +} + #else static inline void ct_user_enter(void) { } static inline void ct_user_exit(void) { } static inline void ct_user_enter_irqoff(void) { } static inline void ct_user_exit_irqoff(void) { } +static inline void ct_nmi_enter_irqoff(void) { } +static inline void ct_nmi_exit_irqoff(void) { } static inline enum ctx_state exception_enter(void) { return 0; } static inline void exception_exit(enum ctx_state prev_ctx) { } static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; } static inline bool context_tracking_guest_enter(void) { return false; } static inline void context_tracking_guest_exit(void) { } +static inline bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work) { return false; } + +static __always_inline unsigned int __context_tracking_cpu_seq(unsigned int cpu) +{ + return 0; +} + #endif /* !CONFIG_CONTEXT_TRACKING */ +static __always_inline bool context_tracking_cpu_in_user(unsigned int cpu) +{ + unsigned int seq = __context_tracking_cpu_seq(cpu); + return __context_tracking_seq_in_user(seq); +} + #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond)) #ifdef CONFIG_CONTEXT_TRACKING_FORCE --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -4,6 +4,7 @@ #include <linux/percpu.h> #include <linux/static_key.h> +#include <linux/types.h> struct context_tracking { /* @@ -13,6 +14,8 @@ struct context_tracking { * may be further optimized using static keys. 
*/ bool active; + atomic_t seq; + atomic_t work; int recursion; enum ctx_state { CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */ --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -50,6 +50,85 @@ static __always_inline void context_trac __this_cpu_dec(context_tracking.recursion); } +/* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */ +static noinstr void ct_exit_user_work(struct context_tracking *ct) +{ + unsigned int work = arch_atomic_read(&ct->work); + +#if 0 + if (work & CT_WORK_n) { + /* NMI happens here and must still do/finish CT_WORK_n */ + do_work_n(); + + smp_mb__before_atomic(); + arch_atomic_andnot(CT_WORK_n, &ct->work); + } +#endif + + smp_mb__before_atomic(); + arch_atomic_andnot(CT_SEQ_WORK, &ct->seq); +} + +/* all CPU local */ + +static __always_inline unsigned int ct_seq_nmi_enter(struct context_tracking *ct) +{ + unsigned int seq = arch_atomic_add_return(CT_SEQ_NMI, &ct->seq); + if (seq & CT_SEQ_WORK) /* NMI-enter is USER-exit */ + ct_exit_user_work(ct); + return seq; +} + +static __always_inline unsigned int ct_seq_nmi_exit(struct context_tracking *ct) +{ + arch_atomic_set(&ct->work, 0); + return arch_atomic_add_return(CT_SEQ - CT_SEQ_NMI, &ct->seq); +} + +static __always_inline unsigned int ct_seq_user_enter(struct context_tracking *ct) +{ + arch_atomic_set(&ct->work, 0); + return arch_atomic_add_return(CT_SEQ_USER, &ct->seq); +} + +static __always_inline unsigned int ct_seq_user_exit(struct context_tracking *ct) +{ + unsigned int seq = arch_atomic_add_return(CT_SEQ - CT_SEQ_USER, &ct->seq); + if (seq & CT_SEQ_WORK) + ct_exit_user_work(ct); + return seq; +} + +/* remote */ + +/* + * When this returns true: guarantees that the CPU will call @work + */ +static bool ct_seq_set_user_work(struct context_tracking *ct, unsigned int work) +{ + unsigned int seq; + bool ret = false; + + if (!context_tracking_enabled() || !ct->active) + return false; + + preempt_disable(); + seq = 
atomic_read(&ct->seq); + if (__context_tracking_seq_in_user(seq)) { + /* ctrl-dep */ + atomic_or(work, &ct->work); + ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK); + } + preempt_enable(); + + return ret; +} + +bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work) +{ + return ct_seq_set_user_work(per_cpu_ptr(&context_tracking, cpu), work); +} + /** * context_tracking_enter - Inform the context tracking that the CPU is going * enter user or guest space mode. @@ -83,6 +162,7 @@ void noinstr __context_tracking_enter(en instrumentation_end(); } rcu_user_enter(); + ct_seq_user_enter(raw_cpu_ptr(&context_tracking)); } /* * Even if context tracking is disabled on this CPU, because it's outside @@ -154,6 +234,7 @@ void noinstr __context_tracking_exit(enu * We are going to run code that may use RCU. Inform * RCU core about that (ie: we may need the tick again). */ + ct_seq_user_exit(raw_cpu_ptr(&context_tracking)); rcu_user_exit(); if (state == CONTEXT_USER) { instrumentation_begin(); @@ -188,6 +269,18 @@ void context_tracking_user_exit(void) } NOKPROBE_SYMBOL(context_tracking_user_exit); +void noinstr __context_tracking_nmi_enter(void) +{ + if (__this_cpu_read(context_tracking.active)) + ct_seq_nmi_enter(raw_cpu_ptr(&context_tracking)); +} + +void noinstr __context_tracking_nmi_exit(void) +{ + if (__this_cpu_read(context_tracking.active)) + ct_seq_nmi_exit(raw_cpu_ptr(&context_tracking)); +} + void __init context_tracking_cpu_set(int cpu) { static __initdata bool initialized = false; --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -451,6 +451,7 @@ irqentry_state_t noinstr irqentry_nmi_en __nmi_enter(); lockdep_hardirqs_off(CALLER_ADDR0); lockdep_hardirq_enter(); + ct_nmi_enter_irqoff(); rcu_nmi_enter(); instrumentation_begin(); @@ -472,6 +473,7 @@ void noinstr irqentry_nmi_exit(struct pt instrumentation_end(); rcu_nmi_exit(); + ct_nmi_exit_irqoff(); lockdep_hardirq_exit(); if (irq_state.lockdep) lockdep_hardirqs_on(CALLER_ADDR0);
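The seq/work handshake in the hunk above is dense, so here is a user-space model of the protocol. This is only an editorial sketch, not kernel code: the CT_SEQ_* values are assumptions (their definitions are not in the quoted hunks, but the low three bits as state plus a generation step of 8 matches the `& ~0x7` in patch 8), C11 `<stdatomic.h>` stands in for `arch_atomic_*()`, and the memory-ordering details (`smp_mb__before_atomic()` etc.) are elided.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bit layout; the real definitions are not shown in this thread. */
#define CT_SEQ_USER 0x1u /* set while the CPU runs in userspace */
#define CT_SEQ_NMI  0x2u
#define CT_SEQ_WORK 0x4u /* remote work pending for the next user-exit */
#define CT_SEQ      0x8u /* one full generation step */

static atomic_uint ct_seq;       /* starts at 0: in-kernel */
static atomic_uint ct_work;
static unsigned int work_done;   /* records what ran at user-exit */

static bool seq_in_user(unsigned int seq)
{
	return (seq & (CT_SEQ_USER | CT_SEQ_NMI)) == CT_SEQ_USER;
}

static unsigned int model_user_enter(void)
{
	atomic_store(&ct_work, 0);
	return atomic_fetch_add(&ct_seq, CT_SEQ_USER) + CT_SEQ_USER;
}

static unsigned int model_user_exit(void)
{
	unsigned int seq = atomic_fetch_add(&ct_seq, CT_SEQ - CT_SEQ_USER)
			   + (CT_SEQ - CT_SEQ_USER);
	if (seq & CT_SEQ_WORK) {
		/* "do the work", then clear the flag */
		work_done |= atomic_exchange(&ct_work, 0);
		atomic_fetch_and(&ct_seq, ~CT_SEQ_WORK);
	}
	return seq;
}

/* Remote side: only succeeds while the CPU is (still) in userspace. */
static bool model_set_user_work(unsigned int work)
{
	unsigned int seq = atomic_load(&ct_seq);

	if (!seq_in_user(seq))
		return false;
	atomic_fetch_or(&ct_work, work);
	/* The cmpxchg fails if the CPU left userspace in the meantime. */
	return atomic_compare_exchange_strong(&ct_seq, &seq, seq | CT_SEQ_WORK);
}
```

The point of the cmpxchg on ->seq (rather than ->work) is that any user-exit or NMI-entry bumps ->seq, so a successful cmpxchg proves the target CPU was in userspace for the whole window and must pass through `ct_exit_user_work()` before running normal kernel code.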
XXX I'm pretty sure I broke task-trace-rcu. XXX trace_rcu_*() now gets an unconditional 0 Other than that, it seems like a fairly straight-forward replacement of the RCU count with the context_tracking count. Using context-tracking for this avoids having two (expensive) atomic ops on the entry paths where one will do. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/context_tracking.h | 6 ++ kernel/context_tracking.c | 14 +++++ kernel/rcu/tree.c | 101 +++++---------------------------------- kernel/rcu/tree.h | 1 4 files changed, 33 insertions(+), 89 deletions(-) --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -50,6 +50,9 @@ extern void context_tracking_user_exit(v extern bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work); +extern void context_tracking_idle(void); +extern void context_tracking_online(void); + static inline void ct_user_enter(void) { if (context_tracking_enabled()) @@ -162,6 +165,9 @@ static __always_inline unsigned int __co return 0; } +static inline void context_tracking_idle(void) { } +static inline void context_tracking_online(void) { } + #endif /* !CONFIG_CONTEXT_TRACKING */ static __always_inline bool context_tracking_cpu_in_user(unsigned int cpu) --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -281,6 +281,20 @@ void noinstr __context_tracking_nmi_exit ct_seq_nmi_exit(raw_cpu_ptr(&context_tracking)); } +void context_tracking_online(void) +{ + struct context_tracking *ct = raw_cpu_ptr(&context_tracking); + unsigned int seq = atomic_read(&ct->seq); + + if (__context_tracking_seq_in_user(seq)) + atomic_add_return(CT_SEQ - CT_SEQ_USER, &ct->seq); +} + +void context_tracking_idle(void) +{ + atomic_add_return(CT_SEQ, &raw_cpu_ptr(&context_tracking)->seq); +} + void __init context_tracking_cpu_set(int cpu) { static __initdata bool initialized = false; --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -62,6 +62,7 @@ #include <linux/vmalloc.h> 
#include <linux/mm.h> #include <linux/kasan.h> +#include <linux/context_tracking.h> #include "../time/tick-internal.h" #include "tree.h" @@ -77,7 +78,6 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { .dynticks_nesting = 1, .dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE, - .dynticks = ATOMIC_INIT(1), #ifdef CONFIG_RCU_NOCB_CPU .cblist.flags = SEGCBLIST_SOFTIRQ_ONLY, #endif @@ -252,56 +252,6 @@ void rcu_softirq_qs(void) } /* - * Increment the current CPU's rcu_data structure's ->dynticks field - * with ordering. Return the new value. - */ -static noinline noinstr unsigned long rcu_dynticks_inc(int incby) -{ - return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks)); -} - -/* - * Record entry into an extended quiescent state. This is only to be - * called when not already in an extended quiescent state, that is, - * RCU is watching prior to the call to this function and is no longer - * watching upon return. - */ -static noinstr void rcu_dynticks_eqs_enter(void) -{ - int seq; - - /* - * CPUs seeing atomic_add_return() must see prior RCU read-side - * critical sections, and we also must force ordering with the - * next idle sojourn. - */ - rcu_dynticks_task_trace_enter(); // Before ->dynticks update! - seq = rcu_dynticks_inc(1); - // RCU is no longer watching. Better be in extended quiescent state! - WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1)); -} - -/* - * Record exit from an extended quiescent state. This is only to be - * called from an extended quiescent state, that is, RCU is not watching - * prior to the call to this function and is watching upon return. - */ -static noinstr void rcu_dynticks_eqs_exit(void) -{ - int seq; - - /* - * CPUs seeing atomic_add_return() must see prior idle sojourns, - * and we also must force ordering with the next RCU read-side - * critical section. - */ - seq = rcu_dynticks_inc(1); - // RCU is now watching. Better not be in an extended quiescent state! 
- rcu_dynticks_task_trace_exit(); // After ->dynticks update! - WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1)); -} - -/* * Reset the current CPU's ->dynticks counter to indicate that the * newly onlined CPU is no longer in an extended quiescent state. * This will either leave the counter unchanged, or increment it @@ -313,11 +263,7 @@ static noinstr void rcu_dynticks_eqs_exi */ static void rcu_dynticks_eqs_online(void) { - struct rcu_data *rdp = this_cpu_ptr(&rcu_data); - - if (atomic_read(&rdp->dynticks) & 0x1) - return; - rcu_dynticks_inc(1); + context_tracking_online(); } /* @@ -327,7 +273,7 @@ static void rcu_dynticks_eqs_online(void */ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void) { - return !(atomic_read(this_cpu_ptr(&rcu_data.dynticks)) & 0x1); + return context_tracking_cpu_in_user(smp_processor_id()); } /* @@ -337,7 +283,7 @@ static __always_inline bool rcu_dynticks static int rcu_dynticks_snap(struct rcu_data *rdp) { smp_mb(); // Fundamental RCU ordering guarantee. - return atomic_read_acquire(&rdp->dynticks); + return __context_tracking_cpu_seq(rdp->cpu); } /* @@ -346,7 +292,7 @@ static int rcu_dynticks_snap(struct rcu_ */ static bool rcu_dynticks_in_eqs(int snap) { - return !(snap & 0x1); + return __context_tracking_seq_in_user(snap); } /* Return true if the specified CPU is currently idle from an RCU viewpoint. */ @@ -377,7 +323,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i int snap; // If not quiescent, force back to earlier extended quiescent state. - snap = atomic_read(&rdp->dynticks) & ~0x1; + snap = __context_tracking_cpu_seq(rdp->cpu) & ~0x7; smp_rmb(); // Order ->dynticks and *vp reads. if (READ_ONCE(*vp)) @@ -385,7 +331,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i smp_rmb(); // Order *vp read and ->dynticks re-read. // If still in the same extended quiescent state, we are good! 
- return snap == atomic_read(&rdp->dynticks); + return snap == __context_tracking_cpu_seq(rdp->cpu); } /* @@ -401,12 +347,8 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i */ notrace void rcu_momentary_dyntick_idle(void) { - int seq; - raw_cpu_write(rcu_data.rcu_need_heavy_qs, false); - seq = rcu_dynticks_inc(2); - /* It is illegal to call this from idle state. */ - WARN_ON_ONCE(!(seq & 0x1)); + context_tracking_idle(); rcu_preempt_deferred_qs(current); } EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle); @@ -622,18 +564,15 @@ static noinstr void rcu_eqs_enter(bool u lockdep_assert_irqs_disabled(); instrumentation_begin(); - trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks)); + trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, 0); WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current)); rcu_prepare_for_idle(); rcu_preempt_deferred_qs(current); - // instrumentation for the noinstr rcu_dynticks_eqs_enter() - instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks)); instrumentation_end(); WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */ // RCU is watching here ... - rcu_dynticks_eqs_enter(); // ... but is no longer watching here. rcu_dynticks_task_enter(); } @@ -756,8 +695,7 @@ noinstr void rcu_nmi_exit(void) * leave it in non-RCU-idle state. */ if (rdp->dynticks_nmi_nesting != 1) { - trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, - atomic_read(&rdp->dynticks)); + trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, 0); WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */ rdp->dynticks_nmi_nesting - 2); instrumentation_end(); @@ -765,18 +703,15 @@ noinstr void rcu_nmi_exit(void) } /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. 
*/ - trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks)); + trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, 0); WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */ if (!in_nmi()) rcu_prepare_for_idle(); - // instrumentation for the noinstr rcu_dynticks_eqs_enter() - instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks)); instrumentation_end(); // RCU is watching here ... - rcu_dynticks_eqs_enter(); // ... but is no longer watching here. if (!in_nmi()) @@ -865,15 +800,11 @@ static void noinstr rcu_eqs_exit(bool us } rcu_dynticks_task_exit(); // RCU is not watching here ... - rcu_dynticks_eqs_exit(); // ... but is watching here. instrumentation_begin(); - // instrumentation for the noinstr rcu_dynticks_eqs_exit() - instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks)); - rcu_cleanup_after_idle(); - trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks)); + trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, 0); WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current)); WRITE_ONCE(rdp->dynticks_nesting, 1); WARN_ON_ONCE(rdp->dynticks_nmi_nesting); @@ -1011,7 +942,6 @@ noinstr void rcu_nmi_enter(void) rcu_dynticks_task_exit(); // RCU is not watching here ... - rcu_dynticks_eqs_exit(); // ... but is watching here. if (!in_nmi()) { @@ -1021,11 +951,6 @@ noinstr void rcu_nmi_enter(void) } instrumentation_begin(); - // instrumentation for the noinstr rcu_dynticks_curr_cpu_in_eqs() - instrument_atomic_read(&rdp->dynticks, sizeof(rdp->dynticks)); - // instrumentation for the noinstr rcu_dynticks_eqs_exit() - instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks)); - incby = 1; } else if (!in_nmi()) { instrumentation_begin(); @@ -1036,7 +961,7 @@ noinstr void rcu_nmi_enter(void) trace_rcu_dyntick(incby == 1 ? 
TPS("Endirq") : TPS("++="), rdp->dynticks_nmi_nesting, - rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks)); + rdp->dynticks_nmi_nesting + incby, 0); instrumentation_end(); WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */ rdp->dynticks_nmi_nesting + incby); --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -184,7 +184,6 @@ struct rcu_data { int dynticks_snap; /* Per-GP tracking for dynticks. */ long dynticks_nesting; /* Track process nesting level. */ long dynticks_nmi_nesting; /* Track irq/NMI nesting level. */ - atomic_t dynticks; /* Even value for idle, else odd. */ bool rcu_need_heavy_qs; /* GP old, so heavy quiescent state! */ bool rcu_urgent_qs; /* GP old need light quiescent state. */ bool rcu_forced_tick; /* Forced tick to provide QS. */
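The rcu_dynticks_zero_in_eqs() change above compares seq snapshots with the low three bits masked off. A tiny model of that snapshot logic, under the assumed layout (USER=0x1, NMI=0x2, WORK=0x4, generation step 0x8); note the sketch masks *both* reads, whereas the posted hunk masks only the snapshot and compares it to an unmasked re-read, which looks like it can never match while the CPU sits in nohz_full user (low bits 001):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed: low 3 bits are state/work, higher bits count generations. */
#define CT_SEQ_MASK 0x7u
#define CT_SEQ      0x8u

static atomic_uint ct_seq = 1; /* pretend the CPU is in nohz_full user */

static unsigned int snap_seq(void)
{
	return atomic_load(&ct_seq) & ~CT_SEQ_MASK;
}

/* True iff the CPU never left its extended quiescent state since @snap. */
static bool still_in_same_eqs(unsigned int snap)
{
	return snap == (atomic_load(&ct_seq) & ~CT_SEQ_MASK);
}

/* context_tracking_idle(): a momentary EQS bumps a full generation. */
static void momentary_idle(void)
{
	atomic_fetch_add(&ct_seq, CT_SEQ);
}
```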
Using the new context_tracking infrastructure, avoid disturbing userspace tasks when context tracking is enabled. When context_tracking_set_cpu_work() returns true, we have the guarantee that klp_update_patch_state() is called from noinstr code before it runs normal kernel code. This covers syscall/exceptions/interrupts and NMI entry. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/context_tracking.h | 2 +- include/linux/livepatch.h | 2 ++ kernel/context_tracking.c | 11 +++++------ kernel/livepatch/transition.c | 29 ++++++++++++++++++++++++++--- 4 files changed, 34 insertions(+), 10 deletions(-) --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -10,7 +10,7 @@ #include <asm/ptrace.h> enum ct_work { - CT_WORK_n = 0, + CT_WORK_KLP = 1, }; /* --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -201,6 +201,7 @@ void klp_module_going(struct module *mod void klp_copy_process(struct task_struct *child); void klp_update_patch_state(struct task_struct *task); +void __klp_update_patch_state(struct task_struct *task); static inline bool klp_patch_pending(struct task_struct *task) { @@ -242,6 +243,7 @@ static inline int klp_module_coming(stru static inline void klp_module_going(struct module *mod) {} static inline bool klp_patch_pending(struct task_struct *task) { return false; } static inline void klp_update_patch_state(struct task_struct *task) {} +static inline void __klp_update_patch_state(struct task_struct *task) {} static inline void klp_copy_process(struct task_struct *child) {} static inline --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -21,6 +21,7 @@ #include <linux/hardirq.h> #include <linux/export.h> #include <linux/kprobes.h> +#include <linux/livepatch.h> #define CREATE_TRACE_POINTS #include <trace/events/context_tracking.h> @@ -55,15 +56,13 @@ static noinstr void ct_exit_user_work(struct { unsigned int work = arch_atomic_read(&ct->work); -#if 0 - if (work & CT_WORK_n) 
{ + if (work & CT_WORK_KLP) { /* NMI happens here and must still do/finish CT_WORK_n */ - do_work_n(); + __klp_update_patch_state(current); smp_mb__before_atomic(); - arch_atomic_andnot(CT_WORK_n, &ct->work); + arch_atomic_andnot(CT_WORK_KLP, &ct->work); } -#endif smp_mb__before_atomic(); arch_atomic_andnot(CT_SEQ_WORK, &ct->seq); --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -10,6 +10,7 @@ #include <linux/cpu.h> #include <linux/stacktrace.h> #include <linux/tracehook.h> +#include <linux/context_tracking.h> #include "core.h" #include "patch.h" #include "transition.h" @@ -153,6 +154,11 @@ void klp_cancel_transition(void) klp_complete_transition(); } +noinstr void __klp_update_patch_state(struct task_struct *task) +{ + task->patch_state = READ_ONCE(klp_target_state); +} + /* * Switch the patched state of the task to the set of functions in the target * patch state. @@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_ * of func->transition, if klp_ftrace_handler() is called later on * the same CPU. See __klp_disable_patch(). */ - if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING)) + if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) { task->patch_state = READ_ONCE(klp_target_state); + clear_tsk_thread_flag(task, TIF_PATCH_PENDING); + } preempt_enable_notrace(); } @@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str { int ret; - if (task_curr(task)) + if (task_curr(task)) { + /* + * This only succeeds when the task is in NOHZ_FULL user + * mode, the true return value guarantees any kernel entry + * will call klp_update_patch_state(). + * + * XXX: ideally we'd simply return 0 here and leave + * TIF_PATCH_PENDING alone, to be fixed up by + * klp_update_patch_state(), except livepatching goes wobbly + * with 'pending' TIF bits on. 
+ */ + if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP)) + goto clear; + return -EBUSY; + } ret = klp_check_stack(task, arg); if (ret) return ret; - clear_tsk_thread_flag(task, TIF_PATCH_PENDING); task->patch_state = klp_target_state; +clear: + clear_tsk_thread_flag(task, TIF_PATCH_PENDING); return 0; }
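The control flow of the new klp_check_and_switch_task() path for a currently-running task can be sketched as below. Again a hedged user-space model, not the real code: `remote_cpu_in_user` stands in for the remote CPU's context-tracking state, and success of the (stubbed) `set_cpu_work()` models the guarantee that `__klp_update_patch_state()` runs on the CPU's next kernel entry, so the TIF flag may be cleared immediately.

```c
#include <stdatomic.h>
#include <stdbool.h>

static bool remote_cpu_in_user;                 /* model of remote CT state */
static atomic_bool tif_patch_pending = true;    /* model of TIF_PATCH_PENDING */

/* Stub for context_tracking_set_cpu_work(cpu, CT_WORK_KLP). */
static bool set_cpu_work(unsigned int work)
{
	return remote_cpu_in_user;
}

/*
 * Shape of klp_check_and_switch_task() for a running task: defer the
 * patch-state flip to the remote CPU's kernel entry when possible,
 * otherwise report -EBUSY and let the transition retry later.
 */
static int check_running_task(void)
{
	if (set_cpu_work(0x1 /* CT_WORK_KLP */)) {
		/* the 'goto clear' path */
		atomic_store(&tif_patch_pending, false);
		return 0;
	}
	return -16; /* -EBUSY */
}
```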
The whole premise is broken: any trace usage outside of RCU is a BUG and should be fixed. Notably, all code prior to (and a fair bit after) ct_user_exit() is noinstr and explicitly doesn't allow any tracing. Use regular RCU, which has the benefit of actually respecting NOHZ_FULL. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/livepatch/transition.c | 32 +++----------------------------- 1 file changed, 3 insertions(+), 29 deletions(-) --- a/kernel/livepatch/transition.c +++ b/kernel/livepatch/transition.c @@ -42,28 +42,6 @@ static void klp_transition_work_fn(struc static DECLARE_DELAYED_WORK(klp_transition_work, klp_transition_work_fn); /* - * This function is just a stub to implement a hard force - * of synchronize_rcu(). This requires synchronizing - * tasks even in userspace and idle. - */ -static void klp_sync(struct work_struct *work) -{ -} - -/* - * We allow to patch also functions where RCU is not watching, - * e.g. before ct_user_exit(). We can not rely on the RCU infrastructure - * to do the synchronization. Instead hard force the sched synchronization. - * - * This approach allows to use RCU functions for manipulating func_stack - * safely. - */ -static void klp_synchronize_transition(void) -{ - schedule_on_each_cpu(klp_sync); -} - -/* * The transition to the target patch state is complete. Clean up the data * structures. */ @@ -96,7 +74,7 @@ static void klp_complete_transition(void * func->transition gets cleared, the handler may choose a * removed function.
*/ - klp_synchronize_transition(); + synchronize_rcu(); } klp_for_each_object(klp_transition_patch, obj) @@ -105,7 +83,7 @@ static void klp_complete_transition(void /* Prevent klp_ftrace_handler() from seeing KLP_UNDEFINED state */ if (klp_target_state == KLP_PATCHED) - klp_synchronize_transition(); + synchronize_rcu(); read_lock(&tasklist_lock); for_each_process_thread(g, task) { @@ -168,10 +146,6 @@ noinstr void __klp_update_patch_state(st */ void klp_update_patch_state(struct task_struct *task) { - /* - * A variant of synchronize_rcu() is used to allow patching functions - * where RCU is not watching, see klp_synchronize_transition(). - */ preempt_disable_notrace(); /* @@ -626,7 +600,7 @@ void klp_reverse_transition(void) clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING); /* Let any remaining calls to klp_update_patch_state() complete */ - klp_synchronize_transition(); + synchronize_rcu(); klp_start_transition(); }
Use the new context_tracking infrastructure to avoid disturbing userspace tasks when we rewrite kernel code. XXX re-audit the entry code to make sure only the context_tracking static_branch is before hitting this code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/include/asm/sync_core.h | 2 ++ arch/x86/kernel/alternative.c | 8 +++++++- include/linux/context_tracking.h | 1 + kernel/context_tracking.c | 12 ++++++++++++ 4 files changed, 22 insertions(+), 1 deletion(-) --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -87,6 +87,8 @@ static inline void sync_core(void) */ iret_to_self(); } +#define sync_core sync_core + /* * Ensure that a core serializing instruction is issued before returning --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -924,9 +925,14 @@ static void do_sync_core(void *info) sync_core(); } +static bool do_sync_core_cond(int cpu, void *info) +{ + return !context_tracking_set_cpu_work(cpu, CT_WORK_SYNC); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + on_each_cpu_cond(do_sync_core_cond, do_sync_core, NULL, 1); } struct text_poke_loc { --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -11,6 +11,7 @@ enum ct_work { CT_WORK_KLP = 1, + CT_WORK_SYNC = 2, }; /* --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -51,6 +51,10 @@ static __always_inline void context_trac __this_cpu_dec(context_tracking.recursion); } +#ifndef sync_core +static inline void sync_core(void) { } +#endif + /* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */ static noinstr void ct_exit_user_work(struct context_tracking *ct) { @@ -64,6 +68,14 @@ static noinstr void 
ct_exit_user_work(struct arch_atomic_andnot(CT_WORK_KLP, &ct->work); } + if (work & CT_WORK_SYNC) { + /* NMI happens here and must still do/finish CT_WORK_n */ + sync_core(); + + smp_mb__before_atomic(); + arch_atomic_andnot(CT_WORK_SYNC, &ct->work); + } + smp_mb__before_atomic(); arch_atomic_andnot(CT_SEQ_WORK, &ct->seq); }
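The text_poke_sync() change above swaps on_each_cpu() for on_each_cpu_cond(): CPUs for which the cond callback returns true still get the IPI, while CPUs in nohz_full userspace get CT_WORK_SYNC queued and will sync_core() in ct_exit_user_work() on their next kernel entry. A sketch of that split (the per-CPU `in_user[]` array is a stand-in for the real context-tracking state):

```c
#include <stdbool.h>

#define NR_CPUS 4

static bool in_user[NR_CPUS] = { false, true, false, true };
static int ipis_sent, syncs_deferred;

/* Model of context_tracking_set_cpu_work(cpu, CT_WORK_SYNC). */
static bool try_defer_sync(int cpu)
{
	if (!in_user[cpu])
		return false;
	syncs_deferred++;
	return true;
}

static void do_sync_core(int cpu) { ipis_sent++; } /* the IPI handler */

/* Model of on_each_cpu_cond(do_sync_core_cond, do_sync_core, ...). */
static void text_poke_sync_model(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!try_defer_sync(cpu)) /* do_sync_core_cond() */
			do_sync_core(cpu);
	}
}
```

Either way every CPU serializes before executing patched text; the difference is only *when* the userspace CPUs pay for it.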
On Wed, Sep 29, 2021 at 05:17:23PM +0200, Peter Zijlstra wrote:
> Hi,
>
> One series in two parts; the first part (patches 1-5) should be in reasonable
> shape and ought to fix the original issues that got this whole thing started.
>
> The second part (patches 6-11) are still marked RFC and do scary things to try
> and make NOHZ_FULL work better.
>
> Please consider..
>
> PS. Paul, I did do break RCU a bit and I'm not sure about the best way to put
> it back together, please see patch 8 for details.
Hey, wait!!! Breaking RCU is -my- job!!! ;-)
Thanx, Paul
On Wed, Sep 29, 2021 at 05:17:31PM +0200, Peter Zijlstra wrote: > XXX I'm pretty sure I broke task-trace-rcu. > XXX trace_rcu_*() now gets an unconditional 0 > > Other than that, it seems like a fairly straight-forward replacement > of the RCU count with the context_tracking count. > > Using context-tracking for this avoids having two (expensive) atomic > ops on the entry paths where one will do. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> A few questions and exclamations interspersed. > --- > include/linux/context_tracking.h | 6 ++ > kernel/context_tracking.c | 14 +++++ > kernel/rcu/tree.c | 101 +++++---------------------------------- > kernel/rcu/tree.h | 1 > 4 files changed, 33 insertions(+), 89 deletions(-) > > --- a/include/linux/context_tracking.h > +++ b/include/linux/context_tracking.h > @@ -50,6 +50,9 @@ extern void context_tracking_user_exit(v > > extern bool context_tracking_set_cpu_work(unsigned int cpu, unsigned int work); > > +extern void context_tracking_idle(void); > +extern void context_tracking_online(void); > + > static inline void ct_user_enter(void) > { > if (context_tracking_enabled()) > @@ -162,6 +165,9 @@ static __always_inline unsigned int __co > return 0; > } > > +static inline void context_tracking_idle(void) { } > +static inline void context_tracking_online(void) { } > + > #endif /* !CONFIG_CONTEXT_TRACKING */ > > static __always_inline bool context_tracking_cpu_in_user(unsigned int cpu) > --- a/kernel/context_tracking.c > +++ b/kernel/context_tracking.c > @@ -281,6 +281,20 @@ void noinstr __context_tracking_nmi_exit > ct_seq_nmi_exit(raw_cpu_ptr(&context_tracking)); > } > > +void context_tracking_online(void) > +{ > + struct context_tracking *ct = raw_cpu_ptr(&context_tracking); > + unsigned int seq = atomic_read(&ct->seq); > + > + if (__context_tracking_seq_in_user(seq)) > + atomic_add_return(CT_SEQ - CT_SEQ_USER, &ct->seq); What if the CPU went offline marked idle instead of user? 
(Yes, that no longer happens, but checking my understanding.) The CT_SEQ_WORK bit means neither idle nor nohz_full user, correct? So let's see if I intuited the decoder ring, where "kernel" means that portion of the kernel that is non-noinstr...

    CT_SEQ_WORK  CT_SEQ_NMI  CT_SEQ_USER  Description?
    0            0           0            Idle or non-nohz_full user
    0            0           1            nohz_full user
    0            1           0            NMI from 0,0,0
    0            1           1            NMI from 0,0,1
    1            0           0            Non-idle kernel
    1            0           1            Cannot happen?
    1            1           0            NMI from 1,0,1
    1            1           1            NMI from cannot happen

And of course if a state cannot happen, Murphy says that you will take an NMI in that state. > +} > + > +void context_tracking_idle(void) > +{ > + atomic_add_return(CT_SEQ, &raw_cpu_ptr(&context_tracking)->seq); This is presumably a momentary idle. And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels? Of course, RCU needs it unconditionally. (There appear to be at least parts of it that are unconditionally available, but I figured that I should ask. Especially given the !CONFIG_CONTEXT_TRACKING definition of the __context_tracking_cpu_seq() function.) > +} > + > void __init context_tracking_cpu_set(int cpu) > { > static __initdata bool initialized = false; > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -62,6 +62,7 @@ > #include <linux/vmalloc.h> > #include <linux/mm.h> > #include <linux/kasan.h> > +#include <linux/context_tracking.h> > #include "../time/tick-internal.h" > > #include "tree.h" > @@ -77,7 +78,6 @@ > static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { > .dynticks_nesting = 1, > .dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE, > - .dynticks = ATOMIC_INIT(1), > #ifdef CONFIG_RCU_NOCB_CPU > .cblist.flags = SEGCBLIST_SOFTIRQ_ONLY, > #endif > @@ -252,56 +252,6 @@ void rcu_softirq_qs(void) > } > > /* > - * Increment the current CPU's rcu_data structure's ->dynticks field > - * with ordering. Return the new value.
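Paul's decoder ring is also expressible as a function. The bit assignments here are an assumption (CT_SEQ_USER=0x1, CT_SEQ_NMI=0x2, CT_SEQ_WORK=0x4 — the thread never shows the real values), and the descriptions are Paul's own guesses, question marks included:

```c
#include <string.h> /* for callers comparing descriptions */

/* Decode the low three state bits of a ct->seq value (assumed layout). */
static const char *ct_seq_decode(unsigned int seq)
{
	switch (seq & 0x7) {
	case 0x0: return "Idle or non-nohz_full user";
	case 0x1: return "nohz_full user";
	case 0x2: return "NMI from 0,0,0";
	case 0x3: return "NMI from 0,0,1";
	case 0x4: return "Non-idle kernel";
	case 0x5: return "Cannot happen?";
	case 0x6: return "NMI from 1,0,1";
	case 0x7: return "NMI from cannot happen";
	}
	return "unreachable";
}
```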
> - */ > -static noinline noinstr unsigned long rcu_dynticks_inc(int incby) > -{ > - return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks)); > -} > - > -/* > - * Record entry into an extended quiescent state. This is only to be > - * called when not already in an extended quiescent state, that is, > - * RCU is watching prior to the call to this function and is no longer > - * watching upon return. > - */ > -static noinstr void rcu_dynticks_eqs_enter(void) > -{ > - int seq; > - > - /* > - * CPUs seeing atomic_add_return() must see prior RCU read-side > - * critical sections, and we also must force ordering with the > - * next idle sojourn. > - */ > - rcu_dynticks_task_trace_enter(); // Before ->dynticks update! > - seq = rcu_dynticks_inc(1); > - // RCU is no longer watching. Better be in extended quiescent state! > - WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1)); > -} > - > -/* > - * Record exit from an extended quiescent state. This is only to be > - * called from an extended quiescent state, that is, RCU is not watching > - * prior to the call to this function and is watching upon return. > - */ > -static noinstr void rcu_dynticks_eqs_exit(void) > -{ > - int seq; > - > - /* > - * CPUs seeing atomic_add_return() must see prior idle sojourns, > - * and we also must force ordering with the next RCU read-side > - * critical section. > - */ > - seq = rcu_dynticks_inc(1); > - // RCU is now watching. Better not be in an extended quiescent state! > - rcu_dynticks_task_trace_exit(); // After ->dynticks update! > - WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1)); > -} > - > -/* > * Reset the current CPU's ->dynticks counter to indicate that the > * newly onlined CPU is no longer in an extended quiescent state. 
> * This will either leave the counter unchanged, or increment it > @@ -313,11 +263,7 @@ static noinstr void rcu_dynticks_eqs_exi > */ > static void rcu_dynticks_eqs_online(void) > { > - struct rcu_data *rdp = this_cpu_ptr(&rcu_data); > - > - if (atomic_read(&rdp->dynticks) & 0x1) > - return; > - rcu_dynticks_inc(1); > + context_tracking_online(); > } > > /* > @@ -327,7 +273,7 @@ static void rcu_dynticks_eqs_online(void > */ > static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void) > { > - return !(atomic_read(this_cpu_ptr(&rcu_data.dynticks)) & 0x1); > + return context_tracking_cpu_in_user(smp_processor_id()); This needs to return true in any type of extended quiescent state, not just nohz_full user execution. > } > > /* > @@ -337,7 +283,7 @@ static __always_inline bool rcu_dynticks > static int rcu_dynticks_snap(struct rcu_data *rdp) > { > smp_mb(); // Fundamental RCU ordering guarantee. > - return atomic_read_acquire(&rdp->dynticks); > + return __context_tracking_cpu_seq(rdp->cpu); This will most definitely break RCU in !CONFIG_CONTEXT_TRACKING kernels. It will always return zero. Albeit fully ordered. Might this be why RCU Tasks Trace is unhappy? Similar issues for later uses of __context_tracking_cpu_seq(). > } > > /* > @@ -346,7 +292,7 @@ static int rcu_dynticks_snap(struct rcu_ > */ > static bool rcu_dynticks_in_eqs(int snap) > { > - return !(snap & 0x1); > + return __context_tracking_seq_in_user(snap); > } > > /* Return true if the specified CPU is currently idle from an RCU viewpoint. */ > @@ -377,7 +323,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i > int snap; > > // If not quiescent, force back to earlier extended quiescent state. > - snap = atomic_read(&rdp->dynticks) & ~0x1; > + snap = __context_tracking_cpu_seq(rdp->cpu) & ~0x7; > > smp_rmb(); // Order ->dynticks and *vp reads. > if (READ_ONCE(*vp)) > @@ -385,7 +331,7 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i > smp_rmb(); // Order *vp read and ->dynticks re-read. 
> 	// If still in the same extended quiescent state, we are good!
> -	return snap == atomic_read(&rdp->dynticks);
> +	return snap == __context_tracking_cpu_seq(rdp->cpu);
>  }
> 
>  /*
> @@ -401,12 +347,8 @@ bool rcu_dynticks_zero_in_eqs(int cpu, i
>   */
>  notrace void rcu_momentary_dyntick_idle(void)
>  {
> -	int seq;
> -
>  	raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
> -	seq = rcu_dynticks_inc(2);
> -	/* It is illegal to call this from idle state. */
> -	WARN_ON_ONCE(!(seq & 0x1));
> +	context_tracking_idle();
>  	rcu_preempt_deferred_qs(current);
>  }
>  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> @@ -622,18 +564,15 @@ static noinstr void rcu_eqs_enter(bool u
> 
>  	lockdep_assert_irqs_disabled();
>  	instrumentation_begin();
> -	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
> +	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, 0);

Why not get the value from __context_tracking_cpu_seq()?  Ditto for
similar changes later on.

>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	rcu_prepare_for_idle();
>  	rcu_preempt_deferred_qs(current);
> 
> -	// instrumentation for the noinstr rcu_dynticks_eqs_enter()
> -	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
> 
>  	instrumentation_end();
>  	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
>  	// RCU is watching here ...
> -	rcu_dynticks_eqs_enter();
>  	// ... but is no longer watching here.

OK, at the very least, these two comments are now nonsense. ;-)

Is the intent to drive this from context tracking, so that these
functions only push the various required RCU state transitions?

Same for the similar changes below.

Anyway, unless I am missing something subtle (quite possible given that
I haven't looked closely at context tracking), you did succeed in
breaking RCU! ;-)

>  	rcu_dynticks_task_enter();
>  }
> @@ -756,8 +695,7 @@ noinstr void rcu_nmi_exit(void)
>  	 * leave it in non-RCU-idle state.
>  	 */
>  	if (rdp->dynticks_nmi_nesting != 1) {
> -		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
> -				  atomic_read(&rdp->dynticks));
> +		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, 0);
>  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
>  			   rdp->dynticks_nmi_nesting - 2);
>  		instrumentation_end();
> @@ -765,18 +703,15 @@ noinstr void rcu_nmi_exit(void)
>  	}
> 
>  	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> -	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
> +	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, 0);
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
> 
>  	if (!in_nmi())
>  		rcu_prepare_for_idle();
> 
> -	// instrumentation for the noinstr rcu_dynticks_eqs_enter()
> -	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
>  	instrumentation_end();
> 
>  	// RCU is watching here ...
> -	rcu_dynticks_eqs_enter();
>  	// ... but is no longer watching here.
> 
>  	if (!in_nmi())
> @@ -865,15 +800,11 @@ static void noinstr rcu_eqs_exit(bool us
>  	}
>  	rcu_dynticks_task_exit();
>  	// RCU is not watching here ...
> -	rcu_dynticks_eqs_exit();
>  	// ... but is watching here.
>  	instrumentation_begin();
> 
> -	// instrumentation for the noinstr rcu_dynticks_eqs_exit()
> -	instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
> -
>  	rcu_cleanup_after_idle();
> -	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
> +	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, 0);
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	WRITE_ONCE(rdp->dynticks_nesting, 1);
>  	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
> @@ -1011,7 +942,6 @@ noinstr void rcu_nmi_enter(void)
>  		rcu_dynticks_task_exit();
> 
>  		// RCU is not watching here ...
> -		rcu_dynticks_eqs_exit();
>  		// ... but is watching here.
> 
>  		if (!in_nmi()) {
> @@ -1021,11 +951,6 @@ noinstr void rcu_nmi_enter(void)
>  		}
> 
>  		instrumentation_begin();
> -		// instrumentation for the noinstr rcu_dynticks_curr_cpu_in_eqs()
> -		instrument_atomic_read(&rdp->dynticks, sizeof(rdp->dynticks));
> -		// instrumentation for the noinstr rcu_dynticks_eqs_exit()
> -		instrument_atomic_write(&rdp->dynticks, sizeof(rdp->dynticks));
> -
>  		incby = 1;
>  	} else if (!in_nmi()) {
>  		instrumentation_begin();
> @@ -1036,7 +961,7 @@ noinstr void rcu_nmi_enter(void)
> 
>  	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
>  			  rdp->dynticks_nmi_nesting,
> -			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
> +			  rdp->dynticks_nmi_nesting + incby, 0);
>  	instrumentation_end();
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
>  		   rdp->dynticks_nmi_nesting + incby);
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -184,7 +184,6 @@ struct rcu_data {
>  	int dynticks_snap;		/* Per-GP tracking for dynticks. */
>  	long dynticks_nesting;		/* Track process nesting level. */
>  	long dynticks_nmi_nesting;	/* Track irq/NMI nesting level. */
> -	atomic_t dynticks;		/* Even value for idle, else odd. */
>  	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
>  	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
>  	bool rcu_forced_tick;		/* Forced tick to provide QS. */
> 
On Wed, Sep 29, 2021 at 05:17:31PM +0200, Peter Zijlstra wrote:

> XXX I'm pretty sure I broke task-trace-rcu.

> -static noinstr void rcu_dynticks_eqs_enter(void)
> -{
> -	int seq;
> -
> -	/*
> -	 * CPUs seeing atomic_add_return() must see prior RCU read-side
> -	 * critical sections, and we also must force ordering with the
> -	 * next idle sojourn.
> -	 */
> -	rcu_dynticks_task_trace_enter();  // Before ->dynticks update!
> -	seq = rcu_dynticks_inc(1);
> -	// RCU is no longer watching.  Better be in extended quiescent state!
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & 0x1));
> -}

> -static noinstr void rcu_dynticks_eqs_exit(void)
> -{
> -	int seq;
> -
> -	/*
> -	 * CPUs seeing atomic_add_return() must see prior idle sojourns,
> -	 * and we also must force ordering with the next RCU read-side
> -	 * critical section.
> -	 */
> -	seq = rcu_dynticks_inc(1);
> -	// RCU is now watching.  Better not be in an extended quiescent state!
> -	rcu_dynticks_task_trace_exit();  // After ->dynticks update!
> -	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & 0x1));
> -}

So specifically rcu_dynticks_task_trace_{enter,exit}() are now orphaned.
After this patch, nothing calls them.

However, looking at this again, we've got:

  __context_tracking_enter()
    rcu_user_enter()
      rcu_eqs_enter()
        rcu_dynticks_eqs_enter()
          rcu_dynticks_task_trace_enter()
          rcu_dynticks_inc()
        rcu_dynticks_task_enter()
    ct_seq_user_enter()
      atomic_add_return()

and on the other end:

  __context_tracking_exit()
    ct_seq_user_exit()
      atomic_add_return()
    rcu_user_exit()
      rcu_eqs_exit()
        rcu_dynticks_task_exit()
        rcu_dynticks_eqs_exit()
          rcu_dynticks_inc()
          rcu_dynticks_task_trace_exit()

And since we want to replace dynticks_inc() with ct_seq_*(), the
rcu_dynticks_task_{enter,exit}() ought to be pulled before that.
Instead I orphaned rcu_dynticks_task_trace_{enter,exit}(), which should
more or less stay where they are. I seem to have confused the two :/
On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 29, 2021 at 05:17:31PM +0200, Peter Zijlstra wrote:

> The CT_SEQ_WORK bit means neither idle nor nohz_full user, correct?

That bit is indeed independent, it can get set remotely by cmpxchg when
in user/eqs state and will be tested and (eventually) cleared when
leaving user/eqs state.

> So let's see if I intuited the decoder ring, where "kernel" means that
> portion of the kernel that is non-noinstr...
>
> CT_SEQ_WORK  CT_SEQ_NMI  CT_SEQ_USER  Description?
>      0           0            0       Idle or non-nohz_full user
>      0           0            1       nohz_full user
>      0           1            0       NMI from 0,0,0
>      0           1            1       NMI from 0,0,1
>      1           0            0       Non-idle kernel
>      1           0            1       Cannot happen?
>      1           1            0       NMI from 1,0,1
>      1           1            1       NMI from cannot happen
>
> And of course if a state cannot happen, Murphy says that you will take
> an NMI in that state.

Urgh, you have the bits the 'wrong' way around :-)

  MSB                  3210
   |-------------------||||

Where [MSB-3] is the actual sequence number, [21] is the state and [0]
is the work-pending bit.

The table for [21] is like:

  USER NMI
   0    0    kernel
   0    1    kernel took nmi
   1    0    user
   1    1    user took nmi

So effectively EQS is 10 only, the other 3 states are kernel.
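Peter's decoder ring can be sanity-checked with a small userspace sketch. The macro values below are assumptions consistent with his description (bit 0 work, bit 1 NMI, bit 2 user, sequence from bit 3 up), not the actual kernel definitions:

```c
#include <stdbool.h>

/*
 * Assumed encoding, following the description above (illustrative
 * values, not the kernel's):
 *   bit 0    - CT_SEQ_WORK  (work pending, independent of state)
 *   bit 1    - CT_SEQ_NMI   (took an NMI)
 *   bit 2    - CT_SEQ_USER  (user mode)
 *   bits 3+  - the actual sequence number
 */
#define CT_SEQ_WORK 0x1u
#define CT_SEQ_NMI  0x2u
#define CT_SEQ_USER 0x4u
#define CT_SEQ      0x8u	/* one full sequence increment */

/*
 * EQS is only the USER:NMI == 10 state; the other three combinations
 * are some flavour of kernel. The work bit does not affect EQS-ness.
 */
static bool ct_seq_in_eqs(unsigned int seq)
{
	return (seq & (CT_SEQ_USER | CT_SEQ_NMI)) == CT_SEQ_USER;
}
```

This makes it easy to see why an NMI taken in user mode (USER:NMI == 11) counts as kernel, matching the "user took nmi" row in the table.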
On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> > +void context_tracking_idle(void)
> > +{
> > +	atomic_add_return(CT_SEQ, &raw_cpu_ptr(&context_tracking)->seq);
>
> This is presumably a momentary idle.
>
> >  notrace void rcu_momentary_dyntick_idle(void)
> >  {
> > -	int seq;
> > -
> >  	raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
> > -	seq = rcu_dynticks_inc(2);
> > -	/* It is illegal to call this from idle state. */
> > -	WARN_ON_ONCE(!(seq & 0x1));
> > +	context_tracking_idle();
> >  	rcu_preempt_deferred_qs(current);
> >  }

It's whatever that is. It increments the actual sequence count without
modifying the state.
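The effect Peter describes, advancing the sequence without touching the state bits so that a remote snapshot comparison fails, can be sketched in userspace with C11 atomics (the CT_SEQ value is an assumption; all names here are mockups, not the kernel API):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CT_SEQ 0x8u	/* assumed increment unit; low bits hold state/work */

static atomic_uint ct_seq;	/* stand-in for the per-CPU context_tracking.seq */

/*
 * rcu_momentary_dyntick_idle() boils down to this: bump the counter so
 * that "snap == __context_tracking_cpu_seq(cpu)" no longer holds and
 * the grace-period machinery sees that a quiescent state went by.
 */
static void context_tracking_idle_mock(void)
{
	atomic_fetch_add(&ct_seq, CT_SEQ);
}

/* The remote check from rcu_dynticks_zero_in_eqs(), per the hunk above. */
static bool still_in_same_eqs(unsigned int snap)
{
	return snap == atomic_load(&ct_seq);
}
```

Note that only the sequence part changes; the low state/work bits are preserved by the add, which is the whole point of the "without modifying the state" remark.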
On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
> Of course, RCU needs it unconditionally. (There appear to be at least
> parts of it that are unconditionally available, but I figured that I
> should ask. Especially given the !CONFIG_CONTEXT_TRACKING definition
> of the __context_tracking_cpu_seq() function.)
For !CONFIG_CONTEXT_TRACKING it goes *poof*.
Since the thing was called dynticks, I presumed it was actually dynticks
only, silly me (also, I didn't see any obvious !context_tracking usage
of it, I'll go audit it more carefully).
On Wed, Sep 29, 2021 at 09:13:26PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
>
> > And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
> > Of course, RCU needs it unconditionally. (There appear to be at least
> > parts of it that are unconditionally available, but I figured that I
> > should ask. Especially given the !CONFIG_CONTEXT_TRACKING definition
> > of the __context_tracking_cpu_seq() function.)
>
> For !CONFIG_CONTEXT_TRACKING it goes *poof*.
>
> Since the thing was called dynticks, I presumed it was actually dynticks
> only, silly me (also, I didn't see any obvious !context_tracking usage
> of it, I'll go audit it more carefully).
Oh argh, it does idle too... damn. And I don't suppose having 2 counters
is going to be nice :/
I'll go back to thinking about this.
On Wed, Sep 29, 2021 at 09:24:31PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 29, 2021 at 09:13:26PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 29, 2021 at 11:37:01AM -0700, Paul E. McKenney wrote:
> >
> > > And what happens to all of this in !CONFIG_CONTEXT_TRACKING kernels?
> > > Of course, RCU needs it unconditionally. (There appear to be at least
> > > parts of it that are unconditionally available, but I figured that I
> > > should ask. Especially given the !CONFIG_CONTEXT_TRACKING definition
> > > of the __context_tracking_cpu_seq() function.)
> >
> > For !CONFIG_CONTEXT_TRACKING it goes *poof*.
> >
> > Since the thing was called dynticks, I presumed it was actually dynticks
> > only, silly me (also, I didn't see any obvious !context_tracking usage
> > of it, I'll go audit it more carefully).
>
> Oh argh, it does idle too... damn. And I don't suppose having 2 counters
> is going to be nice :/
>
> I'll go back to thinking about this.
Glad I could help? For some definition of "help"? ;-)
Thanx, Paul
On Wed 2021-09-29 17:17:26, Peter Zijlstra wrote:
> Instead of frobbing around with scheduler internals, use the shiny new
> task_call_func() interface.
>
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -274,6 +266,22 @@ static int klp_check_stack(struct task_s
>  	return 0;
>  }
>
> +static int klp_check_and_switch_task(struct task_struct *task, void *arg)
> +{
> +	int ret;
> +
> +	if (task_curr(task))

This must be

	if (task_curr(task) && task != current)

Otherwise, the task is not able to migrate itself. The condition was
lost when reshuffling the original code, see below.

JFYI, I have missed it during review. I am actually surprised that the
process could check its own stack reliably. But it seems to work.

I found the problem when the "busy target module" selftest failed. It
was not able to livepatch the workqueue worker that was processing
klp_transition_work_fn().

> +		return -EBUSY;
> +
> +	ret = klp_check_stack(task, arg);
> +	if (ret)
> +		return ret;
> +
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +	task->patch_state = klp_target_state;
> +	return 0;
> +}
> +
>  /*
>   * Try to safely switch a task to the target patch state. If it's currently
>   * running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
> @@ -305,36 +308,31 @@ static bool klp_try_switch_task(struct t
>  	 * functions. If all goes well, switch the task to the target patch
>  	 * state.
>  	 */
> -	rq = task_rq_lock(task, &flags);
> +	ret = task_call_func(task, klp_check_and_switch_task, &old_name);

It looks correct. JFYI, this is why:

The logic seems to be exactly the same, except for the one fallout
mentioned above. So the only problem might be races.

The only important thing is that the task must not be running on any
CPU when klp_check_stack(task, arg) is called. In other words, the
stack must stay the same while being checked.

The original code prevented races by taking task_rq_lock().
And task_call_func() is slightly more relaxed, but it looks safe enough:

  + it still takes the rq lock when the task is in a runnable state.

  + it always takes p->pi_lock, which prevents moving the task
    into a runnable state by try_to_wake_up().

> +	switch (ret) {
> +	case 0:		/* success */
> +		break;
>
> -	if (task_running(rq, task) && task != current) {

This is the original code that checked (task != current).

> -		snprintf(err_buf, STACK_ERR_BUF_SIZE,
> -			 "%s: %s:%d is running\n", __func__, task->comm,
> -			 task->pid);
> -		goto done;

With the added (task != current) check:

Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr
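The fix Petr asks for can be illustrated with a small userspace mockup of the callback. Everything here is a stand-in for the kernel types and helpers (the real code uses task_struct, task_curr(), klp_check_stack() and so on); the point is only the extra "task != current" clause, without which a task could never switch itself:

```c
#include <stdbool.h>

#define EBUSY 16	/* stand-in for the kernel errno value */

/* Minimal mock of the fields the check cares about. */
struct task {
	bool on_cpu;		/* what task_curr() reports */
	bool patch_pending;	/* TIF_PATCH_PENDING */
	int  patch_state;
};

static int klp_target_state = 1;	/* assumed target, e.g. KLP_PATCHED */

/*
 * Mirrors the corrected klp_check_and_switch_task(): a task running on
 * some CPU cannot have its stack checked reliably -- unless it is the
 * caller itself, which may safely unwind its own stack.
 */
static int check_and_switch(struct task *task, struct task *self)
{
	if (task->on_cpu && task != self)
		return -EBUSY;

	/* klp_check_stack() elided for this sketch */

	task->patch_state = klp_target_state;
	task->patch_pending = false;
	return 0;
}
```

With only `if (task->on_cpu)`, the `task == self` case (the transition worker switching itself) would always bail out with -EBUSY, which is exactly the "busy target module" selftest failure described above.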
On Wed 2021-09-29 17:17:28, Peter Zijlstra wrote:
> Make sure to prod idle CPUs so they call klp_update_patch_state().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Thanks a lot for updating this API for livepatching.
Best Regards,
Petr
On Tue, Oct 05, 2021 at 01:40:24PM +0200, Petr Mladek wrote:
> On Wed 2021-09-29 17:17:26, Peter Zijlstra wrote:
> > Instead of frobbing around with scheduler internals, use the shiny new
> > task_call_func() interface.
> >
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -274,6 +266,22 @@ static int klp_check_stack(struct task_s
> >  	return 0;
> >  }
> >
> > +static int klp_check_and_switch_task(struct task_struct *task, void *arg)
> > +{
> > +	int ret;
> > +
> > +	if (task_curr(task))
>
> This must be
>
> 	if (task_curr(task) && task != current)
>
> Otherwise, the task is not able to migrate itself. The condition was
> lost when reshuffling the original code, see below.

Urgh, yeah, I misread that and figured task_curr() should already
capture current, but the extra clause excludes current :/

> JFYI, I have missed it during review. I am actually surprised that the
> process could check its own stack reliably. But it seems to work.

Ah, unwinding yourself is actually the only sane option ;-)

> > -	rq = task_rq_lock(task, &flags);
> > +	ret = task_call_func(task, klp_check_and_switch_task, &old_name);
>
> It looks correct. JFYI, this is why:
>
> The logic seems to be exactly the same, except for the one fallout
> mentioned above. So the only problem might be races.
>
> The only important thing is that the task must not be running on any
> CPU when klp_check_stack(task, arg) is called. In other words, the
> stack must stay the same while being checked.
>
> The original code prevented races by taking task_rq_lock().
> And task_call_func() is slightly more relaxed, but it looks safe enough:
>
>   + it still takes the rq lock when the task is in a runnable state.
>
>   + it always takes p->pi_lock, which prevents moving the task
>     into a runnable state by try_to_wake_up().

Correct, the new task_call_func() is trying hard to not take rq->lock,
but should be effectively identical to task_rq_lock().

> With the added (task != current) check:

Done

> Reviewed-by: Petr Mladek <pmladek@suse.com>
> Tested-by: Petr Mladek <pmladek@suse.com>

Thanks!
On Wed 2021-09-29 17:17:32, Peter Zijlstra wrote:
> Using the new context_tracking infrastructure, avoid disturbing
> userspace tasks when context tracking is enabled.
>
> When context_tracking_set_cpu_work() returns true, we have the
> guarantee that klp_update_patch_state() is called from noinstr code
> before it runs normal kernel code. This covers
> syscall/exceptions/interrupts and NMI entry.

This patch touches the most tricky (lockless) parts of the livepatch
code. I always have to refresh my head about all the dependencies.

Sigh, I guess that the livepatch code looks over-complicated to you.
The main problem is that we want to migrate tasks only when they are
not inside any livepatched function. It allows making semantic changes,
which is needed by some sorts of critical security fixes.

> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -55,15 +56,13 @@ static noinstr void ct_exit_user_work(struct
>  {
>  	unsigned int work = arch_atomic_read(&ct->work);
>
> -#if 0
> -	if (work & CT_WORK_n) {
> +	if (work & CT_WORK_KLP) {
>  		/* NMI happens here and must still do/finish CT_WORK_n */
> -		do_work_n();
> +		__klp_update_patch_state(current);
>
>  		smp_mb__before_atomic();
> -		arch_atomic_andnot(CT_WORK_n, &ct->work);
> +		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
>  	}
> -#endif
>
>  	smp_mb__before_atomic();
>  	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);

> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -153,6 +154,11 @@ void klp_cancel_transition(void)
>  	klp_complete_transition();
>  }
>
> +noinstr void __klp_update_patch_state(struct task_struct *task)
> +{
> +	task->patch_state = READ_ONCE(klp_target_state);
> +}
> +
>  /*
>   * Switch the patched state of the task to the set of functions in the target
>   * patch state.
> @@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_
>  	 * of func->transition, if klp_ftrace_handler() is called later on
>  	 * the same CPU. See __klp_disable_patch().
>  	 */
> -	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> +	if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) {

This would require smp_rmb() here. It will make sure that we will
read the right @klp_target_state when TIF_PATCH_PENDING is set,
where @klp_target_state is set in klp_init_transition()
and TIF_PATCH_PENDING is set in klp_start_transition().

There are actually two related smp_wmb() barriers between these two
assignments:

  1st in klp_init_transition()
  2nd in __klp_enable_patch()

One would be enough for klp_update_patch_state(). But we need both
for klp_ftrace_handler(), see the smp_rmb() there. In particular,
they synchronize:

  + ops->func_stack vs.
  + func->transition vs.
  + current->patch_state

> 		task->patch_state = READ_ONCE(klp_target_state);

Note that smp_wmb() is not needed here because
klp_complete_transition() calls klp_synchronize_transition(),
aka synchronize_rcu(), before clearing klp_target_state.
This is why the original code worked.

> +		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +	}
>
>  	preempt_enable_notrace();
>  }

> @@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str
>  {
>  	int ret;
>
> -	if (task_curr(task))
> +	if (task_curr(task)) {
> +		/*
> +		 * This only succeeds when the task is in NOHZ_FULL user
> +		 * mode, the true return value guarantees any kernel entry
> +		 * will call klp_update_patch_state().
> +		 *
> +		 * XXX: ideally we'd simply return 0 here and leave
> +		 * TIF_PATCH_PENDING alone, to be fixed up by
> +		 * klp_update_patch_state(), except livepatching goes wobbly
> +		 * with 'pending' TIF bits on.
> +		 */
> +		if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP))
> +			goto clear;

If I get it correctly, this will clear TIF_PATCH_PENDING immediately,
but task->patch_state = READ_ONCE(klp_target_state) will be
done later by ct_exit_user_work().

This is a bit problematic:

  1. The global @klp_target_state is set to KLP_UNDEFINED when all
     processes have TIF_PATCH_PENDING cleared.

     This is actually still fine because func->transition is cleared
     as well. As a result, current->patch_state is ignored in
     klp_ftrace_handler().

  2. The real problem happens when another livepatch is enabled.
     The global @klp_target_state is set to a new value and
     func->transition is set again. In this case, the delayed
     ct_exit_user_work() might assign a wrong value that might
     really be used by klp_ftrace_handler().

IMHO, the original solution from v1 was better. We only needed to be
careful when updating task->patch_state and clearing TIF_PATCH_PENDING
to avoid the race.

The following might work:

static int klp_check_and_switch_task(struct task_struct *task, void *arg)
{
	int ret;

	/*
	 * Stack is reliable only when the task is not running on any CPU,
	 * except for the task running this code.
	 */
	if (task_curr(task) && task != current) {
		/*
		 * This only succeeds when the task is in NOHZ_FULL user
		 * mode. Such a task might be migrated immediately. We
		 * only need to be careful to set task->patch_state before
		 * clearing TIF_PATCH_PENDING so that the task migrates
		 * itself when entering the kernel in the meantime.
		 */
		if (is_ct_user(task)) {
			klp_update_patch_state(task);
			return 0;
		}

		return -EBUSY;
	}

	ret = klp_check_stack(task, arg);
	if (ret)
		return ret;

	/*
	 * The task is neither running on any CPU nor can it become
	 * runnable. As a result, the ordering is not important and
	 * a barrier is not needed.
	 */
	task->patch_state = klp_target_state;
	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);

	return 0;
}

, where is_ct_user(task) would return true when the task is running in
CONTEXT_USER. If I get the context_tracking API correctly, then it
might be implemented the following way:

#ifdef CONFIG_CONTEXT_TRACKING

/*
 * XXX: The value is reliable depending on the context where this is
 * called. At least migrating between CPUs should get prevented.
 */
static __always_inline bool is_ct_user(struct task_struct *task)
{
	int seq;

	if (!context_tracking_enabled())
		return false;

	seq = __context_tracking_cpu_seq(task_cpu(task));
	return __context_tracking_seq_in_user(seq);
}

#else

static __always_inline bool is_ct_user(struct task_struct *task)
{
	return false;
}

#endif /* CONFIG_CONTEXT_TRACKING */

Best Regards,
Petr

> 		return -EBUSY;
> +	}
>
>  	ret = klp_check_stack(task, arg);
>  	if (ret)
>  		return ret;
>
> -	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	task->patch_state = klp_target_state;
> +clear:
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	return 0;
>  }
On Wed, 29 Sep 2021, Peter Zijlstra wrote:
> Instead of frobbing around with scheduler internals, use the shiny new
> task_call_func() interface.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
This looks really nice. With the added check for "task != current"
that Petr pointed out
Acked-by: Miroslav Benes <mbenes@suse.cz>
M
On Wed, Oct 06, 2021 at 10:12:17AM +0200, Petr Mladek wrote: > > @@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_ > > * of func->transition, if klp_ftrace_handler() is called later on > > * the same CPU. See __klp_disable_patch(). > > */ > > - if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING)) > > + if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) { > > This would require smp_rmb() here. It will make sure that we will > read the right @klp_target_state when TIF_PATCH_PENDING is set. Bah, yes. > > task->patch_state = READ_ONCE(klp_target_state); > > Note that smp_wmb() is not needed here because > klp_complete_transition() calls klp_synchronize_transition() > aka synchronize_rcu() before clearing klp_target_state. > This is why the original code worked. > > > > + clear_tsk_thread_flag(task, TIF_PATCH_PENDING); > > + } > > > > preempt_enable_notrace(); > > } > > @@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str > > { > > int ret; > > > > - if (task_curr(task)) > > + if (task_curr(task)) { > > + /* > > + * This only succeeds when the task is in NOHZ_FULL user > > + * mode, the true return value guarantees any kernel entry > > + * will call klp_update_patch_state(). > > + * > > + * XXX: ideally we'd simply return 0 here and leave > > + * TIF_PATCH_PENDING alone, to be fixed up by > > + * klp_update_patch_state(), except livepatching goes wobbly > > + * with 'pending' TIF bits on. > > + */ > > + if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP)) > > + goto clear; > > If I get it correctly, this will clear TIF_PATCH_PENDING immediately > but task->patch_state = READ_ONCE(klp_target_state) will be > done later by ct_exit_user_work(). > > This is a bit problematic: > > 1. The global @klp_target_state is set to KLP_UNDEFINED when all > processes have TIF_PATCH_PENDING is cleared. This is actually > still fine because func->transition is cleared as well. > As a result, current->patch_state is ignored in klp_ftrace_handler. 
> > 2. The real problem happens when another livepatch is enabled. > The global @klp_target_state is set to new value and > func->transition is set again. In this case, the delayed > ct_exit_user_work() might assign wrong value that might > really be used by klp_ftrace_handler(). Urgghh.. totally missed that. > IMHO, the original solution from v1 was better. We only needed to It was also terribly broken in other 'fun' ways. See below. > be careful when updating task->patch_state and clearing > TIF_PATCH_PENDING to avoid the race. > > The following might work: > > static int klp_check_and_switch_task(struct task_struct *task, void *arg) > { > int ret; > > /* > * Stack is reliable only when the task is not running on any CPU, > * except for the task running this code. > */ > if (task_curr(task) && task != current) { > /* > * This only succeeds when the task is in NOHZ_FULL user > * mode. Such a task might be migrated immediately. We > * only need to be careful to set task->patch_state before > * clearing TIF_PATCH_PENDING so that the task migrates > * itself when entring kernel in the meatime. > */ > if (is_ct_user(task)) { > klp_update_patch_state(task); > return 0; > } > > return -EBUSY; > } > > ret = klp_check_stack(task, arg); > if (ret) > return ret; > > /* > * The task neither is running on any CPU and nor it can get > * running. As a result, the ordering is not important and > * barrier is not needed. > */ > task->patch_state = klp_target_state; > clear_tsk_thread_flag(task, TIF_PATCH_PENDING); > > return 0; > } > > , where is_ct_user(task) would return true when task is running in > CONTEXT_USER. 
If I get the context_tracking API correctly then > it might be implemeted the following way: That's not sufficient, you need to tag the remote task with a ct_work item to also runs klp_update_patch_state(), otherwise the remote CPU can enter kernel space between checking is_ct_user() and doing klp_update_patch_state(): CPU0 CPU1 <user> if (is_ct_user()) // true <kernel-entry> // run some kernel code klp_update_patch_state() *WHOOPSIE* So it needs to be something like: CPU0 CPU1 <user> if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP)) <kernel-entry> klp_update_patch_state klp_update_patch_state() So that CPU0 and CPU1 race to complete klp_update_patch_state() *before* any regular (!noinstr) code gets run. Which then means it needs to look something like: noinstr void klp_update_patch_state(struct task_struct *task) { struct thread_info *ti = task_thread_info(task); preempt_disable_notrace(); if (arch_test_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags)) { /* * Order loads of TIF_PATCH_PENDING vs klp_target_state. * See klp_init_transition(). */ smp_rmb(); task->patch_state = __READ_ONCE(klp_target_state); /* * Concurrent against self; must observe updated * task->patch_state if !TIF_PATCH_PENDING. */ smp_mb__before_atomic(); arch_clear_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags); } else { /* * Concurrent against self, see smp_mb__before_atomic() * above. */ smp_rmb(); } preempt_enable_notrace(); }
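Peter's proposed helper, test the flag, copy the target state, then clear the flag, can be mocked up in userspace to check the single-thread invariants. C11 atomics stand in for the arch_* bitops, and the barriers themselves obviously cannot be exercised this way; this is a sketch of the structure, not the kernel code:

```c
#include <stdatomic.h>

#define TIF_PATCH_PENDING 0x1u

static atomic_uint ti_flags;		/* mock thread_info flags */
static int klp_target_state;
static int task_patch_state;

/*
 * Same shape as the proposed noinstr klp_update_patch_state(): both
 * the remote switcher and the task itself may run this, racing to
 * complete. Whoever sees the flag copies the target state *before*
 * clearing the flag, so an observer of !TIF_PATCH_PENDING is
 * guaranteed an up-to-date patch_state.
 */
static void update_patch_state_mock(void)
{
	if (atomic_load(&ti_flags) & TIF_PATCH_PENDING) {
		/* kernel: smp_rmb(), order flag load vs klp_target_state */
		task_patch_state = klp_target_state;
		/* kernel: smp_mb__before_atomic(), publish patch_state first */
		atomic_fetch_and(&ti_flags, ~TIF_PATCH_PENDING);
	}
	/* else: kernel smp_rmb() to observe the peer's patch_state store */
}
```

The important property, visible even single-threaded, is that a second (racing) invocation after the flag is cleared is a harmless no-op rather than a re-read of a possibly changed klp_target_state.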
On Wed, 29 Sep 2021, Peter Zijlstra wrote:
> Make sure to prod idle CPUs so they call klp_update_patch_state().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/livepatch/transition.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
> for_each_possible_cpu(cpu) {
> task = idle_task(cpu);
> if (cpu_online(cpu)) {
> - if (!klp_try_switch_task(task))
> + if (!klp_try_switch_task(task)) {
> complete = false;
> + /* Make idle task go through the main loop. */
> + wake_up_if_idle(cpu);
> + }
Right, it should be enough.
Acked-by: Miroslav Benes <mbenes@suse.cz>
It would be nice to get Vasily's Tested-by tag on this one.
M
On Wed 2021-10-06 11:04:26, Peter Zijlstra wrote: > On Wed, Oct 06, 2021 at 10:12:17AM +0200, Petr Mladek wrote: > > IMHO, the original solution from v1 was better. We only needed to > > It was also terribly broken in other 'fun' ways. See below. > > > be careful when updating task->patch_state and clearing > > TIF_PATCH_PENDING to avoid the race. > > > > The following might work: > > > > static int klp_check_and_switch_task(struct task_struct *task, void *arg) > > { > > int ret; > > > > /* > > * Stack is reliable only when the task is not running on any CPU, > > * except for the task running this code. > > */ > > if (task_curr(task) && task != current) { > > /* > > * This only succeeds when the task is in NOHZ_FULL user > > * mode. Such a task might be migrated immediately. We > > * only need to be careful to set task->patch_state before > > * clearing TIF_PATCH_PENDING so that the task migrates > > * itself when entring kernel in the meatime. > > */ > > if (is_ct_user(task)) { > > klp_update_patch_state(task); > > return 0; > > } > > > > return -EBUSY; > > } > > > > ret = klp_check_stack(task, arg); > > if (ret) > > return ret; > > > > /* > > * The task neither is running on any CPU and nor it can get > > * running. As a result, the ordering is not important and > > * barrier is not needed. > > */ > > task->patch_state = klp_target_state; > > clear_tsk_thread_flag(task, TIF_PATCH_PENDING); > > > > return 0; > > } > > > > , where is_ct_user(task) would return true when task is running in > > CONTEXT_USER. 
If I get the context_tracking API correctly then > > it might be implemeted the following way: > > That's not sufficient, you need to tag the remote task with a ct_work > item to also runs klp_update_patch_state(), otherwise the remote CPU can > enter kernel space between checking is_ct_user() and doing > klp_update_patch_state(): > > CPU0 CPU1 > > <user> > > if (is_ct_user()) // true > <kernel-entry> > // run some kernel code > klp_update_patch_state() > *WHOOPSIE* > > > So it needs to be something like: > > > CPU0 CPU1 > > <user> > > if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP)) > > <kernel-entry> > klp_update_patch_state klp_update_patch_state() > > > So that CPU0 and CPU1 race to complete klp_update_patch_state() *before* > any regular (!noinstr) code gets run. Grr, you are right. I thought that we migrated the task when entering kernel even before. But it seems that we do it only when leaving the kernel in exit_to_user_mode_loop(). > Which then means it needs to look something like: > > noinstr void klp_update_patch_state(struct task_struct *task) > { > struct thread_info *ti = task_thread_info(task); > > preempt_disable_notrace(); > if (arch_test_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags)) { > /* > * Order loads of TIF_PATCH_PENDING vs klp_target_state. > * See klp_init_transition(). > */ > smp_rmb(); > task->patch_state = __READ_ONCE(klp_target_state); > /* > * Concurrent against self; must observe updated > * task->patch_state if !TIF_PATCH_PENDING. > */ > smp_mb__before_atomic(); IMHO, smp_wmb() should be enough. We are here only when this CPU set task->patch_state right above. So that CPU running this code should see the correct task->patch_state. The read barrier is needed only when @task is entering kernel and does not see TIF_PATCH_PENDING. It is handled by smp_rmb() in the "else" branch below. It is possible that both CPUs see TIF_PATCH_PENDING and both set task->patch_state. 
But it should not cause any harm because they set the same value. Unless something really crazy happens with the internal CPU busses and caches. > arch_clear_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags); > } else { > /* > * Concurrent against self, see smp_mb__before_atomic() > * above. > */ > smp_rmb(); Yeah, this is the counter part against the above smp_wmb(). > } > preempt_enable_notrace(); > } Now, I am scared to increase my paranoia level and search for even more possible races. I feel overwhelmed at the moment ;-) Best Regards, Petr
On Wed, Oct 06, 2021 at 12:29:32PM +0200, Petr Mladek wrote: > On Wed 2021-10-06 11:04:26, Peter Zijlstra wrote: > > So it needs to be something like: > > > > > > CPU0 CPU1 > > > > <user> > > > > if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP)) > > > > <kernel-entry> > > klp_update_patch_state klp_update_patch_state() > > > > > > So that CPU0 and CPU1 race to complete klp_update_patch_state() *before* > > any regular (!noinstr) code gets run. > > Grr, you are right. I thought that we migrated the task when entering > kernel even before. But it seems that we do it only when leaving > the kernel in exit_to_user_mode_loop(). Yep... :-) > > Which then means it needs to look something like: > > > > noinstr void klp_update_patch_state(struct task_struct *task) > > { > > struct thread_info *ti = task_thread_info(task); > > > > preempt_disable_notrace(); > > if (arch_test_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags)) { > > /* > > * Order loads of TIF_PATCH_PENDING vs klp_target_state. > > * See klp_init_transition(). > > */ > > smp_rmb(); > > task->patch_state = __READ_ONCE(klp_target_state); > > /* > > * Concurrent against self; must observe updated > > * task->patch_state if !TIF_PATCH_PENDING. > > */ > > smp_mb__before_atomic(); > > IMHO, smp_wmb() should be enough. We are here only when this > CPU set task->patch_state right above. So that CPU running > this code should see the correct task->patch_state. Yes, I think smp_wmb() and smp_mb__before_atomic() are NOPS for all the same architectures, so that might indeed be a better choice. > The read barrier is needed only when @task is entering kernel and > does not see TIF_PATCH_PENDING. It is handled by smp_rmb() in > the "else" branch below. > > It is possible that both CPUs see TIF_PATCH_PENDING and both > set task->patch_state. But it should not cause any harm > because they set the same value. Unless something really > crazy happens with the internal CPU busses and caches. 
Right, not our problem :-) Lots would be broken beyond repair in that case. > > arch_clear_bit(TIF_PATCH_PENDING, (unsigned long *)&ti->flags); > > } else { > > /* > > * Concurrent against self, see smp_mb__before_atomic() > > * above. > > */ > > smp_rmb(); > > Yeah, this is the counter part against the above smp_wmb(). > > > } > > preempt_enable_notrace(); > > } > > Now, I am scared to increase my paranoia level and search for even more > possible races. I feel overwhelmed at the moment ;-) :-) Anyway, I still need to figure out how to extract this context tracking stuff from RCU and not make a giant mess of things, so until that time....
> > That's not sufficient, you need to tag the remote task with a ct_work
> > item to also runs klp_update_patch_state(), otherwise the remote CPU can
> > enter kernel space between checking is_ct_user() and doing
> > klp_update_patch_state():
> >
> > CPU0 CPU1
> >
> > <user>
> >
> > if (is_ct_user()) // true
> > <kernel-entry>
> > // run some kernel code
> > klp_update_patch_state()
> > *WHOOPSIE*
> >
> >
> > So it needs to be something like:
> >
> >
> > CPU0 CPU1
> >
> > <user>
> >
> > if (context_tracking_set_cpu_work(task_cpu(), CT_WORK_KLP))
> >
> > <kernel-entry>
> > klp_update_patch_state klp_update_patch_state()
> >
> >
> > So that CPU0 and CPU1 race to complete klp_update_patch_state() *before*
> > any regular (!noinstr) code gets run.
>
> Grr, you are right. I thought that we migrated the task when entering
> kernel even before. But it seems that we do it only when leaving
> the kernel in exit_to_user_mode_loop().
That is correct. You are probably confused by the old kGraft
implementation, which added the TIF flag to _TIF_WORK_SYSCALL_ENTRY as
well, so that syscall_trace_enter() processed it too. But the upstream
solution has always switched tasks only on their exit to user mode,
because that was deemed sufficient.
Btw, just for fun, the old kGraft in one of its first incarnations also
had the following horrible hack to "solve" exactly this problem.
+/*
+ * Tasks which are running in userspace after the patching has been started
+ * can immediately be marked as migrated to the new universe.
+ *
+ * If this function returns non-zero (i.e. also when error happens), the task
+ * needs to be migrated using kgraft lazy mechanism.
+ */
+static inline bool kgr_needs_lazy_migration(struct task_struct *p)
+{
+ unsigned long s[3];
+ struct stack_trace t = {
+ .nr_entries = 0,
+ .skip = 0,
+ .max_entries = 3,
+ .entries = s,
+ };
+
+ save_stack_trace_tsk(p, &t);
+
+ return t.nr_entries > 2;
+}
Miroslav
On Wed 2021-09-29 17:17:33, Peter Zijlstra wrote:
> The whole premise is broken, any trace usage outside of RCU is a BUG
> and should be fixed. Notable all code prior (and a fair bit after)
> ct_user_exit() is noinstr and explicitly doesn't allow any tracing.

I see. Is the situation the same with idle threads? I see that some
functions, called inside rcu_idle_enter()/exit() are still visible for
ftrace. For example, arch_cpu_idle() or tick_check_broadcast_expired().

That said, I am not sure if anyone would ever want to livepatch this
code. And there is still a workaround by livepatching the callers.

Also it should be easy to catch eventual problems if we check
rcu_is_watching() in klp_ftrace_handler(). Most affected scheduler and
idle code paths will likely be called even during the simple boot test.

> Use regular RCU, which has the benefit of actually respecting
> NOHZ_FULL.

Good to know.

After all, the patch looks good to me. I would just add something like:

--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -69,6 +69,12 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 	if (WARN_ON_ONCE(!func))
 		goto unlock;
 
+	if (!rcu_is_watching()) {
+		printk_deferred_once(KERN_WARNING
+			"warning: called \"%s\" without RCU watching. The function is not guarded by the consistency model. Also beware when removing the livepatch module.\n",
+			func->old_name);
+	}
+
 	/*
 	 * In the enable path, enforce the order of the ops->func_stack and
 	 * func->transition reads. The corresponding write barrier is in

But we could do this in a separate patch later.

Feel free to use:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr
On Wed, Oct 06, 2021 at 11:16:21AM +0200, Miroslav Benes wrote:
> On Wed, 29 Sep 2021, Peter Zijlstra wrote:
>
> > Make sure to prod idle CPUs so they call klp_update_patch_state().
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/livepatch/transition.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
> >  	for_each_possible_cpu(cpu) {
> >  		task = idle_task(cpu);
> >  		if (cpu_online(cpu)) {
> > -			if (!klp_try_switch_task(task))
> > +			if (!klp_try_switch_task(task)) {
> >  				complete = false;
> > +				/* Make idle task go through the main loop. */
> > +				wake_up_if_idle(cpu);
> > +			}
>
> Right, it should be enough.
>
> Acked-by: Miroslav Benes <mbenes@suse.cz>
>
> It would be nice to get Vasily's Tested-by tag on this one.

I gave patches a spin on s390 with livepatch kselftest as well as with
https://github.com/lpechacek/qa_test_klp.git

BTW, commit 43c79fbad385 ("klp_tc_17: Avoid running the test on s390x")
is no longer required, since s390 implements HAVE_KPROBES_ON_FTRACE
since v5.6, so I just reverted test disablement.

Patches 1-6 work nicely, for them

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390

Thanks a lot!

Starting with patch 8 is where I start seeing this with my config:

Oct 07 10:46:00 kernel: Freeing unused kernel image (initmem) memory: 6524K
Oct 07 10:46:00 kernel: INFO: task swapper/0:1 blocked for more than 122 seconds.
Oct 07 10:46:00 kernel:       Not tainted 5.15.0-rc4-69810-ga714851e1aad-dirty #74
Oct 07 10:46:00 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 07 10:46:00 kernel: task:swapper/0       state:D stack:10648 pid:    1 ppid:     0 flags:0x00000000
Oct 07 10:46:00 kernel: Call Trace:
Oct 07 10:46:00 kernel:  [<0000000000e164b6>] __schedule+0x36e/0x8b0
Oct 07 10:46:00 kernel:  [<0000000000e16a4e>] schedule+0x56/0x128
Oct 07 10:46:00 kernel:  [<0000000000e1e426>] schedule_timeout+0x106/0x160
Oct 07 10:46:00 kernel:  [<0000000000e18316>] wait_for_completion+0xc6/0x118
Oct 07 10:46:00 kernel:  [<000000000020a15c>] rcu_barrier.part.0+0x17c/0x2c0
Oct 07 10:46:00 kernel:  [<0000000000e0fcc0>] kernel_init+0x60/0x168
Oct 07 10:46:00 kernel:  [<000000000010390c>] __ret_from_fork+0x3c/0x58
Oct 07 10:46:00 kernel:  [<0000000000e2094a>] ret_from_fork+0xa/0x30
Oct 07 10:46:00 kernel: 1 lock held by swapper/0/1:
Oct 07 10:46:00 kernel:  #0: 0000000001469600 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x42/0x80
Oct 07 10:46:00 kernel: Showing all locks held in the system:
Oct 07 10:46:00 kernel: 1 lock held by swapper/0/1:
Oct 07 10:46:00 kernel:  #0: 0000000001469600 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x42/0x80
Oct 07 10:46:00 kernel: 2 locks held by kworker/u680:0/8:
Oct 07 10:46:00 kernel:  #0: 000000008013cd48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x222/0x738
Oct 07 10:46:00 kernel:  #1: 0000038000043dc8 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x222/0x738
Oct 07 10:46:00 kernel: 1 lock held by khungtaskd/413:
Oct 07 10:46:00 kernel:  #0: 000000000145c980 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x50
Oct 07 10:46:00 kernel:
Oct 07 10:46:00 kernel: =============================================

So, will keep an eye on the rest of these patches and re-test in
future, thanks!
On Thu, Oct 07, 2021 at 11:18:13AM +0200, Vasily Gorbik wrote:
> Patches 1-6 work nicely, for them
>
> Acked-by: Vasily Gorbik <gor@linux.ibm.com>
> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
>
> Thanks a lot!

Thanks, will add tags.

> Starting with patch 8 is where I start seeing this with my config:

Yeah, I properly wrecked things there.. still trying to work out how to
fix it :-)
On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
> {
> int cpu;
>
> - preempt_disable();
> + cpus_read_lock();
> for_each_online_cpu(cpu) {
> - if (cpu == smp_processor_id())
> + if (cpu == raw_smp_processor_id())
> continue;
>
> wake_up_if_idle(cpu);
> }
> - preempt_enable();
> + cpus_read_unlock();
Peter, it looks like this thing introduced a deadlock during CPU online/offline.
[ 630.145166][ T129] WARNING: possible recursive locking detected
[ 630.151164][ T129] 5.15.0-rc5-next-20211013+ #145 Not tainted
[ 630.156988][ T129] --------------------------------------------
[ 630.162984][ T129] cpuhp/21/129 is trying to acquire lock:
[ 630.168547][ T129] ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: wake_up_all_idle_cpus+0x40/0xe8
wake_up_all_idle_cpus at /usr/src/linux-next/kernel/smp.c:1174
[ 630.178040][ T129]
[ 630.178040][ T129] other info that might help us debug this:
[ 630.202292][ T129] Possible unsafe locking scenario:
[ 630.202292][ T129]
[ 630.209590][ T129] CPU0
[ 630.212720][ T129] ----
[ 630.215851][ T129] lock(cpu_hotplug_lock);
[ 630.220202][ T129] lock(cpu_hotplug_lock);
[ 630.224553][ T129]
[ 630.224553][ T129] *** DEADLOCK ***
[ 630.224553][ T129]
[ 630.232545][ T129] May be due to missing lock nesting notation
[ 630.232545][ T129]
[ 630.240711][ T129] 3 locks held by cpuhp/21/129:
[ 630.245406][ T129] #0: ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
[ 630.254976][ T129] #1: ffff800011f46780 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
[ 630.264372][ T129] #2: ffff8000191fb9c8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x24/0x38
[ 630.274031][ T129]
[ 630.274031][ T129] stack backtrace:
[ 630.279767][ T129] CPU: 21 PID: 129 Comm: cpuhp/21 Not tainted 5.15.0-rc5-next-20211013+ #145
[ 630.288371][ T129] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
[ 630.296886][ T129] Call trace:
[ 630.300017][ T129] dump_backtrace+0x0/0x3b8
[ 630.304369][ T129] show_stack+0x20/0x30
[ 630.308371][ T129] dump_stack_lvl+0x8c/0xb8
[ 630.312722][ T129] dump_stack+0x1c/0x38
[ 630.316723][ T129] validate_chain+0x1d84/0x1da0
[ 630.321421][ T129] __lock_acquire+0xab0/0x2040
[ 630.326033][ T129] lock_acquire+0x32c/0xb08
[ 630.330390][ T129] cpus_read_lock+0x94/0x308
[ 630.334827][ T129] wake_up_all_idle_cpus+0x40/0xe8
[ 630.339784][ T129] cpuidle_uninstall_idle_handler+0x3c/0x50
[ 630.345524][ T129] cpuidle_pause_and_lock+0x28/0x38
[ 630.350569][ T129] acpi_processor_hotplug+0xc0/0x170
[ 630.355701][ T129] acpi_soft_cpu_online+0x124/0x250
[ 630.360745][ T129] cpuhp_invoke_callback+0x51c/0x2ab8
[ 630.365963][ T129] cpuhp_thread_fun+0x204/0x588
[ 630.370659][ T129] smpboot_thread_fn+0x3f0/0xc40
[ 630.375444][ T129] kthread+0x3d8/0x488
[ 630.379360][ T129] ret_from_fork+0x10/0x20
[ 863.525716][ T191] INFO: task cpuhp/21:129 blocked for more than 122 seconds.
[ 863.532954][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
[ 863.539361][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 863.547927][ T191] task:cpuhp/21 state:D stack:59104 pid: 129 ppid: 2 flags:0x00000008
[ 863.557029][ T191] Call trace:
[ 863.560171][ T191] __switch_to+0x184/0x400
[ 863.564448][ T191] __schedule+0x74c/0x1940
[ 863.568753][ T191] schedule+0x110/0x318
[ 863.572764][ T191] percpu_rwsem_wait+0x1b8/0x348
[ 863.577592][ T191] __percpu_down_read+0xb0/0x148
[ 863.582386][ T191] cpus_read_lock+0x2b0/0x308
[ 863.586961][ T191] wake_up_all_idle_cpus+0x40/0xe8
[ 863.591931][ T191] cpuidle_uninstall_idle_handler+0x3c/0x50
[ 863.597716][ T191] cpuidle_pause_and_lock+0x28/0x38
[ 863.602771][ T191] acpi_processor_hotplug+0xc0/0x170
[ 863.607946][ T191] acpi_soft_cpu_online+0x124/0x250
[ 863.613001][ T191] cpuhp_invoke_callback+0x51c/0x2ab8
[ 863.618261][ T191] cpuhp_thread_fun+0x204/0x588
[ 863.622967][ T191] smpboot_thread_fn+0x3f0/0xc40
[ 863.627787][ T191] kthread+0x3d8/0x488
[ 863.631712][ T191] ret_from_fork+0x10/0x20
[ 863.636020][ T191] INFO: task kworker/0:2:189 blocked for more than 122 seconds.
[ 863.643500][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
[ 863.649882][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 863.658425][ T191] task:kworker/0:2 state:D stack:58368 pid: 189 ppid: 2 flags:0x00000008
[ 863.667516][ T191] Workqueue: events vmstat_shepherd
[ 863.672573][ T191] Call trace:
[ 863.675731][ T191] __switch_to+0x184/0x400
[ 863.680001][ T191] __schedule+0x74c/0x1940
[ 863.684268][ T191] schedule+0x110/0x318
[ 863.688295][ T191] percpu_rwsem_wait+0x1b8/0x348
[ 863.693085][ T191] __percpu_down_read+0xb0/0x148
[ 863.697892][ T191] cpus_read_lock+0x2b0/0x308
[ 863.702421][ T191] vmstat_shepherd+0x5c/0x1a8
[ 863.706977][ T191] process_one_work+0x808/0x19d0
[ 863.711767][ T191] worker_thread+0x334/0xae0
[ 863.716227][ T191] kthread+0x3d8/0x488
[ 863.720149][ T191] ret_from_fork+0x10/0x20
[ 863.724487][ T191] INFO: task lsbug:4642 blocked for more than 123 seconds.
[ 863.731565][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
[ 863.737938][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 863.746490][ T191] task:lsbug state:D stack:55536 pid: 4642 ppid: 4638 flags:0x00000008
[ 863.755549][ T191] Call trace:
[ 863.758712][ T191] __switch_to+0x184/0x400
[ 863.762984][ T191] __schedule+0x74c/0x1940
[ 863.767286][ T191] schedule+0x110/0x318
[ 863.771294][ T191] schedule_timeout+0x188/0x238
[ 863.776016][ T191] wait_for_completion+0x174/0x290
[ 863.780979][ T191] __cpuhp_kick_ap+0x158/0x1a8
[ 863.785592][ T191] cpuhp_kick_ap+0x1f0/0x828
[ 863.790053][ T191] bringup_cpu+0x180/0x1e0
[ 863.794320][ T191] cpuhp_invoke_callback+0x51c/0x2ab8
[ 863.799561][ T191] cpuhp_invoke_callback_range+0xa4/0x108
[ 863.805130][ T191] cpu_up+0x528/0xd78
[ 863.808982][ T191] cpu_device_up+0x4c/0x68
[ 863.813249][ T191] cpu_subsys_online+0xc0/0x1f8
[ 863.817972][ T191] device_online+0x10c/0x180
[ 863.822413][ T191] online_store+0x10c/0x118
[ 863.826791][ T191] dev_attr_store+0x44/0x78
[ 863.831148][ T191] sysfs_kf_write+0xe8/0x138
[ 863.835590][ T191] kernfs_fop_write_iter+0x26c/0x3d0
[ 863.840745][ T191] new_sync_write+0x2bc/0x4f8
[ 863.845275][ T191] vfs_write+0x714/0xcd8
[ 863.849387][ T191] ksys_write+0xf8/0x1e0
[ 863.853481][ T191] __arm64_sys_write+0x74/0xa8
[ 863.858113][ T191] invoke_syscall.constprop.0+0xdc/0x1d8
[ 863.863597][ T191] do_el0_svc+0xe4/0x298
[ 863.867710][ T191] el0_svc+0x64/0x130
[ 863.871545][ T191] el0t_64_sync_handler+0xb0/0xb8
[ 863.876437][ T191] el0t_64_sync+0x180/0x184
[ 863.880798][ T191] INFO: task mount:4682 blocked for more than 123 seconds.
[ 863.887881][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
[ 863.894232][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 863.902776][ T191] task:mount state:D stack:55856 pid: 4682 ppid: 1101 flags:0x00000000
[ 863.911865][ T191] Call trace:
[ 863.915003][ T191] __switch_to+0x184/0x400
[ 863.919296][ T191] __schedule+0x74c/0x1940
[ 863.923564][ T191] schedule+0x110/0x318
[ 863.927590][ T191] percpu_rwsem_wait+0x1b8/0x348
[ 863.932380][ T191] __percpu_down_read+0xb0/0x148
[ 863.937187][ T191] cpus_read_lock+0x2b0/0x308
[ 863.941715][ T191] alloc_workqueue+0x730/0xd48
[ 863.946357][ T191] loop_configure+0x2d4/0x1180 [loop]
[ 863.951592][ T191] lo_ioctl+0x5dc/0x1228 [loop]
[ 863.956321][ T191] blkdev_ioctl+0x258/0x820
[ 863.960678][ T191] __arm64_sys_ioctl+0x114/0x180
[ 863.965468][ T191] invoke_syscall.constprop.0+0xdc/0x1d8
[ 863.970974][ T191] do_el0_svc+0xe4/0x298
[ 863.975069][ T191] el0_svc+0x64/0x130
[ 863.978922][ T191] el0t_64_sync_handler+0xb0/0xb8
[ 863.983798][ T191] el0t_64_sync+0x180/0x184
[ 863.988172][ T191] INFO: lockdep is turned off.
On Wed, Sep 29, 2021 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Make sure to prod idle CPUs so they call klp_update_patch_state().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/livepatch/transition.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
> for_each_possible_cpu(cpu) {
> task = idle_task(cpu);
> if (cpu_online(cpu)) {
> - if (!klp_try_switch_task(task))
> + if (!klp_try_switch_task(task)) {
> complete = false;
> + /* Make idle task go through the main loop. */
> + wake_up_if_idle(cpu);
> + }
This caused a build regression on non-SMP kernels:
x86_64-linux-ld: kernel/livepatch/transition.o: in function
`klp_try_complete_transition':
transition.c:(.text+0x106e): undefined reference to `wake_up_if_idle'
Maybe add a IS_ENABLED(CONFIG_SMP) check to one of the if() conditions?
Arnd
On Wed, Oct 13, 2021 at 09:37:01PM +0200, Arnd Bergmann wrote:
> On Wed, Sep 29, 2021 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Make sure to prod idle CPUs so they call klp_update_patch_state().
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/livepatch/transition.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -413,8 +413,11 @@ void klp_try_complete_transition(void)
> >  	for_each_possible_cpu(cpu) {
> >  		task = idle_task(cpu);
> >  		if (cpu_online(cpu)) {
> > -			if (!klp_try_switch_task(task))
> > +			if (!klp_try_switch_task(task)) {
> >  				complete = false;
> > +				/* Make idle task go through the main loop. */
> > +				wake_up_if_idle(cpu);
> > +			}
>
> This caused a build regression on non-SMP kernels:

:-(

> x86_64-linux-ld: kernel/livepatch/transition.o: in function
> `klp_try_complete_transition':
> transition.c:(.text+0x106e): undefined reference to `wake_up_if_idle'
>
> Maybe add a IS_ENABLED(CONFIG_SMP) check to one of the if() conditions?

I'll just add a stub for that function I think. Let me rebase the thing
before I push more crap ontop..
Peter, any thoughts? I did confirm that reverting the commit fixed the issue.
On 10/13/2021 10:32 AM, Qian Cai wrote:
>
>
> On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
>> {
>> int cpu;
>>
>> - preempt_disable();
>> + cpus_read_lock();
>> for_each_online_cpu(cpu) {
>> - if (cpu == smp_processor_id())
>> + if (cpu == raw_smp_processor_id())
>> continue;
>>
>> wake_up_if_idle(cpu);
>> }
>> - preempt_enable();
>> + cpus_read_unlock();
>
> Peter, it looks like this thing introduced a deadlock during CPU online/offline.
>
> [ ... full lockdep splat and hung task traces snipped ... ]
On Mon, Oct 18, 2021 at 11:47:32PM -0400, Qian Cai wrote:
> Peter, any thoughts? I did confirm that reverting the commit fixed the issue.
>
> On 10/13/2021 10:32 AM, Qian Cai wrote:
> >
> >
> > On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
> >> --- a/kernel/smp.c
> >> +++ b/kernel/smp.c
> >> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
> >> {
> >> int cpu;
> >>
> >> - preempt_disable();
> >> + cpus_read_lock();
> >> for_each_online_cpu(cpu) {
> >> - if (cpu == smp_processor_id())
> >> + if (cpu == raw_smp_processor_id())
> >> continue;
> >>
> >> wake_up_if_idle(cpu);
> >> }
> >> - preempt_enable();
> >> + cpus_read_unlock();
Right, so yesterday I thought: YW2KGrvvv/vSA+97@hirez.programming.kicks-ass.net
but today I might have another idea, lemme go prod at this a bit more.
On Tue, Oct 19, 2021 at 10:56:48AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 18, 2021 at 11:47:32PM -0400, Qian Cai wrote:
> > Peter, any thoughts? I did confirm that reverting the commit fixed the issue.
> >
> > On 10/13/2021 10:32 AM, Qian Cai wrote:
> > >
> > >
> > > On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
> > >> --- a/kernel/smp.c
> > >> +++ b/kernel/smp.c
> > >> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
> > >>  {
> > >>  	int cpu;
> > >>
> > >> -	preempt_disable();
> > >> +	cpus_read_lock();
> > >>  	for_each_online_cpu(cpu) {
> > >> -		if (cpu == smp_processor_id())
> > >> +		if (cpu == raw_smp_processor_id())
> > >>  			continue;
> > >>
> > >>  		wake_up_if_idle(cpu);
> > >>  	}
> > >> -	preempt_enable();
> > >> +	cpus_read_unlock();
>
> Right, so yesterday I thought: YW2KGrvvv/vSA+97@hirez.programming.kicks-ass.net
> but today I might have another idea, lemme go prod at this a bit more.

Here, how's this then?

---
diff --git a/kernel/smp.c b/kernel/smp.c
index ad0b68a3a3d3..61ddc7a3bafa 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
 {
 	int cpu;
 
-	cpus_read_lock();
-	for_each_online_cpu(cpu) {
-		if (cpu == raw_smp_processor_id())
-			continue;
-
-		wake_up_if_idle(cpu);
+	for_each_cpu(cpu) {
+		preempt_disable();
+		if (cpu != smp_processor_id() && cpu_online(cpu))
+			wake_up_if_idle(cpu);
+		preempt_enable();
 	}
-	cpus_read_unlock();
 }
 EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
On 10/19/2021 5:10 AM, Peter Zijlstra wrote:
> Here, hows this then?
>
> ---
> diff --git a/kernel/smp.c b/kernel/smp.c
> index ad0b68a3a3d3..61ddc7a3bafa 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
> {
> int cpu;
>
> - cpus_read_lock();
> - for_each_online_cpu(cpu) {
> - if (cpu == raw_smp_processor_id())
> - continue;
> -
> - wake_up_if_idle(cpu);
> + for_each_cpu(cpu) {
> + preempt_disable();
> + if (cpu != smp_processor_id() && cpu_online(cpu))
> + wake_up_if_idle(cpu);
> + preempt_enable();
> }
> - cpus_read_unlock();
> }
> EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
This does not compile.
kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given
On Tue, Oct 19, 2021 at 11:32:15AM -0400, Qian Cai wrote:
>
>
> On 10/19/2021 5:10 AM, Peter Zijlstra wrote:
> > Here, hows this then?
> >
> > ---
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index ad0b68a3a3d3..61ddc7a3bafa 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
> > {
> > int cpu;
> >
> > - cpus_read_lock();
> > - for_each_online_cpu(cpu) {
> > - if (cpu == raw_smp_processor_id())
> > - continue;
> > -
> > - wake_up_if_idle(cpu);
> > + for_each_cpu(cpu) {
> > + preempt_disable();
> > + if (cpu != smp_processor_id() && cpu_online(cpu))
> > + wake_up_if_idle(cpu);
> > + preempt_enable();
> > }
> > - cpus_read_unlock();
> > }
> > EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
>
> This does not compile.
>
> kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given
Bah, for_each_possible_cpu(), lemme update the patch, I'm sure the
robots will scream bloody murder on that too.
On 10/19/21 11:50 AM, Peter Zijlstra wrote:
>>> diff --git a/kernel/smp.c b/kernel/smp.c
>>> index ad0b68a3a3d3..61ddc7a3bafa 100644
>>> --- a/kernel/smp.c
>>> +++ b/kernel/smp.c
>>> @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
>>> {
>>> int cpu;
>>>
>>> - cpus_read_lock();
>>> - for_each_online_cpu(cpu) {
>>> - if (cpu == raw_smp_processor_id())
>>> - continue;
>>> -
>>> - wake_up_if_idle(cpu);
>>> + for_each_cpu(cpu) {
>>> + preempt_disable();
>>> + if (cpu != smp_processor_id() && cpu_online(cpu))
>>> + wake_up_if_idle(cpu);
>>> + preempt_enable();
>>> }
>>> - cpus_read_unlock();
>>> }
>>> EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
>>
>> This does not compile.
>>
>> kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given
>
> Bah, for_each_possible_cpu(), lemme update the patch, I'm sure the
> robots will scream bloody murder on that too.
This survived the regression tests here.
On Tue, Oct 19, 2021 at 03:22:43PM -0400, Qian Cai wrote:
>
>
> On 10/19/21 11:50 AM, Peter Zijlstra wrote:
> >>> diff --git a/kernel/smp.c b/kernel/smp.c
> >>> index ad0b68a3a3d3..61ddc7a3bafa 100644
> >>> --- a/kernel/smp.c
> >>> +++ b/kernel/smp.c
> >>> @@ -1170,14 +1170,12 @@ void wake_up_all_idle_cpus(void)
> >>> {
> >>> int cpu;
> >>>
> >>> - cpus_read_lock();
> >>> - for_each_online_cpu(cpu) {
> >>> - if (cpu == raw_smp_processor_id())
> >>> - continue;
> >>> -
> >>> - wake_up_if_idle(cpu);
> >>> + for_each_cpu(cpu) {
> >>> + preempt_disable();
> >>> + if (cpu != smp_processor_id() && cpu_online(cpu))
> >>> + wake_up_if_idle(cpu);
> >>> + preempt_enable();
> >>> }
> >>> - cpus_read_unlock();
> >>> }
> >>> EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
> >>
> >> This does not compile.
> >>
> >> kernel/smp.c:1173:18: error: macro "for_each_cpu" requires 2 arguments, but only 1 given
> >
> > Bah, for_each_possible_cpu(), lemme update the patch, I'm sure the
> > robots will scream bloody murder on that too.
>
> This survived the regression tests here.
Thanks!
Peter,

static __always_inline void arch_exit_to_user_mode(void)
{
	mds_user_clear_cpu_buffers();
}

/**
 * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
 *
 * Clear CPU buffers if the corresponding static key is enabled
 */
static __always_inline void mds_user_clear_cpu_buffers(void)
{
	if (static_branch_likely(&mds_user_clear))
		mds_clear_cpu_buffers();
}

We were discussing how to perform objtool style validation
that no code after the check for

> +	if (work & CT_WORK_SYNC) {
> +		/* NMI happens here and must still do/finish CT_WORK_n */
> +		sync_core();

But after the discussion with you, it seems doing the TLB checking
and (also sync_core) checking very late/very early on exit/entry
makes things easier to review.

Can then use a single atomic variable with USER/KERNEL state and
cmpxchg loops.

On Wed, Sep 29, 2021 at 05:17:34PM +0200, Peter Zijlstra wrote:
> Use the new context_tracking infrastructure to avoid disturbing
> userspace tasks when we rewrite kernel code.
>
> XXX re-audit the entry code to make sure only the context_tracking
> static_branch is before hitting this code.
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > arch/x86/include/asm/sync_core.h | 2 ++ > arch/x86/kernel/alternative.c | 8 +++++++- > include/linux/context_tracking.h | 1 + > kernel/context_tracking.c | 12 ++++++++++++ > 4 files changed, 22 insertions(+), 1 deletion(-) > > --- a/arch/x86/include/asm/sync_core.h > +++ b/arch/x86/include/asm/sync_core.h > @@ -87,6 +87,8 @@ static inline void sync_core(void) > */ > iret_to_self(); > } > +#define sync_core sync_core > + > > /* > * Ensure that a core serializing instruction is issued before returning > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -18,6 +18,7 @@ > #include <linux/mmu_context.h> > #include <linux/bsearch.h> > #include <linux/sync_core.h> > +#include <linux/context_tracking.h> > #include <asm/text-patching.h> > #include <asm/alternative.h> > #include <asm/sections.h> > @@ -924,9 +925,14 @@ static void do_sync_core(void *info) > sync_core(); > } > > +static bool do_sync_core_cond(int cpu, void *info) > +{ > + return !context_tracking_set_cpu_work(cpu, CT_WORK_SYNC); > +} > + > void text_poke_sync(void) > { > - on_each_cpu(do_sync_core, NULL, 1); > + on_each_cpu_cond(do_sync_core_cond, do_sync_core, NULL, 1); > } > > struct text_poke_loc { > --- a/include/linux/context_tracking.h > +++ b/include/linux/context_tracking.h > @@ -11,6 +11,7 @@ > > enum ct_work { > CT_WORK_KLP = 1, > + CT_WORK_SYNC = 2, > }; > > /* > --- a/kernel/context_tracking.c > +++ b/kernel/context_tracking.c > @@ -51,6 +51,10 @@ static __always_inline void context_trac > __this_cpu_dec(context_tracking.recursion); > } > > +#ifndef sync_core > +static inline void sync_core(void) { } > +#endif > + > /* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */ > static noinstr void ct_exit_user_work(struct context_tracking *ct) > { > @@ -64,6 +68,14 @@ static noinstr void ct_exit_user_work(struct > arch_atomic_andnot(CT_WORK_KLP, &ct->work); > } > > + if 
(work & CT_WORK_SYNC) { > + /* NMI happens here and must still do/finish CT_WORK_n */ > + sync_core(); > + > + smp_mb__before_atomic(); > + arch_atomic_andnot(CT_WORK_SYNC, &ct->work); > + } > + > smp_mb__before_atomic(); > arch_atomic_andnot(CT_SEQ_WORK, &ct->seq); > } > > >
+CC Nicolas.
On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> Peter,
>
> static __always_inline void arch_exit_to_user_mode(void)
> {
> mds_user_clear_cpu_buffers();
> }
>
> /**
> * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
> *
> * Clear CPU buffers if the corresponding static key is enabled
> */
> static __always_inline void mds_user_clear_cpu_buffers(void)
> {
> if (static_branch_likely(&mds_user_clear))
> mds_clear_cpu_buffers();
> }
>
> We were discussing how to perform objtool style validation
> that no code after the check for
>
> > + /* NMI happens here and must still do/finish CT_WORK_n */
> > + sync_core();
>
> But after the discussion with you, it seems doing the TLB checking
> and (also sync_core) checking very late/very early on exit/entry
> makes things easier to review.
>
> Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> loops.
>
> On Wed, Sep 29, 2021 at 05:17:34PM +0200, Peter Zijlstra wrote:
> > Use the new context_tracking infrastructure to avoid disturbing
> > userspace tasks when we rewrite kernel code.
> >
> > XXX re-audit the entry code to make sure only the context_tracking
> > static_branch is before hitting this code.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > arch/x86/include/asm/sync_core.h | 2 ++
> > arch/x86/kernel/alternative.c | 8 +++++++-
> > include/linux/context_tracking.h | 1 +
> > kernel/context_tracking.c | 12 ++++++++++++
> > 4 files changed, 22 insertions(+), 1 deletion(-)
> >
> > --- a/arch/x86/include/asm/sync_core.h
> > +++ b/arch/x86/include/asm/sync_core.h
> > @@ -87,6 +87,8 @@ static inline void sync_core(void)
> > */
> > iret_to_self();
> > }
> > +#define sync_core sync_core
> > +
> >
> > /*
> > * Ensure that a core serializing instruction is issued before returning
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -18,6 +18,7 @@
> > #include <linux/mmu_context.h>
> > #include <linux/bsearch.h>
> > #include <linux/sync_core.h>
> > +#include <linux/context_tracking.h>
> > #include <asm/text-patching.h>
> > #include <asm/alternative.h>
> > #include <asm/sections.h>
> > @@ -924,9 +925,14 @@ static void do_sync_core(void *info)
> > sync_core();
> > }
> >
> > +static bool do_sync_core_cond(int cpu, void *info)
> > +{
> > + return !context_tracking_set_cpu_work(cpu, CT_WORK_SYNC);
> > +}
> > +
> > void text_poke_sync(void)
> > {
> > - on_each_cpu(do_sync_core, NULL, 1);
> > + on_each_cpu_cond(do_sync_core_cond, do_sync_core, NULL, 1);
> > }
> >
> > struct text_poke_loc {
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -11,6 +11,7 @@
> >
> > enum ct_work {
> > CT_WORK_KLP = 1,
> > + CT_WORK_SYNC = 2,
> > };
> >
> > /*
> > --- a/kernel/context_tracking.c
> > +++ b/kernel/context_tracking.c
> > @@ -51,6 +51,10 @@ static __always_inline void context_trac
> > __this_cpu_dec(context_tracking.recursion);
> > }
> >
> > +#ifndef sync_core
> > +static inline void sync_core(void) { }
> > +#endif
> > +
> > /* CT_WORK_n, must be noinstr, non-blocking, NMI safe and deal with spurious calls */
> > static noinstr void ct_exit_user_work(struct context_tracking *ct)
> > {
> > @@ -64,6 +68,14 @@ static noinstr void ct_exit_user_work(struct
> > arch_atomic_andnot(CT_WORK_KLP, &ct->work);
> > }
> >
> > + if (work & CT_WORK_SYNC) {
> > + /* NMI happens here and must still do/finish CT_WORK_n */
> > + sync_core();
> > +
> > + smp_mb__before_atomic();
> > + arch_atomic_andnot(CT_WORK_SYNC, &ct->work);
> > + }
> > +
> > smp_mb__before_atomic();
> > arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
> > }
> >
> >
> >
On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> Peter,
>
> static __always_inline void arch_exit_to_user_mode(void)
> {
>	mds_user_clear_cpu_buffers();
> }
>
> /**
>  * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
>  *
>  * Clear CPU buffers if the corresponding static key is enabled
>  */
> static __always_inline void mds_user_clear_cpu_buffers(void)
> {
>	if (static_branch_likely(&mds_user_clear))
>		mds_clear_cpu_buffers();
> }
>
> We were discussing how to perform objtool style validation
> that no code after the check for

I'm not sure what the point of the above is... Were you trying to ask
for validation that nothing runs after the mds_user_clear_cpu_buffer()?

That isn't strictly true today, there's lockdep code after it. I can't
recall why that order is as it is though.

Pretty much everything in noinstr is magical, we just have to think
harder there (and possibly start writing more comments there).

> > +	/* NMI happens here and must still do/finish CT_WORK_n */
> > +	sync_core();
>
> But after the discussion with you, it seems doing the TLB checking
> and (also sync_core) checking very late/very early on exit/entry
> makes things easier to review.

I don't know about late, it must happen *very* early in entry. The
sync_core() must happen before any self-modifying code gets called
(static_branch, static_call, etc..) with possible exception of the
context_tracking static_branch.

The TLBi must also happen super early, possibly while still on the
entry stack (since the task stack is vmap'ed). We currently don't run C
code on the entry stack, that needs quite a bit of careful work to make
happen.

> Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> loops.

We're not going to add an atomic to context tracking. There is one, we
just got to extract/share it with RCU.
On Thu, Oct 21, 2021 at 09:25:43PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> > Peter,
> >
> > static __always_inline void arch_exit_to_user_mode(void)
> > {
> >	mds_user_clear_cpu_buffers();
> > }
> >
> > /**
> >  * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
> >  *
> >  * Clear CPU buffers if the corresponding static key is enabled
> >  */
> > static __always_inline void mds_user_clear_cpu_buffers(void)
> > {
> >	if (static_branch_likely(&mds_user_clear))
> >		mds_clear_cpu_buffers();
> > }
> >
> > We were discussing how to perform objtool style validation
> > that no code after the check for
>
> I'm not sure what the point of the above is... Were you trying to ask
> for validation that nothing runs after the mds_user_clear_cpu_buffer()?
>
> That isn't strictly true today, there's lockdep code after it. I can't
> recall why that order is as it is though.
>
> Pretty much everything in noinstr is magical, we just have to think
> harder there (and possibly start writing more comments there).

mds_user_clear_cpu_buffers happens after sync_core, in your patchset,
if i am not mistaken.

> > > +	/* NMI happens here and must still do/finish CT_WORK_n */
> > > +	sync_core();
> >
> > But after the discussion with you, it seems doing the TLB checking
> > and (also sync_core) checking very late/very early on exit/entry
> > makes things easier to review.
>
> I don't know about late, it must happen *very* early in entry. The
> sync_core() must happen before any self-modifying code gets called
> (static_branch, static_call, etc..) with possible exception of the
> context_tracking static_branch.
>
> The TLBi must also happen super early, possibly while still on the
> entry stack (since the task stack is vmap'ed).

But will it be ever be freed/remapped from other CPUs while the task
is running?

> We currently don't run C
> code on the entry stack, that needs quite a bit of careful work to make
> happen.

Was thinking of coding in ASM after (as early as possible) the write to
switch to kernel CR3:

Kernel entry:
-------------

	cpu = smp_processor_id();

	if (isolation_enabled(cpu)) {
		reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
		if (reqs & CPU_REQ_FLUSH_TLB)
			flush_tlb_all();
		if (reqs & CPU_REQ_SYNC_CORE)
			sync_core();
	}

Exit to userspace (as close to write to CR3 with user pagetable
pointer):
-----------------

	cpu = smp_processor_id();

	if (isolation_enabled(cpu)) {
		atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
	}

You think that is a bad idea (in ASM, not C) ?

And request side can be in C:

Request side:
-------------

	int targetcpu;

	do {
		struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu);

		old_state = pcpudata->user_kernel_state;

		/* in kernel mode ? */
		if (!(old_state & IN_USER_MODE)) {
			smp_call_function_single(request_fn, targetcpu, 1);
			break;
		}

		new_state = remote_state | CPU_REQ_FLUSH_TLB; // (or CPU_REQ_INV_ICACHE)
	} while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state);

(need logic to protect from atomic_cmpxchg always failing, but shouldnt
be difficult).

> > Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> > loops.
>
> We're not going to add an atomic to context tracking. There is one, we
> just got to extract/share it with RCU.

Again, to avoid kernel TLB flushes you'd have to ensure:

kernel entry:

	instrA addr1,addr2,addr3
	instrB addr2,addr3,addr4	<--- that no address here has TLBs
	instrC addr5,addr6,addr7	     modified and flushed

	reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
	if (reqs & CPU_REQ_FLUSH_TLB)
		flush_tlb_all();

kernel exit:

	atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);

	instrA addr1,addr2,addr3
	instrB addr2,addr3,addr4	<--- that no address here has TLBs
					     modified and flushed

This could be conditional on "task isolated mode" enabled (would be
better if it didnt, though).
On Thu, Oct 21, 2021 at 04:57:09PM -0300, Marcelo Tosatti wrote:
> > Pretty much everything in noinstr is magical, we just have to think
> > harder there (and possibly start writing more comments there).
>
> mds_user_clear_cpu_buffers happens after sync_core, in your patchset,
> if i am not mistaken.

Of course it does, mds_user_clear_cpu_buffers() is on exit, the
sync_core() is on entry.

> > > > +	/* NMI happens here and must still do/finish CT_WORK_n */
> > > > +	sync_core();
> > >
> > > But after the discussion with you, it seems doing the TLB checking
> > > and (also sync_core) checking very late/very early on exit/entry
> > > makes things easier to review.
> >
> > I don't know about late, it must happen *very* early in entry. The
> > sync_core() must happen before any self-modifying code gets called
> > (static_branch, static_call, etc..) with possible exception of the
> > context_tracking static_branch.
> >
> > The TLBi must also happen super early, possibly while still on the
> > entry stack (since the task stack is vmap'ed).
>
> But will it be ever be freed/remapped from other CPUs while the task
> is running?

Probably not, still something we need to be really careful with.

> > We currently don't run C
> > code on the entry stack, that needs quite a bit of careful work to make
> > happen.
>
> Was thinking of coding in ASM after (as early as possible) the write to
> switch to kernel CR3:

No, we're not going to add new feature to ASM. You'll just have to wait
until all that gets lifted to C.

> Kernel entry:
> -------------
>
>	cpu = smp_processor_id();
>
>	if (isolation_enabled(cpu)) {
>		reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
>		if (reqs & CPU_REQ_FLUSH_TLB)
>			flush_tlb_all();
>		if (reqs & CPU_REQ_SYNC_CORE)
>			sync_core();
>	}
>
> Exit to userspace (as close to write to CR3 with user pagetable
> pointer):
> -----------------
>
>	cpu = smp_processor_id();
>
>	if (isolation_enabled(cpu)) {
>		atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
>	}
>
> You think that is a bad idea (in ASM, not C) ?

Those atomics are a bad idea and not goig to happen.

> > We're not going to add an atomic to context tracking. There is one, we
> > just got to extract/share it with RCU.
>
> Again, to avoid kernel TLB flushes you'd have to ensure:

I know how it works, but we're not going to add a second atomic to
entry/exit. RCU has one in there, that's going to be it. Again, we just
got to extract/share.
Hi,

On 29.09.2021 17:17, Peter Zijlstra wrote:
> Simplify and make wake_up_if_idle() more robust, also don't iterate
> the whole machine with preempt_disable() in it's caller:
> wake_up_all_idle_cpus().
>
> This prepares for another wake_up_if_idle() user that needs a full
> do_idle() cycle.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

This patch landed recently in linux-next as commit 8850cb663b5c ("sched:
Simplify wake_up_*idle*()"). It causes the following warning on the
arm64 virt machine under qemu during the system suspend/resume cycle:

--->8---
 printk: Suspending console(s) (use no_console_suspend to debug)

 ============================================
 WARNING: possible recursive locking detected
 5.15.0-rc6-next-20211022 #10905 Not tainted
 --------------------------------------------
 rtcwake/1326 is trying to acquire lock:
 ffffd4e9192e8130 (cpu_hotplug_lock){++++}-{0:0}, at: wake_up_all_idle_cpus+0x24/0x98

 but task is already holding lock:
 ffffd4e9192e8130 (cpu_hotplug_lock){++++}-{0:0}, at: suspend_devices_and_enter+0x740/0x9f0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(cpu_hotplug_lock);
   lock(cpu_hotplug_lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 5 locks held by rtcwake/1326:
  #0: ffff54ad86a78438 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0x64/0xf0
  #1: ffff54ad84094a88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf4/0x1a8
  #2: ffff54ad83b17a88 (kn->active#43){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xfc/0x1a8
  #3: ffffd4e9192efab0 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend+0x214/0x3d0
  #4: ffffd4e9192e8130 (cpu_hotplug_lock){++++}-{0:0}, at: suspend_devices_and_enter+0x740/0x9f0

 stack backtrace:
 CPU: 0 PID: 1326 Comm: rtcwake Not tainted 5.15.0-rc6-next-20211022 #10905
 Hardware name: linux,dummy-virt (DT)
 Call trace:
  dump_backtrace+0x0/0x1d0
  show_stack+0x14/0x20
  dump_stack_lvl+0x88/0xb0
  dump_stack+0x14/0x2c
  __lock_acquire+0x171c/0x17b8
  lock_acquire+0x234/0x378
  cpus_read_lock+0x5c/0x150
  wake_up_all_idle_cpus+0x24/0x98
  suspend_devices_and_enter+0x748/0x9f0
  pm_suspend+0x2b0/0x3d0
  state_store+0x84/0x108
  kobj_attr_store+0x14/0x28
  sysfs_kf_write+0x60/0x70
  kernfs_fop_write_iter+0x124/0x1a8
  new_sync_write+0xe8/0x1b0
  vfs_write+0x1d0/0x408
  ksys_write+0x64/0xf0
  __arm64_sys_write+0x14/0x20
  invoke_syscall+0x40/0xf8
  el0_svc_common.constprop.3+0x8c/0x120
  do_el0_svc_compat+0x18/0x48
  el0_svc_compat+0x48/0x100
  el0t_32_sync_handler+0xec/0x140
  el0t_32_sync+0x170/0x174
 OOM killer enabled.
 Restarting tasks ... done.
 PM: suspend exit
--->8---

Let me know if there is anything I can help to debug and fix this issue.

> ---
>  kernel/sched/core.c | 14 +++++---------
>  kernel/smp.c        |  6 +++---
>  2 files changed, 8 insertions(+), 12 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3691,15 +3691,11 @@ void wake_up_if_idle(int cpu)
>  	if (!is_idle_task(rcu_dereference(rq->curr)))
>  		goto out;
>
> -	if (set_nr_if_polling(rq->idle)) {
> -		trace_sched_wake_idle_without_ipi(cpu);
> -	} else {
> -		rq_lock_irqsave(rq, &rf);
> -		if (is_idle_task(rq->curr))
> -			smp_send_reschedule(cpu);
> -		/* Else CPU is not idle, do nothing here: */
> -		rq_unlock_irqrestore(rq, &rf);
> -	}
> +	rq_lock_irqsave(rq, &rf);
> +	if (is_idle_task(rq->curr))
> +		resched_curr(rq);
> +	/* Else CPU is not idle, do nothing here: */
> +	rq_unlock_irqrestore(rq, &rf);
>
>  out:
>  	rcu_read_unlock();
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
>  {
>  	int cpu;
>
> -	preempt_disable();
> +	cpus_read_lock();
>  	for_each_online_cpu(cpu) {
> -		if (cpu == smp_processor_id())
> +		if (cpu == raw_smp_processor_id())
>  			continue;
>
>  		wake_up_if_idle(cpu);
>  	}
> -	preempt_enable();
> +	cpus_read_unlock();
>  }
>  EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
On Thu, Oct 21, 2021 at 10:18:59PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 21, 2021 at 04:57:09PM -0300, Marcelo Tosatti wrote:
> > > Pretty much everything in noinstr is magical, we just have to think
> > > harder there (and possibly start writing more comments there).
> >
> > mds_user_clear_cpu_buffers happens after sync_core, in your patchset,
> > if i am not mistaken.
>
> Of course it does, mds_user_clear_cpu_buffers() is on exit, the
> sync_core() is on entry.

                                            static_key enable/disable

__exit_to_user_mode ->                      context_tracking_set_cpu_work(cpu, work)
  user_enter_irqoff ->                        preempt_disable();
    __context_tracking_enter(CONTEXT_USER);   seq = atomic_read(&ct->seq);
      ct_seq_user_enter(raw_cpu_ptr(&context_tracking));
                                              if (__context_tracking_seq_in_user(seq)) {
      {                                           /* ctrl-dep */
        arch_atomic_set(&ct->work, 0);            atomic_or(work, &ct->work);
        return arch_atomic_add_return(            ret = atomic_try_cmpxchg(&ct->seq, &seq,
                 CT_SEQ_USER, &ct->seq);                                   seq|CT_SEQ_WORK);
      }                                       }
                                              preempt_enable();
arch_exit_to_user_mode()
  mds_user_clear_cpu_buffers();   <--- sync_core work queued,
                                       but not executed.
                                       i-cache potentially stale?

ct_seq_user_enter should happen _after_ all possible static_key users?
(or recheck that there is no pending work after any possible rewritable
code/static_key user).

> > > > > +	/* NMI happens here and must still do/finish CT_WORK_n */
> > > > > +	sync_core();
> > > >
> > > > But after the discussion with you, it seems doing the TLB checking
> > > > and (also sync_core) checking very late/very early on exit/entry
> > > > makes things easier to review.
> > >
> > > I don't know about late, it must happen *very* early in entry. The
> > > sync_core() must happen before any self-modifying code gets called
> > > (static_branch, static_call, etc..) with possible exception of the
> > > context_tracking static_branch.
> > >
> > > The TLBi must also happen super early, possibly while still on the
> > > entry stack (since the task stack is vmap'ed).
> >
> > But will it be ever be freed/remapped from other CPUs while the task
> > is running?
>
> Probably not, still something we need to be really careful with.
>
> > > We currently don't run C
> > > code on the entry stack, that needs quite a bit of careful work to make
> > > happen.
> >
> > Was thinking of coding in ASM after (as early as possible) the write to
> > switch to kernel CR3:
>
> No, we're not going to add new feature to ASM. You'll just have to wait
> until all that gets lifted to C.
>
> > Kernel entry:
> > -------------
> >
> >	cpu = smp_processor_id();
> >
> >	if (isolation_enabled(cpu)) {
> >		reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
> >		if (reqs & CPU_REQ_FLUSH_TLB)
> >			flush_tlb_all();
> >		if (reqs & CPU_REQ_SYNC_CORE)
> >			sync_core();
> >	}
> >
> > Exit to userspace (as close to write to CR3 with user pagetable
> > pointer):
> > -----------------
> >
> >	cpu = smp_processor_id();
> >
> >	if (isolation_enabled(cpu)) {
> >		atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
> >	}
> >
> > You think that is a bad idea (in ASM, not C) ?
>
> Those atomics are a bad idea and not goig to happen.
>
> > > We're not going to add an atomic to context tracking. There is one, we
> > > just got to extract/share it with RCU.
> >
> > Again, to avoid kernel TLB flushes you'd have to ensure:
>
> I know how it works, but we're not going to add a second atomic to
> entry/exit. RCU has one in there, that's going to be it. Again, we just
> got to extract/share.
On Tue, Oct 26, 2021 at 03:19:11PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 21, 2021 at 10:18:59PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 21, 2021 at 04:57:09PM -0300, Marcelo Tosatti wrote:
> > > > Pretty much everything in noinstr is magical, we just have to think
> > > > harder there (and possibly start writing more comments there).
> > >
> > > mds_user_clear_cpu_buffers happens after sync_core, in your patchset,
> > > if i am not mistaken.
> >
> > Of course it does, mds_user_clear_cpu_buffers() is on exit, the
> > sync_core() is on entry.
>
>                                             static_key enable/disable
>
> __exit_to_user_mode ->                      context_tracking_set_cpu_work(cpu, work)
>   user_enter_irqoff ->                        preempt_disable();
>     __context_tracking_enter(CONTEXT_USER);   seq = atomic_read(&ct->seq);
>       ct_seq_user_enter(raw_cpu_ptr(&context_tracking));
>                                               if (__context_tracking_seq_in_user(seq)) {
>       {                                           /* ctrl-dep */
>         arch_atomic_set(&ct->work, 0);            atomic_or(work, &ct->work);
>         return arch_atomic_add_return(            ret = atomic_try_cmpxchg(&ct->seq, &seq,
>                  CT_SEQ_USER, &ct->seq);                                   seq|CT_SEQ_WORK);
>       }                                       }
>                                               preempt_enable();
> arch_exit_to_user_mode()
>   mds_user_clear_cpu_buffers();   <--- sync_core work queued,
>                                        but not executed.
>                                        i-cache potentially stale?
>
> ct_seq_user_enter should happen _after_ all possible static_key users?

Right, so this one is actually okay, because that branch is *never*
changed after boot. I'm not quite sure why it isn't an alternative().

At some point I proposed static_call_lock() [1] and the corrolary is
static_branch_lock(), which I suppose could be employed here.

But I'm not sure that actually helps much with auditing all that.

[1] https://lkml.kernel.org/r/20210904105529.GA5106@worktop.programming.kicks-ass.net