* [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches
@ 2020-05-05 13:16 Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 01/36] rcu: Add comments marking transitions between RCU watching and not Thomas Gleixner
                   ` (36 more replies)
  0 siblings, 37 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Folks!

This is the hopefully final version of the rework of the entry and
exception code to ensure that instrumentation cannot touch the fragile
parts of the hardware-induced entry and exception code trainwreck. It
further ensures correctness vs. RCU and moves quite some code from
assembly into C.

V3 can be found here:

 https://lore.kernel.org/r/20200320175956.033706968@linutronix.de

The protection against instrumentation is based on moving the fragile code
parts into a special text section, .noinstr.text, which is excluded from any
form of instrumentation.
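
For illustration, placing code into that section boils down to a section
attribute on the affected functions. A minimal sketch only (the actual
'noinstr' annotation added later in this series carries further attributes
to disable tracing and the sanitizers, and the exact set is config
dependent):

   /* sketch only -- not the exact kernel definition */
   #define noinstr  __attribute__((__section__(".noinstr.text")))

   /* fragile entry/exception path functions are then marked like this */
   noinstr void handle_fragile_entry(void)
   {
           /* no tracepoints, kprobes or sanitizer calls may end up here */
   }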

The correctness is validated via objtool extensions. The necessary updates
to objtool are available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git objtool/core

The series has a total of 138 patches and is split into 5 parts. It's based
on v5.7-rc3 with the objtool/core and the locking/kcsan branches of the tip
tree merged on top. The base tree is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base

The full series with all parts applied is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v4-part-5

The first part, i.e. this series, is available from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v4-part-1

This part contains preparatory patches and fixes of various sorts which
have either been developed in the course of this project or have been
collected from previous versions and related discussions about the whole
entry vs. RCU vs. instrumentation correctness problem.

 - Prevention of breakpoints in the entry code
 - Splitting the scheduler IPI
 - Correct ordering of user space exit work
 - Cleanup and restriction of the async page fault handling
 - Kprobes support for noinstr sections in built-in and module code
 - The introduction of noinstr.text section
 - Preparatory work in tracing, lockdep and RCU
 - The nmi_enter() consolidation
 - Atomic fallback interaction

The largest patch of this series is the atomic fallback rework which is
necessary to address the interaction with KCSAN and pending work from Will
Deacon in that area. The header file is autogenerated, but actually
included in the tree because regenerating it on every build is too expensive.
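
Should the fallback header need to be regenerated, e.g. after changing the
templates under scripts/atomic/fallbacks/, the generator script in the tree
can presumably be re-run by hand from the top of the source tree (this is
not part of the normal build):

   sh scripts/atomic/gen-atomics.sh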

The objtool check for the noinstr.text correctness is not yet added to the
build machinery and has to be invoked manually for now:

   objtool check -fal vmlinux.o

The checking only works for built-in code as objtool cannot do a combined
analysis of vmlinux.o and a module.o.

Thanks,

	tglx

8<----------
 Documentation/trace/ftrace-design.rst        |    8 
 arch/arm64/include/asm/atomic.h              |    6 
 arch/arm64/include/asm/hardirq.h             |   78 
 arch/arm64/kernel/sdei.c                     |   14 
 arch/arm64/kernel/traps.c                    |    8 
 arch/powerpc/kernel/traps.c                  |   22 
 arch/sh/Kconfig                              |    1 
 arch/sh/kernel/traps.c                       |   12 
 arch/x86/entry/Makefile                      |    8 
 arch/x86/entry/common.c                      |    8 
 arch/x86/entry/entry_32.S                    |    8 
 arch/x86/entry/entry_64.S                    |    6 
 arch/x86/entry/thunk_64.S                    |    5 
 arch/x86/include/asm/atomic.h                |   17 
 arch/x86/include/asm/atomic64_32.h           |    9 
 arch/x86/include/asm/atomic64_64.h           |   15 
 arch/x86/include/asm/bug.h                   |    3 
 arch/x86/include/asm/irqflags.h              |   24 
 arch/x86/include/asm/kvm_para.h              |   23 
 arch/x86/include/asm/paravirt.h              |    2 
 arch/x86/include/asm/traps.h                 |    5 
 arch/x86/include/asm/x86_init.h              |    2 
 arch/x86/kernel/cpu/mce/core.c               |   65 
 arch/x86/kernel/cpu/mce/p5.c                 |    5 
 arch/x86/kernel/cpu/mce/winchip.c            |    5 
 arch/x86/kernel/hw_breakpoint.c              |   25 
 arch/x86/kernel/kvm.c                        |  158 -
 arch/x86/kernel/traps.c                      |  117 -
 arch/x86/kernel/tsc.c                        |    4 
 arch/x86/kernel/x86_init.c                   |    1 
 arch/x86/kvm/mmu/mmu.c                       |    2 
 arch/x86/mm/fault.c                          |   19 
 include/asm-generic/bug.h                    |    9 
 include/asm-generic/sections.h               |    3 
 include/asm-generic/vmlinux.lds.h            |    4 
 include/linux/atomic-arch-fallback.h         | 2291 +++++++++++++++++++++++++++
 include/linux/atomic-fallback.h              |    8 
 include/linux/atomic.h                       |   11 
 include/linux/compiler.h                     |   25 
 include/linux/compiler_types.h               |    4 
 include/linux/ftrace_irq.h                   |   15 
 include/linux/hardirq.h                      |   18 
 include/linux/irqflags.h                     |    6 
 include/linux/lockdep.h                      |   23 
 include/linux/module.h                       |    8 
 include/linux/preempt.h                      |    4 
 include/linux/sched.h                        |   17 
 kernel/kprobes.c                             |   85 -
 kernel/locking/lockdep.c                     |   89 -
 kernel/module.c                              |   10 
 kernel/panic.c                               |    4 
 kernel/printk/internal.h                     |    8 
 kernel/printk/printk_safe.c                  |    9 
 kernel/rcu/tree.c                            |  139 -
 kernel/rcu/tree_plugin.h                     |    4 
 kernel/rcu/update.c                          |    7 
 kernel/sched/core.c                          |   66 
 kernel/sched/fair.c                          |    5 
 kernel/sched/sched.h                         |    6 
 kernel/trace/Kconfig                         |   10 
 kernel/trace/trace_clock.c                   |    3 
 kernel/trace/trace_hwlat.c                   |    2 
 kernel/trace/trace_preemptirq.c              |   39 
 lib/debug_locks.c                            |    2 
 samples/kprobes/kprobe_example.c             |    6 
 samples/kprobes/kretprobe_example.c          |    2 
 scripts/atomic/fallbacks/acquire             |    4 
 scripts/atomic/fallbacks/add_negative        |    6 
 scripts/atomic/fallbacks/add_unless          |    6 
 scripts/atomic/fallbacks/andnot              |    4 
 scripts/atomic/fallbacks/dec                 |    4 
 scripts/atomic/fallbacks/dec_and_test        |    6 
 scripts/atomic/fallbacks/dec_if_positive     |    6 
 scripts/atomic/fallbacks/dec_unless_positive |    6 
 scripts/atomic/fallbacks/fence               |    4 
 scripts/atomic/fallbacks/fetch_add_unless    |    8 
 scripts/atomic/fallbacks/inc                 |    4 
 scripts/atomic/fallbacks/inc_and_test        |    6 
 scripts/atomic/fallbacks/inc_not_zero        |    6 
 scripts/atomic/fallbacks/inc_unless_negative |    6 
 scripts/atomic/fallbacks/read_acquire        |    2 
 scripts/atomic/fallbacks/release             |    4 
 scripts/atomic/fallbacks/set_release         |    2 
 scripts/atomic/fallbacks/sub_and_test        |    6 
 scripts/atomic/fallbacks/try_cmpxchg         |    4 
 scripts/atomic/gen-atomic-fallback.sh        |   29 
 scripts/atomic/gen-atomics.sh                |    5 
 scripts/mod/modpost.c                        |    2 
 88 files changed, 3146 insertions(+), 611 deletions(-)

* [patch V4 part 1 01/36] rcu: Add comments marking transitions between RCU watching and not
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area Thomas Gleixner
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: "Paul E. McKenney" <paulmck@kernel.org>

It is not as clear as it might be just where in RCU's idle entry/exit
code RCU stops and starts watching the current CPU.  This commit therefore
adds comments calling out the transitions.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313024046.27622-2-paulmck@kernel.org

---
 kernel/rcu/tree.c |   29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -225,7 +225,9 @@ void rcu_softirq_qs(void)
 
 /*
  * Record entry into an extended quiescent state.  This is only to be
- * called when not already in an extended quiescent state.
+ * called when not already in an extended quiescent state, that is,
+ * RCU is watching prior to the call to this function and is no longer
+ * watching upon return.
  */
 static void rcu_dynticks_eqs_enter(void)
 {
@@ -238,7 +240,7 @@ static void rcu_dynticks_eqs_enter(void)
 	 * next idle sojourn.
 	 */
 	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
-	/* Better be in an extended quiescent state! */
+	// RCU is no longer watching.  Better be in extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     (seq & RCU_DYNTICK_CTRL_CTR));
 	/* Better not have special action (TLB flush) pending! */
@@ -248,7 +250,8 @@ static void rcu_dynticks_eqs_enter(void)
 
 /*
  * Record exit from an extended quiescent state.  This is only to be
- * called from an extended quiescent state.
+ * called from an extended quiescent state, that is, RCU is not watching
+ * prior to the call to this function and is watching upon return.
  */
 static void rcu_dynticks_eqs_exit(void)
 {
@@ -261,6 +264,7 @@ static void rcu_dynticks_eqs_exit(void)
 	 * critical section.
 	 */
 	seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdp->dynticks);
+	// RCU is now watching.  Better not be in an extended quiescent state!
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     !(seq & RCU_DYNTICK_CTRL_CTR));
 	if (seq & RCU_DYNTICK_CTRL_MASK) {
@@ -571,6 +575,7 @@ static void rcu_eqs_enter(bool user)
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
 		     rdp->dynticks_nesting == 0);
 	if (rdp->dynticks_nesting != 1) {
+		// RCU will still be watching, so just do accounting and leave.
 		rdp->dynticks_nesting--;
 		return;
 	}
@@ -583,7 +588,9 @@ static void rcu_eqs_enter(bool user)
 	rcu_prepare_for_idle();
 	rcu_preempt_deferred_qs(current);
 	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
+	// RCU is watching here ...
 	rcu_dynticks_eqs_enter();
+	// ... but is no longer watching here.
 	rcu_dynticks_task_enter();
 }
 
@@ -663,7 +670,9 @@ static __always_inline void rcu_nmi_exit
 	if (irq)
 		rcu_prepare_for_idle();
 
+	// RCU is watching here ...
 	rcu_dynticks_eqs_enter();
+	// ... but is no longer watching here.
 
 	if (irq)
 		rcu_dynticks_task_enter();
@@ -738,11 +747,14 @@ static void rcu_eqs_exit(bool user)
 	oldval = rdp->dynticks_nesting;
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && oldval < 0);
 	if (oldval) {
+		// RCU was already watching, so just do accounting and leave.
 		rdp->dynticks_nesting++;
 		return;
 	}
 	rcu_dynticks_task_exit();
+	// RCU is not watching here ...
 	rcu_dynticks_eqs_exit();
+	// ... but is watching here.
 	rcu_cleanup_after_idle();
 	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
@@ -819,7 +831,9 @@ static __always_inline void rcu_nmi_ente
 		if (irq)
 			rcu_dynticks_task_exit();
 
+		// RCU is not watching here ...
 		rcu_dynticks_eqs_exit();
+		// ... but is watching here.
 
 		if (irq)
 			rcu_cleanup_after_idle();
@@ -829,9 +843,16 @@ static __always_inline void rcu_nmi_ente
 		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
 		   READ_ONCE(rdp->rcu_urgent_qs) &&
 		   !READ_ONCE(rdp->rcu_forced_tick)) {
+		// We get here only if we had already exited the extended
+		// quiescent state and this was an interrupt (not an NMI).
+		// Therefore, (1) RCU is already watching and (2) The fact
+		// that we are in an interrupt handler and that the rcu_node
+		// lock is an irq-disabled lock prevents self-deadlock.
+		// So we can safely recheck under the lock.
 		raw_spin_lock_rcu_node(rdp->mynode);
-		// Recheck under lock.
 		if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+			// A nohz_full CPU is in the kernel and RCU
+			// needs a quiescent state.  Turn on the tick!
 			WRITE_ONCE(rdp->rcu_forced_tick, true);
 			tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
 		}


* [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 01/36] rcu: Add comments marking transitions between RCU watching and not Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06  8:14   ` Borislav Petkov
                     ` (4 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 03/36] sched: Clean up scheduler_ipi() Thomas Gleixner
                   ` (34 subsequent siblings)
  36 siblings, 5 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Andy Lutomirski <luto@kernel.org>

A data breakpoint near the top of an IST stack will cause unresolvable
recursion.  A data breakpoint on the GDT, IDT, or TSS is terrifying.
Prevent either of these from happening.

Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/hw_breakpoint.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -227,10 +227,35 @@ int arch_check_bp_in_kernelspace(struct
 	return (va >= TASK_SIZE_MAX) || ((va + len - 1) >= TASK_SIZE_MAX);
 }
 
+/*
+ * Checks whether the range from addr to end, inclusive, overlaps the CPU
+ * entry area range.
+ */
+static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
+{
+	return end >= CPU_ENTRY_AREA_PER_CPU &&
+	       addr < (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOTAL_SIZE);
+}
+
 static int arch_build_bp_info(struct perf_event *bp,
 			      const struct perf_event_attr *attr,
 			      struct arch_hw_breakpoint *hw)
 {
+	unsigned long bp_end;
+
+	bp_end = attr->bp_addr + attr->bp_len - 1;
+	if (bp_end < attr->bp_addr)
+		return -EINVAL;
+
+	/*
+	 * Prevent any breakpoint of any type that overlaps the
+	 * cpu_entry_area.  This protects the IST stacks and also
+	 * reduces the chance that we ever find out what happens if
+	 * there's a data breakpoint on the GDT, IDT, or TSS.
+	 */
+	if (within_cpu_entry_area(attr->bp_addr, bp_end))
+		return -EINVAL;
+
 	hw->address = attr->bp_addr;
 	hw->mask = 0;
 


* [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 01/36] rcu: Add comments marking transitions between RCU watching and not Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06  8:32   ` Thomas Gleixner
                     ` (3 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 04/36] sched: Make scheduler_ipi inline Thomas Gleixner
                   ` (33 subsequent siblings)
  36 siblings, 4 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

The scheduler IPI has grown weird and wonderful over the years, time
for spring cleaning.

Move all the non-trivial stuff out of it and into a regular smp function
call IPI. This then reduces scheduler_ipi() to most of its former NOP
glory and keeps the interrupt vector lean and mean.

Aside from that, avoiding the full irq_enter() in the x86 IPI
implementation is incorrect as scheduler_ipi() can be instrumented. To
work around that, scheduler_ipi() had an irq_enter/exit() hack for when
heavy work was pending. This is gone now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c  |   64 +++++++++++++++++++++++----------------------------
 kernel/sched/fair.c  |    5 +--
 kernel/sched/sched.h |    6 +++-
 3 files changed, 36 insertions(+), 39 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -219,6 +219,13 @@ void update_rq_clock(struct rq *rq)
 	update_rq_clock_task(rq, delta);
 }
 
+static inline void
+rq_csd_init(struct rq *rq, call_single_data_t *csd, smp_call_func_t func)
+{
+	csd->flags = 0;
+	csd->func = func;
+	csd->info = rq;
+}
 
 #ifdef CONFIG_SCHED_HRTICK
 /*
@@ -314,16 +321,14 @@ void hrtick_start(struct rq *rq, u64 del
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
 		      HRTIMER_MODE_REL_PINNED_HARD);
 }
+
 #endif /* CONFIG_SMP */
 
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-	rq->hrtick_csd.flags = 0;
-	rq->hrtick_csd.func = __hrtick_start;
-	rq->hrtick_csd.info = rq;
+	rq_csd_init(rq, &rq->hrtick_csd, __hrtick_start);
 #endif
-
 	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
 	rq->hrtick_timer.function = hrtick;
 }
@@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(vo
 	return false;
 }
 
+static void nohz_csd_func(void *info)
+{
+	struct rq *rq = info;
+
+	if (got_nohz_idle_kick()) {
+		rq->idle_balance = 1;
+		raise_softirq_irqoff(SCHED_SOFTIRQ);
+	}
+}
+
 #else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
@@ -2292,6 +2307,11 @@ void sched_ttwu_pending(void)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
+static void wake_csd_func(void *info)
+{
+	sched_ttwu_pending();
+}
+
 void scheduler_ipi(void)
 {
 	/*
@@ -2300,34 +2320,6 @@ void scheduler_ipi(void)
 	 * this IPI.
 	 */
 	preempt_fold_need_resched();
-
-	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
-		return;
-
-	/*
-	 * Not all reschedule IPI handlers call irq_enter/irq_exit, since
-	 * traditionally all their work was done from the interrupt return
-	 * path. Now that we actually do some work, we need to make sure
-	 * we do call them.
-	 *
-	 * Some archs already do call them, luckily irq_enter/exit nest
-	 * properly.
-	 *
-	 * Arguably we should visit all archs and update all handlers,
-	 * however a fair share of IPIs are still resched only so this would
-	 * somewhat pessimize the simple resched case.
-	 */
-	irq_enter();
-	sched_ttwu_pending();
-
-	/*
-	 * Check if someone kicked us for doing the nohz idle load balance.
-	 */
-	if (unlikely(got_nohz_idle_kick())) {
-		this_rq()->idle_balance = 1;
-		raise_softirq_irqoff(SCHED_SOFTIRQ);
-	}
-	irq_exit();
 }
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
@@ -2336,9 +2328,9 @@ static void ttwu_queue_remote(struct tas
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
-			smp_send_reschedule(cpu);
+			smp_call_function_single_async(cpu, &rq->wake_csd);
 		else
 			trace_sched_wake_idle_without_ipi(cpu);
 	}
@@ -6685,12 +6677,16 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
+		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
+
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
 		rq->last_blocked_load_update_tick = jiffies;
 		atomic_set(&rq->nohz_flags, 0);
+
+		rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10009,12 +10009,11 @@ static void kick_ilb(unsigned int flags)
 		return;
 
 	/*
-	 * Use smp_send_reschedule() instead of resched_cpu().
-	 * This way we generate a sched IPI on the target CPU which
+	 * This way we generate an IPI on the target CPU which
 	 * is idle. And the softirq performing nohz idle load balance
 	 * will be run before returning from the IPI.
 	 */
-	smp_send_reschedule(ilb_cpu);
+	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->wake_csd);
 }
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -890,9 +890,10 @@ struct rq {
 #ifdef CONFIG_SMP
 	unsigned long		last_blocked_load_update_tick;
 	unsigned int		has_blocked_load;
+	call_single_data_t	nohz_csd;
 #endif /* CONFIG_SMP */
 	unsigned int		nohz_tick_stopped;
-	atomic_t nohz_flags;
+	atomic_t		nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
 
 	unsigned long		nr_load_updates;
@@ -979,7 +980,7 @@ struct rq {
 
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
-#endif
+#endif /* CONFIG_SMP */
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
@@ -1021,6 +1022,7 @@ struct rq {
 #endif
 
 #ifdef CONFIG_SMP
+	call_single_data_t	wake_csd;
 	struct llist_head	wake_list;
 #endif
 


* [patch V4 part 1 04/36] sched: Make scheduler_ipi inline
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (2 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 03/36] sched: Clean up scheduler_ipi() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 12:42   ` Alexandre Chartre
  2020-05-12 15:13   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
                   ` (32 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Now that the scheduler IPI is trivial and simple again there is no point
in keeping the little function out of line. This simplifies the effort of
constraining the instrumentation nicely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched.h |   10 +++++++++-
 kernel/sched/core.c   |   10 ----------
 2 files changed, 9 insertions(+), 11 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1719,7 +1719,15 @@ extern char *__get_task_comm(char *to, s
 })
 
 #ifdef CONFIG_SMP
-void scheduler_ipi(void);
+static __always_inline void scheduler_ipi(void)
+{
+	/*
+	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
+	 * TIF_NEED_RESCHED remotely (for the first time) will also send
+	 * this IPI.
+	 */
+	preempt_fold_need_resched();
+}
 extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
 #else
 static inline void scheduler_ipi(void) { }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,16 +2312,6 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
-void scheduler_ipi(void)
-{
-	/*
-	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
-	 * TIF_NEED_RESCHED remotely (for the first time) will also send
-	 * this IPI.
-	 */
-	preempt_fold_need_resched();
-}
-
 static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);


* [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (3 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 04/36] sched: Make scheduler_ipi inline Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 11:53   ` Miroslav Benes
                     ` (4 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations Thomas Gleixner
                   ` (31 subsequent siblings)
  36 siblings, 5 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

Make sure task_work runs before any kind of userspace -- very much
including signals -- is invoked.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/common.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -156,16 +156,16 @@ static void exit_to_usermode_loop(struct
 		if (cached_flags & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
-		/* deal with pending signal delivery */
-		if (cached_flags & _TIF_SIGPENDING)
-			do_signal(regs);
-
 		if (cached_flags & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
 			rseq_handle_notify_resume(NULL, regs);
 		}
 
+		/* deal with pending signal delivery */
+		if (cached_flags & _TIF_SIGPENDING)
+			do_signal(regs);
+
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 


* [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (4 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 13:11   ` Alexandre Chartre
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation Thomas Gleixner
                   ` (30 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

READ/WRITE_ONCE_NOCHECK() is required for atomics in code which cannot be
instrumented, such as the x86 int3 text poke code. As READ/WRITE_ONCE() is
undergoing a rewrite, provide __{READ,WRITE}_ONCE_SCALAR().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/compiler.h |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -313,6 +313,14 @@ unsigned long read_word_at_a_time(const
 	__u.__val;					\
 })
 
+#define __READ_ONCE_SCALAR(x)				\
+	(*(const volatile typeof(x) *)&(x))
+
+#define __WRITE_ONCE_SCALAR(x, val)			\
+do {							\
+	*(volatile typeof(x) *)&(x) = val;		\
+} while (0)
+
 /**
  * data_race - mark an expression as containing intentional data races
  *


* [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (5 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-05 16:04   ` Mark Rutland
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 08/36] x86/doublefault: Remove memmove() call Thomas Gleixner
                   ` (29 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel),
	Mark Rutland

Currently instrumentation of atomic primitives is done at the
architecture level, while composites or fallbacks are provided at the
generic level.

The result is that there are no uninstrumented variants of the
fallbacks. Since there is now a need for such variants (see the next
patch), invert this ordering.

Doing this means moving the instrumentation into the generic code as
well as having (for now) two variants of the fallbacks.

Notes:

 - the various *cond_read* primitives are not proper fallbacks
   and got moved into linux/atomic.h. No arch_ variants are
   generated because the base primitives smp_cond_load*()
   are instrumented.

 - once all architectures are moved over to arch_atomic_ we can remove
   one of the fallback variants and reclaim some 2300 lines.

 - atomic_{read,set}*() are no longer double-instrumented

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
---
 arch/arm64/include/asm/atomic.h              |    6 
 arch/x86/include/asm/atomic.h                |   17 
 arch/x86/include/asm/atomic64_32.h           |    9 
 arch/x86/include/asm/atomic64_64.h           |   15 
 include/linux/atomic-arch-fallback.h         | 2291 +++++++++++++++++++++++++++
 include/linux/atomic-fallback.h              |    8 
 include/linux/atomic.h                       |   11 
 scripts/atomic/fallbacks/acquire             |    4 
 scripts/atomic/fallbacks/add_negative        |    6 
 scripts/atomic/fallbacks/add_unless          |    6 
 scripts/atomic/fallbacks/andnot              |    4 
 scripts/atomic/fallbacks/dec                 |    4 
 scripts/atomic/fallbacks/dec_and_test        |    6 
 scripts/atomic/fallbacks/dec_if_positive     |    6 
 scripts/atomic/fallbacks/dec_unless_positive |    6 
 scripts/atomic/fallbacks/fence               |    4 
 scripts/atomic/fallbacks/fetch_add_unless    |    8 
 scripts/atomic/fallbacks/inc                 |    4 
 scripts/atomic/fallbacks/inc_and_test        |    6 
 scripts/atomic/fallbacks/inc_not_zero        |    6 
 scripts/atomic/fallbacks/inc_unless_negative |    6 
 scripts/atomic/fallbacks/read_acquire        |    2 
 scripts/atomic/fallbacks/release             |    4 
 scripts/atomic/fallbacks/set_release         |    2 
 scripts/atomic/fallbacks/sub_and_test        |    6 
 scripts/atomic/fallbacks/try_cmpxchg         |    4 
 scripts/atomic/gen-atomic-fallback.sh        |   29 
 scripts/atomic/gen-atomics.sh                |    5 
 28 files changed, 2403 insertions(+), 82 deletions(-)

--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -101,8 +101,8 @@ static inline long arch_atomic64_dec_if_
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define arch_atomic_read(v)			READ_ONCE((v)->counter)
-#define arch_atomic_set(v, i)			WRITE_ONCE(((v)->counter), (i))
+#define arch_atomic_read(v)			__READ_ONCE_SCALAR((v)->counter)
+#define arch_atomic_set(v, i)			__WRITE_ONCE_SCALAR(((v)->counter), (i))
 
 #define arch_atomic_add_return_relaxed		arch_atomic_add_return_relaxed
 #define arch_atomic_add_return_acquire		arch_atomic_add_return_acquire
@@ -225,6 +225,6 @@ static inline long arch_atomic64_dec_if_
 
 #define arch_atomic64_dec_if_positive		arch_atomic64_dec_if_positive
 
-#include <asm-generic/atomic-instrumented.h>
+#define ARCH_ATOMIC
 
 #endif /* __ASM_ATOMIC_H */
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -28,7 +28,7 @@ static __always_inline int arch_atomic_r
 	 * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
 	 * it's non-inlined function that increases binary size and stack usage.
 	 */
-	return READ_ONCE((v)->counter);
+	return __READ_ONCE_SCALAR((v)->counter);
 }
 
 /**
@@ -40,7 +40,7 @@ static __always_inline int arch_atomic_r
  */
 static __always_inline void arch_atomic_set(atomic_t *v, int i)
 {
-	WRITE_ONCE(v->counter, i);
+	__WRITE_ONCE_SCALAR(v->counter, i);
 }
 
 /**
@@ -166,6 +166,7 @@ static __always_inline int arch_atomic_a
 {
 	return i + xadd(&v->counter, i);
 }
+#define arch_atomic_add_return arch_atomic_add_return
 
 /**
  * arch_atomic_sub_return - subtract integer and return
@@ -178,32 +179,37 @@ static __always_inline int arch_atomic_s
 {
 	return arch_atomic_add_return(-i, v);
 }
+#define arch_atomic_sub_return arch_atomic_sub_return
 
 static __always_inline int arch_atomic_fetch_add(int i, atomic_t *v)
 {
 	return xadd(&v->counter, i);
 }
+#define arch_atomic_fetch_add arch_atomic_fetch_add
 
 static __always_inline int arch_atomic_fetch_sub(int i, atomic_t *v)
 {
 	return xadd(&v->counter, -i);
 }
+#define arch_atomic_fetch_sub arch_atomic_fetch_sub
 
 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
 {
 	return arch_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
 
-#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
 static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
 {
 	return try_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
 
 static inline int arch_atomic_xchg(atomic_t *v, int new)
 {
 	return arch_xchg(&v->counter, new);
 }
+#define arch_atomic_xchg arch_atomic_xchg
 
 static inline void arch_atomic_and(int i, atomic_t *v)
 {
@@ -221,6 +227,7 @@ static inline int arch_atomic_fetch_and(
 
 	return val;
 }
+#define arch_atomic_fetch_and arch_atomic_fetch_and
 
 static inline void arch_atomic_or(int i, atomic_t *v)
 {
@@ -238,6 +245,7 @@ static inline int arch_atomic_fetch_or(i
 
 	return val;
 }
+#define arch_atomic_fetch_or arch_atomic_fetch_or
 
 static inline void arch_atomic_xor(int i, atomic_t *v)
 {
@@ -255,6 +263,7 @@ static inline int arch_atomic_fetch_xor(
 
 	return val;
 }
+#define arch_atomic_fetch_xor arch_atomic_fetch_xor
 
 #ifdef CONFIG_X86_32
 # include <asm/atomic64_32.h>
@@ -262,6 +271,6 @@ static inline int arch_atomic_fetch_xor(
 # include <asm/atomic64_64.h>
 #endif
 
-#include <asm-generic/atomic-instrumented.h>
+#define ARCH_ATOMIC
 
 #endif /* _ASM_X86_ATOMIC_H */
--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -75,6 +75,7 @@ static inline s64 arch_atomic64_cmpxchg(
 {
 	return arch_cmpxchg64(&v->counter, o, n);
 }
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
 
 /**
  * arch_atomic64_xchg - xchg atomic64 variable
@@ -94,6 +95,7 @@ static inline s64 arch_atomic64_xchg(ato
 			     : "memory");
 	return o;
 }
+#define arch_atomic64_xchg arch_atomic64_xchg
 
 /**
  * arch_atomic64_set - set atomic64 variable
@@ -138,6 +140,7 @@ static inline s64 arch_atomic64_add_retu
 			     ASM_NO_INPUT_CLOBBER("memory"));
 	return i;
 }
+#define arch_atomic64_add_return arch_atomic64_add_return
 
 /*
  * Other variants with different arithmetic operators:
@@ -149,6 +152,7 @@ static inline s64 arch_atomic64_sub_retu
 			     ASM_NO_INPUT_CLOBBER("memory"));
 	return i;
 }
+#define arch_atomic64_sub_return arch_atomic64_sub_return
 
 static inline s64 arch_atomic64_inc_return(atomic64_t *v)
 {
@@ -242,6 +246,7 @@ static inline int arch_atomic64_add_unle
 			     "S" (v) : "memory");
 	return (int)a;
 }
+#define arch_atomic64_add_unless arch_atomic64_add_unless
 
 static inline int arch_atomic64_inc_not_zero(atomic64_t *v)
 {
@@ -281,6 +286,7 @@ static inline s64 arch_atomic64_fetch_an
 
 	return old;
 }
+#define arch_atomic64_fetch_and arch_atomic64_fetch_and
 
 static inline void arch_atomic64_or(s64 i, atomic64_t *v)
 {
@@ -299,6 +305,7 @@ static inline s64 arch_atomic64_fetch_or
 
 	return old;
 }
+#define arch_atomic64_fetch_or arch_atomic64_fetch_or
 
 static inline void arch_atomic64_xor(s64 i, atomic64_t *v)
 {
@@ -317,6 +324,7 @@ static inline s64 arch_atomic64_fetch_xo
 
 	return old;
 }
+#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
 
 static inline s64 arch_atomic64_fetch_add(s64 i, atomic64_t *v)
 {
@@ -327,6 +335,7 @@ static inline s64 arch_atomic64_fetch_ad
 
 	return old;
 }
+#define arch_atomic64_fetch_add arch_atomic64_fetch_add
 
 #define arch_atomic64_fetch_sub(i, v)	arch_atomic64_fetch_add(-(i), (v))
 
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -19,7 +19,7 @@
  */
 static inline s64 arch_atomic64_read(const atomic64_t *v)
 {
-	return READ_ONCE((v)->counter);
+	return __READ_ONCE_SCALAR((v)->counter);
 }
 
 /**
@@ -31,7 +31,7 @@ static inline s64 arch_atomic64_read(con
  */
 static inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
-	WRITE_ONCE(v->counter, i);
+	__WRITE_ONCE_SCALAR(v->counter, i);
 }
 
 /**
@@ -159,37 +159,43 @@ static __always_inline s64 arch_atomic64
 {
 	return i + xadd(&v->counter, i);
 }
+#define arch_atomic64_add_return arch_atomic64_add_return
 
 static inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 {
 	return arch_atomic64_add_return(-i, v);
 }
+#define arch_atomic64_sub_return arch_atomic64_sub_return
 
 static inline s64 arch_atomic64_fetch_add(s64 i, atomic64_t *v)
 {
 	return xadd(&v->counter, i);
 }
+#define arch_atomic64_fetch_add arch_atomic64_fetch_add
 
 static inline s64 arch_atomic64_fetch_sub(s64 i, atomic64_t *v)
 {
 	return xadd(&v->counter, -i);
 }
+#define arch_atomic64_fetch_sub arch_atomic64_fetch_sub
 
 static inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
 {
 	return arch_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
 
-#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
 static __always_inline bool arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
 {
 	return try_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
 
 static inline s64 arch_atomic64_xchg(atomic64_t *v, s64 new)
 {
 	return arch_xchg(&v->counter, new);
 }
+#define arch_atomic64_xchg arch_atomic64_xchg
 
 static inline void arch_atomic64_and(s64 i, atomic64_t *v)
 {
@@ -207,6 +213,7 @@ static inline s64 arch_atomic64_fetch_an
 	} while (!arch_atomic64_try_cmpxchg(v, &val, val & i));
 	return val;
 }
+#define arch_atomic64_fetch_and arch_atomic64_fetch_and
 
 static inline void arch_atomic64_or(s64 i, atomic64_t *v)
 {
@@ -224,6 +231,7 @@ static inline s64 arch_atomic64_fetch_or
 	} while (!arch_atomic64_try_cmpxchg(v, &val, val | i));
 	return val;
 }
+#define arch_atomic64_fetch_or arch_atomic64_fetch_or
 
 static inline void arch_atomic64_xor(s64 i, atomic64_t *v)
 {
@@ -241,5 +249,6 @@ static inline s64 arch_atomic64_fetch_xo
 	} while (!arch_atomic64_try_cmpxchg(v, &val, val ^ i));
 	return val;
 }
+#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
 
 #endif /* _ASM_X86_ATOMIC64_64_H */
--- /dev/null
+++ b/include/linux/atomic-arch-fallback.h
@@ -0,0 +1,2291 @@
+// SPDX-License-Identifier: GPL-2.0
+
+// Generated by scripts/atomic/gen-atomic-fallback.sh
+// DO NOT MODIFY THIS FILE DIRECTLY
+
+#ifndef _LINUX_ATOMIC_FALLBACK_H
+#define _LINUX_ATOMIC_FALLBACK_H
+
+#include <linux/compiler.h>
+
+#ifndef arch_xchg_relaxed
+#define arch_xchg_relaxed		arch_xchg
+#define arch_xchg_acquire		arch_xchg
+#define arch_xchg_release		arch_xchg
+#else /* arch_xchg_relaxed */
+
+#ifndef arch_xchg_acquire
+#define arch_xchg_acquire(...) \
+	__atomic_op_acquire(arch_xchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_xchg_release
+#define arch_xchg_release(...) \
+	__atomic_op_release(arch_xchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_xchg
+#define arch_xchg(...) \
+	__atomic_op_fence(arch_xchg, __VA_ARGS__)
+#endif
+
+#endif /* arch_xchg_relaxed */
+
+#ifndef arch_cmpxchg_relaxed
+#define arch_cmpxchg_relaxed		arch_cmpxchg
+#define arch_cmpxchg_acquire		arch_cmpxchg
+#define arch_cmpxchg_release		arch_cmpxchg
+#else /* arch_cmpxchg_relaxed */
+
+#ifndef arch_cmpxchg_acquire
+#define arch_cmpxchg_acquire(...) \
+	__atomic_op_acquire(arch_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg_release
+#define arch_cmpxchg_release(...) \
+	__atomic_op_release(arch_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg
+#define arch_cmpxchg(...) \
+	__atomic_op_fence(arch_cmpxchg, __VA_ARGS__)
+#endif
+
+#endif /* arch_cmpxchg_relaxed */
+
+#ifndef arch_cmpxchg64_relaxed
+#define arch_cmpxchg64_relaxed		arch_cmpxchg64
+#define arch_cmpxchg64_acquire		arch_cmpxchg64
+#define arch_cmpxchg64_release		arch_cmpxchg64
+#else /* arch_cmpxchg64_relaxed */
+
+#ifndef arch_cmpxchg64_acquire
+#define arch_cmpxchg64_acquire(...) \
+	__atomic_op_acquire(arch_cmpxchg64, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg64_release
+#define arch_cmpxchg64_release(...) \
+	__atomic_op_release(arch_cmpxchg64, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg64
+#define arch_cmpxchg64(...) \
+	__atomic_op_fence(arch_cmpxchg64, __VA_ARGS__)
+#endif
+
+#endif /* arch_cmpxchg64_relaxed */
+
+#ifndef arch_atomic_read_acquire
+static __always_inline int
+arch_atomic_read_acquire(const atomic_t *v)
+{
+	return smp_load_acquire(&(v)->counter);
+}
+#define arch_atomic_read_acquire arch_atomic_read_acquire
+#endif
+
+#ifndef arch_atomic_set_release
+static __always_inline void
+arch_atomic_set_release(atomic_t *v, int i)
+{
+	smp_store_release(&(v)->counter, i);
+}
+#define arch_atomic_set_release arch_atomic_set_release
+#endif
+
+#ifndef arch_atomic_add_return_relaxed
+#define arch_atomic_add_return_acquire arch_atomic_add_return
+#define arch_atomic_add_return_release arch_atomic_add_return
+#define arch_atomic_add_return_relaxed arch_atomic_add_return
+#else /* arch_atomic_add_return_relaxed */
+
+#ifndef arch_atomic_add_return_acquire
+static __always_inline int
+arch_atomic_add_return_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_add_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_add_return_acquire arch_atomic_add_return_acquire
+#endif
+
+#ifndef arch_atomic_add_return_release
+static __always_inline int
+arch_atomic_add_return_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_add_return_relaxed(i, v);
+}
+#define arch_atomic_add_return_release arch_atomic_add_return_release
+#endif
+
+#ifndef arch_atomic_add_return
+static __always_inline int
+arch_atomic_add_return(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_add_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_add_return arch_atomic_add_return
+#endif
+
+#endif /* arch_atomic_add_return_relaxed */
+
+#ifndef arch_atomic_fetch_add_relaxed
+#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add
+#define arch_atomic_fetch_add_release arch_atomic_fetch_add
+#define arch_atomic_fetch_add_relaxed arch_atomic_fetch_add
+#else /* arch_atomic_fetch_add_relaxed */
+
+#ifndef arch_atomic_fetch_add_acquire
+static __always_inline int
+arch_atomic_fetch_add_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_add_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add_acquire
+#endif
+
+#ifndef arch_atomic_fetch_add_release
+static __always_inline int
+arch_atomic_fetch_add_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_add_relaxed(i, v);
+}
+#define arch_atomic_fetch_add_release arch_atomic_fetch_add_release
+#endif
+
+#ifndef arch_atomic_fetch_add
+static __always_inline int
+arch_atomic_fetch_add(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_add_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_add arch_atomic_fetch_add
+#endif
+
+#endif /* arch_atomic_fetch_add_relaxed */
+
+#ifndef arch_atomic_sub_return_relaxed
+#define arch_atomic_sub_return_acquire arch_atomic_sub_return
+#define arch_atomic_sub_return_release arch_atomic_sub_return
+#define arch_atomic_sub_return_relaxed arch_atomic_sub_return
+#else /* arch_atomic_sub_return_relaxed */
+
+#ifndef arch_atomic_sub_return_acquire
+static __always_inline int
+arch_atomic_sub_return_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_sub_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_sub_return_acquire arch_atomic_sub_return_acquire
+#endif
+
+#ifndef arch_atomic_sub_return_release
+static __always_inline int
+arch_atomic_sub_return_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_sub_return_relaxed(i, v);
+}
+#define arch_atomic_sub_return_release arch_atomic_sub_return_release
+#endif
+
+#ifndef arch_atomic_sub_return
+static __always_inline int
+arch_atomic_sub_return(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_sub_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_sub_return arch_atomic_sub_return
+#endif
+
+#endif /* arch_atomic_sub_return_relaxed */
+
+#ifndef arch_atomic_fetch_sub_relaxed
+#define arch_atomic_fetch_sub_acquire arch_atomic_fetch_sub
+#define arch_atomic_fetch_sub_release arch_atomic_fetch_sub
+#define arch_atomic_fetch_sub_relaxed arch_atomic_fetch_sub
+#else /* arch_atomic_fetch_sub_relaxed */
+
+#ifndef arch_atomic_fetch_sub_acquire
+static __always_inline int
+arch_atomic_fetch_sub_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_sub_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_sub_acquire arch_atomic_fetch_sub_acquire
+#endif
+
+#ifndef arch_atomic_fetch_sub_release
+static __always_inline int
+arch_atomic_fetch_sub_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_sub_relaxed(i, v);
+}
+#define arch_atomic_fetch_sub_release arch_atomic_fetch_sub_release
+#endif
+
+#ifndef arch_atomic_fetch_sub
+static __always_inline int
+arch_atomic_fetch_sub(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_sub_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_sub arch_atomic_fetch_sub
+#endif
+
+#endif /* arch_atomic_fetch_sub_relaxed */
+
+#ifndef arch_atomic_inc
+static __always_inline void
+arch_atomic_inc(atomic_t *v)
+{
+	arch_atomic_add(1, v);
+}
+#define arch_atomic_inc arch_atomic_inc
+#endif
+
+#ifndef arch_atomic_inc_return_relaxed
+#ifdef arch_atomic_inc_return
+#define arch_atomic_inc_return_acquire arch_atomic_inc_return
+#define arch_atomic_inc_return_release arch_atomic_inc_return
+#define arch_atomic_inc_return_relaxed arch_atomic_inc_return
+#endif /* arch_atomic_inc_return */
+
+#ifndef arch_atomic_inc_return
+static __always_inline int
+arch_atomic_inc_return(atomic_t *v)
+{
+	return arch_atomic_add_return(1, v);
+}
+#define arch_atomic_inc_return arch_atomic_inc_return
+#endif
+
+#ifndef arch_atomic_inc_return_acquire
+static __always_inline int
+arch_atomic_inc_return_acquire(atomic_t *v)
+{
+	return arch_atomic_add_return_acquire(1, v);
+}
+#define arch_atomic_inc_return_acquire arch_atomic_inc_return_acquire
+#endif
+
+#ifndef arch_atomic_inc_return_release
+static __always_inline int
+arch_atomic_inc_return_release(atomic_t *v)
+{
+	return arch_atomic_add_return_release(1, v);
+}
+#define arch_atomic_inc_return_release arch_atomic_inc_return_release
+#endif
+
+#ifndef arch_atomic_inc_return_relaxed
+static __always_inline int
+arch_atomic_inc_return_relaxed(atomic_t *v)
+{
+	return arch_atomic_add_return_relaxed(1, v);
+}
+#define arch_atomic_inc_return_relaxed arch_atomic_inc_return_relaxed
+#endif
+
+#else /* arch_atomic_inc_return_relaxed */
+
+#ifndef arch_atomic_inc_return_acquire
+static __always_inline int
+arch_atomic_inc_return_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_inc_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_inc_return_acquire arch_atomic_inc_return_acquire
+#endif
+
+#ifndef arch_atomic_inc_return_release
+static __always_inline int
+arch_atomic_inc_return_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_inc_return_relaxed(v);
+}
+#define arch_atomic_inc_return_release arch_atomic_inc_return_release
+#endif
+
+#ifndef arch_atomic_inc_return
+static __always_inline int
+arch_atomic_inc_return(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_inc_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_inc_return arch_atomic_inc_return
+#endif
+
+#endif /* arch_atomic_inc_return_relaxed */
+
+#ifndef arch_atomic_fetch_inc_relaxed
+#ifdef arch_atomic_fetch_inc
+#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc
+#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc
+#define arch_atomic_fetch_inc_relaxed arch_atomic_fetch_inc
+#endif /* arch_atomic_fetch_inc */
+
+#ifndef arch_atomic_fetch_inc
+static __always_inline int
+arch_atomic_fetch_inc(atomic_t *v)
+{
+	return arch_atomic_fetch_add(1, v);
+}
+#define arch_atomic_fetch_inc arch_atomic_fetch_inc
+#endif
+
+#ifndef arch_atomic_fetch_inc_acquire
+static __always_inline int
+arch_atomic_fetch_inc_acquire(atomic_t *v)
+{
+	return arch_atomic_fetch_add_acquire(1, v);
+}
+#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic_fetch_inc_release
+static __always_inline int
+arch_atomic_fetch_inc_release(atomic_t *v)
+{
+	return arch_atomic_fetch_add_release(1, v);
+}
+#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc_release
+#endif
+
+#ifndef arch_atomic_fetch_inc_relaxed
+static __always_inline int
+arch_atomic_fetch_inc_relaxed(atomic_t *v)
+{
+	return arch_atomic_fetch_add_relaxed(1, v);
+}
+#define arch_atomic_fetch_inc_relaxed arch_atomic_fetch_inc_relaxed
+#endif
+
+#else /* arch_atomic_fetch_inc_relaxed */
+
+#ifndef arch_atomic_fetch_inc_acquire
+static __always_inline int
+arch_atomic_fetch_inc_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_fetch_inc_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic_fetch_inc_release
+static __always_inline int
+arch_atomic_fetch_inc_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_inc_relaxed(v);
+}
+#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc_release
+#endif
+
+#ifndef arch_atomic_fetch_inc
+static __always_inline int
+arch_atomic_fetch_inc(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_inc_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_inc arch_atomic_fetch_inc
+#endif
+
+#endif /* arch_atomic_fetch_inc_relaxed */
+
+#ifndef arch_atomic_dec
+static __always_inline void
+arch_atomic_dec(atomic_t *v)
+{
+	arch_atomic_sub(1, v);
+}
+#define arch_atomic_dec arch_atomic_dec
+#endif
+
+#ifndef arch_atomic_dec_return_relaxed
+#ifdef arch_atomic_dec_return
+#define arch_atomic_dec_return_acquire arch_atomic_dec_return
+#define arch_atomic_dec_return_release arch_atomic_dec_return
+#define arch_atomic_dec_return_relaxed arch_atomic_dec_return
+#endif /* arch_atomic_dec_return */
+
+#ifndef arch_atomic_dec_return
+static __always_inline int
+arch_atomic_dec_return(atomic_t *v)
+{
+	return arch_atomic_sub_return(1, v);
+}
+#define arch_atomic_dec_return arch_atomic_dec_return
+#endif
+
+#ifndef arch_atomic_dec_return_acquire
+static __always_inline int
+arch_atomic_dec_return_acquire(atomic_t *v)
+{
+	return arch_atomic_sub_return_acquire(1, v);
+}
+#define arch_atomic_dec_return_acquire arch_atomic_dec_return_acquire
+#endif
+
+#ifndef arch_atomic_dec_return_release
+static __always_inline int
+arch_atomic_dec_return_release(atomic_t *v)
+{
+	return arch_atomic_sub_return_release(1, v);
+}
+#define arch_atomic_dec_return_release arch_atomic_dec_return_release
+#endif
+
+#ifndef arch_atomic_dec_return_relaxed
+static __always_inline int
+arch_atomic_dec_return_relaxed(atomic_t *v)
+{
+	return arch_atomic_sub_return_relaxed(1, v);
+}
+#define arch_atomic_dec_return_relaxed arch_atomic_dec_return_relaxed
+#endif
+
+#else /* arch_atomic_dec_return_relaxed */
+
+#ifndef arch_atomic_dec_return_acquire
+static __always_inline int
+arch_atomic_dec_return_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_dec_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_dec_return_acquire arch_atomic_dec_return_acquire
+#endif
+
+#ifndef arch_atomic_dec_return_release
+static __always_inline int
+arch_atomic_dec_return_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_dec_return_relaxed(v);
+}
+#define arch_atomic_dec_return_release arch_atomic_dec_return_release
+#endif
+
+#ifndef arch_atomic_dec_return
+static __always_inline int
+arch_atomic_dec_return(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_dec_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_dec_return arch_atomic_dec_return
+#endif
+
+#endif /* arch_atomic_dec_return_relaxed */
+
+#ifndef arch_atomic_fetch_dec_relaxed
+#ifdef arch_atomic_fetch_dec
+#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec
+#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec
+#define arch_atomic_fetch_dec_relaxed arch_atomic_fetch_dec
+#endif /* arch_atomic_fetch_dec */
+
+#ifndef arch_atomic_fetch_dec
+static __always_inline int
+arch_atomic_fetch_dec(atomic_t *v)
+{
+	return arch_atomic_fetch_sub(1, v);
+}
+#define arch_atomic_fetch_dec arch_atomic_fetch_dec
+#endif
+
+#ifndef arch_atomic_fetch_dec_acquire
+static __always_inline int
+arch_atomic_fetch_dec_acquire(atomic_t *v)
+{
+	return arch_atomic_fetch_sub_acquire(1, v);
+}
+#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic_fetch_dec_release
+static __always_inline int
+arch_atomic_fetch_dec_release(atomic_t *v)
+{
+	return arch_atomic_fetch_sub_release(1, v);
+}
+#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec_release
+#endif
+
+#ifndef arch_atomic_fetch_dec_relaxed
+static __always_inline int
+arch_atomic_fetch_dec_relaxed(atomic_t *v)
+{
+	return arch_atomic_fetch_sub_relaxed(1, v);
+}
+#define arch_atomic_fetch_dec_relaxed arch_atomic_fetch_dec_relaxed
+#endif
+
+#else /* arch_atomic_fetch_dec_relaxed */
+
+#ifndef arch_atomic_fetch_dec_acquire
+static __always_inline int
+arch_atomic_fetch_dec_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_fetch_dec_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic_fetch_dec_release
+static __always_inline int
+arch_atomic_fetch_dec_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_dec_relaxed(v);
+}
+#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec_release
+#endif
+
+#ifndef arch_atomic_fetch_dec
+static __always_inline int
+arch_atomic_fetch_dec(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_dec_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_dec arch_atomic_fetch_dec
+#endif
+
+#endif /* arch_atomic_fetch_dec_relaxed */
+
+#ifndef arch_atomic_fetch_and_relaxed
+#define arch_atomic_fetch_and_acquire arch_atomic_fetch_and
+#define arch_atomic_fetch_and_release arch_atomic_fetch_and
+#define arch_atomic_fetch_and_relaxed arch_atomic_fetch_and
+#else /* arch_atomic_fetch_and_relaxed */
+
+#ifndef arch_atomic_fetch_and_acquire
+static __always_inline int
+arch_atomic_fetch_and_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_and_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_and_acquire arch_atomic_fetch_and_acquire
+#endif
+
+#ifndef arch_atomic_fetch_and_release
+static __always_inline int
+arch_atomic_fetch_and_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_and_relaxed(i, v);
+}
+#define arch_atomic_fetch_and_release arch_atomic_fetch_and_release
+#endif
+
+#ifndef arch_atomic_fetch_and
+static __always_inline int
+arch_atomic_fetch_and(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_and_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_and arch_atomic_fetch_and
+#endif
+
+#endif /* arch_atomic_fetch_and_relaxed */
+
+#ifndef arch_atomic_andnot
+static __always_inline void
+arch_atomic_andnot(int i, atomic_t *v)
+{
+	arch_atomic_and(~i, v);
+}
+#define arch_atomic_andnot arch_atomic_andnot
+#endif
+
+#ifndef arch_atomic_fetch_andnot_relaxed
+#ifdef arch_atomic_fetch_andnot
+#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot
+#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot
+#define arch_atomic_fetch_andnot_relaxed arch_atomic_fetch_andnot
+#endif /* arch_atomic_fetch_andnot */
+
+#ifndef arch_atomic_fetch_andnot
+static __always_inline int
+arch_atomic_fetch_andnot(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and(~i, v);
+}
+#define arch_atomic_fetch_andnot arch_atomic_fetch_andnot
+#endif
+
+#ifndef arch_atomic_fetch_andnot_acquire
+static __always_inline int
+arch_atomic_fetch_andnot_acquire(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and_acquire(~i, v);
+}
+#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic_fetch_andnot_release
+static __always_inline int
+arch_atomic_fetch_andnot_release(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and_release(~i, v);
+}
+#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic_fetch_andnot_relaxed
+static __always_inline int
+arch_atomic_fetch_andnot_relaxed(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and_relaxed(~i, v);
+}
+#define arch_atomic_fetch_andnot_relaxed arch_atomic_fetch_andnot_relaxed
+#endif
+
+#else /* arch_atomic_fetch_andnot_relaxed */
+
+#ifndef arch_atomic_fetch_andnot_acquire
+static __always_inline int
+arch_atomic_fetch_andnot_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_andnot_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic_fetch_andnot_release
+static __always_inline int
+arch_atomic_fetch_andnot_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_andnot_relaxed(i, v);
+}
+#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic_fetch_andnot
+static __always_inline int
+arch_atomic_fetch_andnot(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_andnot_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_andnot arch_atomic_fetch_andnot
+#endif
+
+#endif /* arch_atomic_fetch_andnot_relaxed */
+
+#ifndef arch_atomic_fetch_or_relaxed
+#define arch_atomic_fetch_or_acquire arch_atomic_fetch_or
+#define arch_atomic_fetch_or_release arch_atomic_fetch_or
+#define arch_atomic_fetch_or_relaxed arch_atomic_fetch_or
+#else /* arch_atomic_fetch_or_relaxed */
+
+#ifndef arch_atomic_fetch_or_acquire
+static __always_inline int
+arch_atomic_fetch_or_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_or_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_or_acquire arch_atomic_fetch_or_acquire
+#endif
+
+#ifndef arch_atomic_fetch_or_release
+static __always_inline int
+arch_atomic_fetch_or_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_or_relaxed(i, v);
+}
+#define arch_atomic_fetch_or_release arch_atomic_fetch_or_release
+#endif
+
+#ifndef arch_atomic_fetch_or
+static __always_inline int
+arch_atomic_fetch_or(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_or_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_or arch_atomic_fetch_or
+#endif
+
+#endif /* arch_atomic_fetch_or_relaxed */
+
+#ifndef arch_atomic_fetch_xor_relaxed
+#define arch_atomic_fetch_xor_acquire arch_atomic_fetch_xor
+#define arch_atomic_fetch_xor_release arch_atomic_fetch_xor
+#define arch_atomic_fetch_xor_relaxed arch_atomic_fetch_xor
+#else /* arch_atomic_fetch_xor_relaxed */
+
+#ifndef arch_atomic_fetch_xor_acquire
+static __always_inline int
+arch_atomic_fetch_xor_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_xor_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_xor_acquire arch_atomic_fetch_xor_acquire
+#endif
+
+#ifndef arch_atomic_fetch_xor_release
+static __always_inline int
+arch_atomic_fetch_xor_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_xor_relaxed(i, v);
+}
+#define arch_atomic_fetch_xor_release arch_atomic_fetch_xor_release
+#endif
+
+#ifndef arch_atomic_fetch_xor
+static __always_inline int
+arch_atomic_fetch_xor(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_xor_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_xor arch_atomic_fetch_xor
+#endif
+
+#endif /* arch_atomic_fetch_xor_relaxed */
+
+#ifndef arch_atomic_xchg_relaxed
+#define arch_atomic_xchg_acquire arch_atomic_xchg
+#define arch_atomic_xchg_release arch_atomic_xchg
+#define arch_atomic_xchg_relaxed arch_atomic_xchg
+#else /* arch_atomic_xchg_relaxed */
+
+#ifndef arch_atomic_xchg_acquire
+static __always_inline int
+arch_atomic_xchg_acquire(atomic_t *v, int i)
+{
+	int ret = arch_atomic_xchg_relaxed(v, i);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_xchg_acquire arch_atomic_xchg_acquire
+#endif
+
+#ifndef arch_atomic_xchg_release
+static __always_inline int
+arch_atomic_xchg_release(atomic_t *v, int i)
+{
+	__atomic_release_fence();
+	return arch_atomic_xchg_relaxed(v, i);
+}
+#define arch_atomic_xchg_release arch_atomic_xchg_release
+#endif
+
+#ifndef arch_atomic_xchg
+static __always_inline int
+arch_atomic_xchg(atomic_t *v, int i)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_xchg_relaxed(v, i);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_xchg arch_atomic_xchg
+#endif
+
+#endif /* arch_atomic_xchg_relaxed */
+
+#ifndef arch_atomic_cmpxchg_relaxed
+#define arch_atomic_cmpxchg_acquire arch_atomic_cmpxchg
+#define arch_atomic_cmpxchg_release arch_atomic_cmpxchg
+#define arch_atomic_cmpxchg_relaxed arch_atomic_cmpxchg
+#else /* arch_atomic_cmpxchg_relaxed */
+
+#ifndef arch_atomic_cmpxchg_acquire
+static __always_inline int
+arch_atomic_cmpxchg_acquire(atomic_t *v, int old, int new)
+{
+	int ret = arch_atomic_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_cmpxchg_acquire arch_atomic_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic_cmpxchg_release
+static __always_inline int
+arch_atomic_cmpxchg_release(atomic_t *v, int old, int new)
+{
+	__atomic_release_fence();
+	return arch_atomic_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic_cmpxchg_release arch_atomic_cmpxchg_release
+#endif
+
+#ifndef arch_atomic_cmpxchg
+static __always_inline int
+arch_atomic_cmpxchg(atomic_t *v, int old, int new)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
+#endif
+
+#endif /* arch_atomic_cmpxchg_relaxed */
+
+#ifndef arch_atomic_try_cmpxchg_relaxed
+#ifdef arch_atomic_try_cmpxchg
+#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg
+#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg
+#define arch_atomic_try_cmpxchg_relaxed arch_atomic_try_cmpxchg
+#endif /* arch_atomic_try_cmpxchg */
+
+#ifndef arch_atomic_try_cmpxchg
+static __always_inline bool
+arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg_acquire(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_release
+static __always_inline bool
+arch_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg_release(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_relaxed
+static __always_inline bool
+arch_atomic_try_cmpxchg_relaxed(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg_relaxed(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg_relaxed arch_atomic_try_cmpxchg_relaxed
+#endif
+
+#else /* arch_atomic_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
+{
+	bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_release
+static __always_inline bool
+arch_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
+{
+	__atomic_release_fence();
+	return arch_atomic_try_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic_try_cmpxchg
+static __always_inline bool
+arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+{
+	bool ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
+#endif
+
+#endif /* arch_atomic_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic_sub_and_test
+/**
+ * arch_atomic_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type atomic_t
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic_sub_and_test(int i, atomic_t *v)
+{
+	return arch_atomic_sub_return(i, v) == 0;
+}
+#define arch_atomic_sub_and_test arch_atomic_sub_and_test
+#endif
+
+#ifndef arch_atomic_dec_and_test
+/**
+ * arch_atomic_dec_and_test - decrement and test
+ * @v: pointer of type atomic_t
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static __always_inline bool
+arch_atomic_dec_and_test(atomic_t *v)
+{
+	return arch_atomic_dec_return(v) == 0;
+}
+#define arch_atomic_dec_and_test arch_atomic_dec_and_test
+#endif
+
+#ifndef arch_atomic_inc_and_test
+/**
+ * arch_atomic_inc_and_test - increment and test
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic_inc_and_test(atomic_t *v)
+{
+	return arch_atomic_inc_return(v) == 0;
+}
+#define arch_atomic_inc_and_test arch_atomic_inc_and_test
+#endif
+
+#ifndef arch_atomic_add_negative
+/**
+ * arch_atomic_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic_add_negative(int i, atomic_t *v)
+{
+	return arch_atomic_add_return(i, v) < 0;
+}
+#define arch_atomic_add_negative arch_atomic_add_negative
+#endif
+
+#ifndef arch_atomic_fetch_add_unless
+/**
+ * arch_atomic_fetch_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, so long as @v was not already @u.
+ * Returns original value of @v
+ */
+static __always_inline int
+arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
+{
+	int c = arch_atomic_read(v);
+
+	do {
+		if (unlikely(c == u))
+			break;
+	} while (!arch_atomic_try_cmpxchg(v, &c, c + a));
+
+	return c;
+}
+#define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
+#endif
+
+#ifndef arch_atomic_add_unless
+/**
+ * arch_atomic_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, if @v was not already @u.
+ * Returns true if the addition was done.
+ */
+static __always_inline bool
+arch_atomic_add_unless(atomic_t *v, int a, int u)
+{
+	return arch_atomic_fetch_add_unless(v, a, u) != u;
+}
+#define arch_atomic_add_unless arch_atomic_add_unless
+#endif
+
+#ifndef arch_atomic_inc_not_zero
+/**
+ * arch_atomic_inc_not_zero - increment unless the number is zero
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1, if @v is non-zero.
+ * Returns true if the increment was done.
+ */
+static __always_inline bool
+arch_atomic_inc_not_zero(atomic_t *v)
+{
+	return arch_atomic_add_unless(v, 1, 0);
+}
+#define arch_atomic_inc_not_zero arch_atomic_inc_not_zero
+#endif
+
+#ifndef arch_atomic_inc_unless_negative
+static __always_inline bool
+arch_atomic_inc_unless_negative(atomic_t *v)
+{
+	int c = arch_atomic_read(v);
+
+	do {
+		if (unlikely(c < 0))
+			return false;
+	} while (!arch_atomic_try_cmpxchg(v, &c, c + 1));
+
+	return true;
+}
+#define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
+#endif
+
+#ifndef arch_atomic_dec_unless_positive
+static __always_inline bool
+arch_atomic_dec_unless_positive(atomic_t *v)
+{
+	int c = arch_atomic_read(v);
+
+	do {
+		if (unlikely(c > 0))
+			return false;
+	} while (!arch_atomic_try_cmpxchg(v, &c, c - 1));
+
+	return true;
+}
+#define arch_atomic_dec_unless_positive arch_atomic_dec_unless_positive
+#endif
+
+#ifndef arch_atomic_dec_if_positive
+static __always_inline int
+arch_atomic_dec_if_positive(atomic_t *v)
+{
+	int dec, c = arch_atomic_read(v);
+
+	do {
+		dec = c - 1;
+		if (unlikely(dec < 0))
+			break;
+	} while (!arch_atomic_try_cmpxchg(v, &c, dec));
+
+	return dec;
+}
+#define arch_atomic_dec_if_positive arch_atomic_dec_if_positive
+#endif
+
+#ifdef CONFIG_GENERIC_ATOMIC64
+#include <asm-generic/atomic64.h>
+#endif
+
+#ifndef arch_atomic64_read_acquire
+static __always_inline s64
+arch_atomic64_read_acquire(const atomic64_t *v)
+{
+	return smp_load_acquire(&(v)->counter);
+}
+#define arch_atomic64_read_acquire arch_atomic64_read_acquire
+#endif
+
+#ifndef arch_atomic64_set_release
+static __always_inline void
+arch_atomic64_set_release(atomic64_t *v, s64 i)
+{
+	smp_store_release(&(v)->counter, i);
+}
+#define arch_atomic64_set_release arch_atomic64_set_release
+#endif
+
+#ifndef arch_atomic64_add_return_relaxed
+#define arch_atomic64_add_return_acquire arch_atomic64_add_return
+#define arch_atomic64_add_return_release arch_atomic64_add_return
+#define arch_atomic64_add_return_relaxed arch_atomic64_add_return
+#else /* arch_atomic64_add_return_relaxed */
+
+#ifndef arch_atomic64_add_return_acquire
+static __always_inline s64
+arch_atomic64_add_return_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_add_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_add_return_acquire arch_atomic64_add_return_acquire
+#endif
+
+#ifndef arch_atomic64_add_return_release
+static __always_inline s64
+arch_atomic64_add_return_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_add_return_relaxed(i, v);
+}
+#define arch_atomic64_add_return_release arch_atomic64_add_return_release
+#endif
+
+#ifndef arch_atomic64_add_return
+static __always_inline s64
+arch_atomic64_add_return(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_add_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_add_return arch_atomic64_add_return
+#endif
+
+#endif /* arch_atomic64_add_return_relaxed */
+
+#ifndef arch_atomic64_fetch_add_relaxed
+#define arch_atomic64_fetch_add_acquire arch_atomic64_fetch_add
+#define arch_atomic64_fetch_add_release arch_atomic64_fetch_add
+#define arch_atomic64_fetch_add_relaxed arch_atomic64_fetch_add
+#else /* arch_atomic64_fetch_add_relaxed */
+
+#ifndef arch_atomic64_fetch_add_acquire
+static __always_inline s64
+arch_atomic64_fetch_add_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_add_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_add_acquire arch_atomic64_fetch_add_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_add_release
+static __always_inline s64
+arch_atomic64_fetch_add_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_add_relaxed(i, v);
+}
+#define arch_atomic64_fetch_add_release arch_atomic64_fetch_add_release
+#endif
+
+#ifndef arch_atomic64_fetch_add
+static __always_inline s64
+arch_atomic64_fetch_add(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_add_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_add arch_atomic64_fetch_add
+#endif
+
+#endif /* arch_atomic64_fetch_add_relaxed */
+
+#ifndef arch_atomic64_sub_return_relaxed
+#define arch_atomic64_sub_return_acquire arch_atomic64_sub_return
+#define arch_atomic64_sub_return_release arch_atomic64_sub_return
+#define arch_atomic64_sub_return_relaxed arch_atomic64_sub_return
+#else /* arch_atomic64_sub_return_relaxed */
+
+#ifndef arch_atomic64_sub_return_acquire
+static __always_inline s64
+arch_atomic64_sub_return_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_sub_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_sub_return_acquire arch_atomic64_sub_return_acquire
+#endif
+
+#ifndef arch_atomic64_sub_return_release
+static __always_inline s64
+arch_atomic64_sub_return_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_sub_return_relaxed(i, v);
+}
+#define arch_atomic64_sub_return_release arch_atomic64_sub_return_release
+#endif
+
+#ifndef arch_atomic64_sub_return
+static __always_inline s64
+arch_atomic64_sub_return(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_sub_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_sub_return arch_atomic64_sub_return
+#endif
+
+#endif /* arch_atomic64_sub_return_relaxed */
+
+#ifndef arch_atomic64_fetch_sub_relaxed
+#define arch_atomic64_fetch_sub_acquire arch_atomic64_fetch_sub
+#define arch_atomic64_fetch_sub_release arch_atomic64_fetch_sub
+#define arch_atomic64_fetch_sub_relaxed arch_atomic64_fetch_sub
+#else /* arch_atomic64_fetch_sub_relaxed */
+
+#ifndef arch_atomic64_fetch_sub_acquire
+static __always_inline s64
+arch_atomic64_fetch_sub_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_sub_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_sub_acquire arch_atomic64_fetch_sub_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_sub_release
+static __always_inline s64
+arch_atomic64_fetch_sub_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_sub_relaxed(i, v);
+}
+#define arch_atomic64_fetch_sub_release arch_atomic64_fetch_sub_release
+#endif
+
+#ifndef arch_atomic64_fetch_sub
+static __always_inline s64
+arch_atomic64_fetch_sub(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_sub_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_sub arch_atomic64_fetch_sub
+#endif
+
+#endif /* arch_atomic64_fetch_sub_relaxed */
+
+#ifndef arch_atomic64_inc
+static __always_inline void
+arch_atomic64_inc(atomic64_t *v)
+{
+	arch_atomic64_add(1, v);
+}
+#define arch_atomic64_inc arch_atomic64_inc
+#endif
+
+#ifndef arch_atomic64_inc_return_relaxed
+#ifdef arch_atomic64_inc_return
+#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return
+#define arch_atomic64_inc_return_release arch_atomic64_inc_return
+#define arch_atomic64_inc_return_relaxed arch_atomic64_inc_return
+#endif /* arch_atomic64_inc_return */
+
+#ifndef arch_atomic64_inc_return
+static __always_inline s64
+arch_atomic64_inc_return(atomic64_t *v)
+{
+	return arch_atomic64_add_return(1, v);
+}
+#define arch_atomic64_inc_return arch_atomic64_inc_return
+#endif
+
+#ifndef arch_atomic64_inc_return_acquire
+static __always_inline s64
+arch_atomic64_inc_return_acquire(atomic64_t *v)
+{
+	return arch_atomic64_add_return_acquire(1, v);
+}
+#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return_acquire
+#endif
+
+#ifndef arch_atomic64_inc_return_release
+static __always_inline s64
+arch_atomic64_inc_return_release(atomic64_t *v)
+{
+	return arch_atomic64_add_return_release(1, v);
+}
+#define arch_atomic64_inc_return_release arch_atomic64_inc_return_release
+#endif
+
+#ifndef arch_atomic64_inc_return_relaxed
+static __always_inline s64
+arch_atomic64_inc_return_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_add_return_relaxed(1, v);
+}
+#define arch_atomic64_inc_return_relaxed arch_atomic64_inc_return_relaxed
+#endif
+
+#else /* arch_atomic64_inc_return_relaxed */
+
+#ifndef arch_atomic64_inc_return_acquire
+static __always_inline s64
+arch_atomic64_inc_return_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_inc_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return_acquire
+#endif
+
+#ifndef arch_atomic64_inc_return_release
+static __always_inline s64
+arch_atomic64_inc_return_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_inc_return_relaxed(v);
+}
+#define arch_atomic64_inc_return_release arch_atomic64_inc_return_release
+#endif
+
+#ifndef arch_atomic64_inc_return
+static __always_inline s64
+arch_atomic64_inc_return(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_inc_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_inc_return arch_atomic64_inc_return
+#endif
+
+#endif /* arch_atomic64_inc_return_relaxed */
+
+#ifndef arch_atomic64_fetch_inc_relaxed
+#ifdef arch_atomic64_fetch_inc
+#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc
+#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc
+#define arch_atomic64_fetch_inc_relaxed arch_atomic64_fetch_inc
+#endif /* arch_atomic64_fetch_inc */
+
+#ifndef arch_atomic64_fetch_inc
+static __always_inline s64
+arch_atomic64_fetch_inc(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add(1, v);
+}
+#define arch_atomic64_fetch_inc arch_atomic64_fetch_inc
+#endif
+
+#ifndef arch_atomic64_fetch_inc_acquire
+static __always_inline s64
+arch_atomic64_fetch_inc_acquire(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add_acquire(1, v);
+}
+#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_inc_release
+static __always_inline s64
+arch_atomic64_fetch_inc_release(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add_release(1, v);
+}
+#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc_release
+#endif
+
+#ifndef arch_atomic64_fetch_inc_relaxed
+static __always_inline s64
+arch_atomic64_fetch_inc_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add_relaxed(1, v);
+}
+#define arch_atomic64_fetch_inc_relaxed arch_atomic64_fetch_inc_relaxed
+#endif
+
+#else /* arch_atomic64_fetch_inc_relaxed */
+
+#ifndef arch_atomic64_fetch_inc_acquire
+static __always_inline s64
+arch_atomic64_fetch_inc_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_inc_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_inc_release
+static __always_inline s64
+arch_atomic64_fetch_inc_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_inc_relaxed(v);
+}
+#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc_release
+#endif
+
+#ifndef arch_atomic64_fetch_inc
+static __always_inline s64
+arch_atomic64_fetch_inc(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_inc_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_inc arch_atomic64_fetch_inc
+#endif
+
+#endif /* arch_atomic64_fetch_inc_relaxed */
+
+#ifndef arch_atomic64_dec
+static __always_inline void
+arch_atomic64_dec(atomic64_t *v)
+{
+	arch_atomic64_sub(1, v);
+}
+#define arch_atomic64_dec arch_atomic64_dec
+#endif
+
+#ifndef arch_atomic64_dec_return_relaxed
+#ifdef arch_atomic64_dec_return
+#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return
+#define arch_atomic64_dec_return_release arch_atomic64_dec_return
+#define arch_atomic64_dec_return_relaxed arch_atomic64_dec_return
+#endif /* arch_atomic64_dec_return */
+
+#ifndef arch_atomic64_dec_return
+static __always_inline s64
+arch_atomic64_dec_return(atomic64_t *v)
+{
+	return arch_atomic64_sub_return(1, v);
+}
+#define arch_atomic64_dec_return arch_atomic64_dec_return
+#endif
+
+#ifndef arch_atomic64_dec_return_acquire
+static __always_inline s64
+arch_atomic64_dec_return_acquire(atomic64_t *v)
+{
+	return arch_atomic64_sub_return_acquire(1, v);
+}
+#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return_acquire
+#endif
+
+#ifndef arch_atomic64_dec_return_release
+static __always_inline s64
+arch_atomic64_dec_return_release(atomic64_t *v)
+{
+	return arch_atomic64_sub_return_release(1, v);
+}
+#define arch_atomic64_dec_return_release arch_atomic64_dec_return_release
+#endif
+
+#ifndef arch_atomic64_dec_return_relaxed
+static __always_inline s64
+arch_atomic64_dec_return_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_sub_return_relaxed(1, v);
+}
+#define arch_atomic64_dec_return_relaxed arch_atomic64_dec_return_relaxed
+#endif
+
+#else /* arch_atomic64_dec_return_relaxed */
+
+#ifndef arch_atomic64_dec_return_acquire
+static __always_inline s64
+arch_atomic64_dec_return_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_dec_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return_acquire
+#endif
+
+#ifndef arch_atomic64_dec_return_release
+static __always_inline s64
+arch_atomic64_dec_return_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_dec_return_relaxed(v);
+}
+#define arch_atomic64_dec_return_release arch_atomic64_dec_return_release
+#endif
+
+#ifndef arch_atomic64_dec_return
+static __always_inline s64
+arch_atomic64_dec_return(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_dec_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_dec_return arch_atomic64_dec_return
+#endif
+
+#endif /* arch_atomic64_dec_return_relaxed */
+
+#ifndef arch_atomic64_fetch_dec_relaxed
+#ifdef arch_atomic64_fetch_dec
+#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec
+#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec
+#define arch_atomic64_fetch_dec_relaxed arch_atomic64_fetch_dec
+#endif /* arch_atomic64_fetch_dec */
+
+#ifndef arch_atomic64_fetch_dec
+static __always_inline s64
+arch_atomic64_fetch_dec(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub(1, v);
+}
+#define arch_atomic64_fetch_dec arch_atomic64_fetch_dec
+#endif
+
+#ifndef arch_atomic64_fetch_dec_acquire
+static __always_inline s64
+arch_atomic64_fetch_dec_acquire(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub_acquire(1, v);
+}
+#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_dec_release
+static __always_inline s64
+arch_atomic64_fetch_dec_release(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub_release(1, v);
+}
+#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec_release
+#endif
+
+#ifndef arch_atomic64_fetch_dec_relaxed
+static __always_inline s64
+arch_atomic64_fetch_dec_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub_relaxed(1, v);
+}
+#define arch_atomic64_fetch_dec_relaxed arch_atomic64_fetch_dec_relaxed
+#endif
+
+#else /* arch_atomic64_fetch_dec_relaxed */
+
+#ifndef arch_atomic64_fetch_dec_acquire
+static __always_inline s64
+arch_atomic64_fetch_dec_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_dec_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_dec_release
+static __always_inline s64
+arch_atomic64_fetch_dec_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_dec_relaxed(v);
+}
+#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec_release
+#endif
+
+#ifndef arch_atomic64_fetch_dec
+static __always_inline s64
+arch_atomic64_fetch_dec(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_dec_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_dec arch_atomic64_fetch_dec
+#endif
+
+#endif /* arch_atomic64_fetch_dec_relaxed */
+
+#ifndef arch_atomic64_fetch_and_relaxed
+#define arch_atomic64_fetch_and_acquire arch_atomic64_fetch_and
+#define arch_atomic64_fetch_and_release arch_atomic64_fetch_and
+#define arch_atomic64_fetch_and_relaxed arch_atomic64_fetch_and
+#else /* arch_atomic64_fetch_and_relaxed */
+
+#ifndef arch_atomic64_fetch_and_acquire
+static __always_inline s64
+arch_atomic64_fetch_and_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_and_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_and_acquire arch_atomic64_fetch_and_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_and_release
+static __always_inline s64
+arch_atomic64_fetch_and_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_and_relaxed(i, v);
+}
+#define arch_atomic64_fetch_and_release arch_atomic64_fetch_and_release
+#endif
+
+#ifndef arch_atomic64_fetch_and
+static __always_inline s64
+arch_atomic64_fetch_and(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_and_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_and arch_atomic64_fetch_and
+#endif
+
+#endif /* arch_atomic64_fetch_and_relaxed */
+
+#ifndef arch_atomic64_andnot
+static __always_inline void
+arch_atomic64_andnot(s64 i, atomic64_t *v)
+{
+	arch_atomic64_and(~i, v);
+}
+#define arch_atomic64_andnot arch_atomic64_andnot
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_relaxed
+#ifdef arch_atomic64_fetch_andnot
+#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot
+#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot
+#define arch_atomic64_fetch_andnot_relaxed arch_atomic64_fetch_andnot
+#endif /* arch_atomic64_fetch_andnot */
+
+#ifndef arch_atomic64_fetch_andnot
+static __always_inline s64
+arch_atomic64_fetch_andnot(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and(~i, v);
+}
+#define arch_atomic64_fetch_andnot arch_atomic64_fetch_andnot
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_acquire
+static __always_inline s64
+arch_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and_acquire(~i, v);
+}
+#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_release
+static __always_inline s64
+arch_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and_release(~i, v);
+}
+#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_relaxed
+static __always_inline s64
+arch_atomic64_fetch_andnot_relaxed(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and_relaxed(~i, v);
+}
+#define arch_atomic64_fetch_andnot_relaxed arch_atomic64_fetch_andnot_relaxed
+#endif
+
+#else /* arch_atomic64_fetch_andnot_relaxed */
+
+#ifndef arch_atomic64_fetch_andnot_acquire
+static __always_inline s64
+arch_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_andnot_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_release
+static __always_inline s64
+arch_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_andnot_relaxed(i, v);
+}
+#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic64_fetch_andnot
+static __always_inline s64
+arch_atomic64_fetch_andnot(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_andnot_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_andnot arch_atomic64_fetch_andnot
+#endif
+
+#endif /* arch_atomic64_fetch_andnot_relaxed */
+
+#ifndef arch_atomic64_fetch_or_relaxed
+#define arch_atomic64_fetch_or_acquire arch_atomic64_fetch_or
+#define arch_atomic64_fetch_or_release arch_atomic64_fetch_or
+#define arch_atomic64_fetch_or_relaxed arch_atomic64_fetch_or
+#else /* arch_atomic64_fetch_or_relaxed */
+
+#ifndef arch_atomic64_fetch_or_acquire
+static __always_inline s64
+arch_atomic64_fetch_or_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_or_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_or_acquire arch_atomic64_fetch_or_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_or_release
+static __always_inline s64
+arch_atomic64_fetch_or_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_or_relaxed(i, v);
+}
+#define arch_atomic64_fetch_or_release arch_atomic64_fetch_or_release
+#endif
+
+#ifndef arch_atomic64_fetch_or
+static __always_inline s64
+arch_atomic64_fetch_or(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_or_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_or arch_atomic64_fetch_or
+#endif
+
+#endif /* arch_atomic64_fetch_or_relaxed */
+
+#ifndef arch_atomic64_fetch_xor_relaxed
+#define arch_atomic64_fetch_xor_acquire arch_atomic64_fetch_xor
+#define arch_atomic64_fetch_xor_release arch_atomic64_fetch_xor
+#define arch_atomic64_fetch_xor_relaxed arch_atomic64_fetch_xor
+#else /* arch_atomic64_fetch_xor_relaxed */
+
+#ifndef arch_atomic64_fetch_xor_acquire
+static __always_inline s64
+arch_atomic64_fetch_xor_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_xor_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_xor_acquire arch_atomic64_fetch_xor_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_xor_release
+static __always_inline s64
+arch_atomic64_fetch_xor_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_xor_relaxed(i, v);
+}
+#define arch_atomic64_fetch_xor_release arch_atomic64_fetch_xor_release
+#endif
+
+#ifndef arch_atomic64_fetch_xor
+static __always_inline s64
+arch_atomic64_fetch_xor(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_xor_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
+#endif
+
+#endif /* arch_atomic64_fetch_xor_relaxed */
+
+#ifndef arch_atomic64_xchg_relaxed
+#define arch_atomic64_xchg_acquire arch_atomic64_xchg
+#define arch_atomic64_xchg_release arch_atomic64_xchg
+#define arch_atomic64_xchg_relaxed arch_atomic64_xchg
+#else /* arch_atomic64_xchg_relaxed */
+
+#ifndef arch_atomic64_xchg_acquire
+static __always_inline s64
+arch_atomic64_xchg_acquire(atomic64_t *v, s64 i)
+{
+	s64 ret = arch_atomic64_xchg_relaxed(v, i);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_xchg_acquire arch_atomic64_xchg_acquire
+#endif
+
+#ifndef arch_atomic64_xchg_release
+static __always_inline s64
+arch_atomic64_xchg_release(atomic64_t *v, s64 i)
+{
+	__atomic_release_fence();
+	return arch_atomic64_xchg_relaxed(v, i);
+}
+#define arch_atomic64_xchg_release arch_atomic64_xchg_release
+#endif
+
+#ifndef arch_atomic64_xchg
+static __always_inline s64
+arch_atomic64_xchg(atomic64_t *v, s64 i)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_xchg_relaxed(v, i);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_xchg arch_atomic64_xchg
+#endif
+
+#endif /* arch_atomic64_xchg_relaxed */
+
+#ifndef arch_atomic64_cmpxchg_relaxed
+#define arch_atomic64_cmpxchg_acquire arch_atomic64_cmpxchg
+#define arch_atomic64_cmpxchg_release arch_atomic64_cmpxchg
+#define arch_atomic64_cmpxchg_relaxed arch_atomic64_cmpxchg
+#else /* arch_atomic64_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_cmpxchg_acquire
+static __always_inline s64
+arch_atomic64_cmpxchg_acquire(atomic64_t *v, s64 old, s64 new)
+{
+	s64 ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_cmpxchg_acquire arch_atomic64_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic64_cmpxchg_release
+static __always_inline s64
+arch_atomic64_cmpxchg_release(atomic64_t *v, s64 old, s64 new)
+{
+	__atomic_release_fence();
+	return arch_atomic64_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic64_cmpxchg_release arch_atomic64_cmpxchg_release
+#endif
+
+#ifndef arch_atomic64_cmpxchg
+static __always_inline s64
+arch_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
+#endif
+
+#endif /* arch_atomic64_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_try_cmpxchg_relaxed
+#ifdef arch_atomic64_try_cmpxchg
+#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg
+#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg
+#define arch_atomic64_try_cmpxchg_relaxed arch_atomic64_try_cmpxchg
+#endif /* arch_atomic64_try_cmpxchg */
+
+#ifndef arch_atomic64_try_cmpxchg
+static __always_inline bool
+arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg_acquire(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_release
+static __always_inline bool
+arch_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg_release(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_relaxed
+static __always_inline bool
+arch_atomic64_try_cmpxchg_relaxed(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg_relaxed(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg_relaxed arch_atomic64_try_cmpxchg_relaxed
+#endif
+
+#else /* arch_atomic64_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
+{
+	bool ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_release
+static __always_inline bool
+arch_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
+{
+	__atomic_release_fence();
+	return arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg
+static __always_inline bool
+arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
+{
+	bool ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
+#endif
+
+#endif /* arch_atomic64_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_sub_and_test
+/**
+ * arch_atomic64_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic64_sub_and_test(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_sub_return(i, v) == 0;
+}
+#define arch_atomic64_sub_and_test arch_atomic64_sub_and_test
+#endif
+
+#ifndef arch_atomic64_dec_and_test
+/**
+ * arch_atomic64_dec_and_test - decrement and test
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static __always_inline bool
+arch_atomic64_dec_and_test(atomic64_t *v)
+{
+	return arch_atomic64_dec_return(v) == 0;
+}
+#define arch_atomic64_dec_and_test arch_atomic64_dec_and_test
+#endif
+
+#ifndef arch_atomic64_inc_and_test
+/**
+ * arch_atomic64_inc_and_test - increment and test
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic64_inc_and_test(atomic64_t *v)
+{
+	return arch_atomic64_inc_return(v) == 0;
+}
+#define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
+#endif
+
+#ifndef arch_atomic64_add_negative
+/**
+ * arch_atomic64_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic64_add_negative(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_add_return(i, v) < 0;
+}
+#define arch_atomic64_add_negative arch_atomic64_add_negative
+#endif
+
+#ifndef arch_atomic64_fetch_add_unless
+/**
+ * arch_atomic64_fetch_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic64_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, so long as @v was not already @u.
+ * Returns original value of @v
+ */
+static __always_inline s64
+arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
+{
+	s64 c = arch_atomic64_read(v);
+
+	do {
+		if (unlikely(c == u))
+			break;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, c + a));
+
+	return c;
+}
+#define arch_atomic64_fetch_add_unless arch_atomic64_fetch_add_unless
+#endif
+
+#ifndef arch_atomic64_add_unless
+/**
+ * arch_atomic64_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic64_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, if @v was not already @u.
+ * Returns true if the addition was done.
+ */
+static __always_inline bool
+arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
+{
+	return arch_atomic64_fetch_add_unless(v, a, u) != u;
+}
+#define arch_atomic64_add_unless arch_atomic64_add_unless
+#endif
+
+#ifndef arch_atomic64_inc_not_zero
+/**
+ * arch_atomic64_inc_not_zero - increment unless the number is zero
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically increments @v by 1, if @v is non-zero.
+ * Returns true if the increment was done.
+ */
+static __always_inline bool
+arch_atomic64_inc_not_zero(atomic64_t *v)
+{
+	return arch_atomic64_add_unless(v, 1, 0);
+}
+#define arch_atomic64_inc_not_zero arch_atomic64_inc_not_zero
+#endif
+
+#ifndef arch_atomic64_inc_unless_negative
+static __always_inline bool
+arch_atomic64_inc_unless_negative(atomic64_t *v)
+{
+	s64 c = arch_atomic64_read(v);
+
+	do {
+		if (unlikely(c < 0))
+			return false;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, c + 1));
+
+	return true;
+}
+#define arch_atomic64_inc_unless_negative arch_atomic64_inc_unless_negative
+#endif
+
+#ifndef arch_atomic64_dec_unless_positive
+static __always_inline bool
+arch_atomic64_dec_unless_positive(atomic64_t *v)
+{
+	s64 c = arch_atomic64_read(v);
+
+	do {
+		if (unlikely(c > 0))
+			return false;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, c - 1));
+
+	return true;
+}
+#define arch_atomic64_dec_unless_positive arch_atomic64_dec_unless_positive
+#endif
+
+#ifndef arch_atomic64_dec_if_positive
+static __always_inline s64
+arch_atomic64_dec_if_positive(atomic64_t *v)
+{
+	s64 dec, c = arch_atomic64_read(v);
+
+	do {
+		dec = c - 1;
+		if (unlikely(dec < 0))
+			break;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, dec));
+
+	return dec;
+}
+#define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
+#endif
+
+#endif /* _LINUX_ATOMIC_FALLBACK_H */
+// 90cd26cfd69d2250303d654955a0cc12620fb91b
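The fallbacks above all follow one construction: when an architecture supplies only the _relaxed form of an operation, the acquire variant is derived by running the relaxed op followed by __atomic_acquire_fence(), the release variant by issuing __atomic_release_fence() first, and the fully ordered variant by bracketing the relaxed op with __atomic_pre_full_fence()/__atomic_post_full_fence(). A minimal sketch of the architecture side, assuming a hypothetical low-level primitive my_arch_add_return_relaxed() (illustrative only, not part of this patch):

	/* arch/<arch>/include/asm/atomic.h -- hypothetical example */
	static __always_inline int
	arch_atomic_add_return_relaxed(int i, atomic_t *v)
	{
		/* the architecture implements only the relaxed primitive */
		return my_arch_add_return_relaxed(i, v);
	}
	#define arch_atomic_add_return_relaxed arch_atomic_add_return_relaxed

With just that in place, the generated header provides arch_atomic_add_return(), arch_atomic_add_return_acquire() and arch_atomic_add_return_release() via the fence-based wrappers shown above; no further per-architecture code is required.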
--- a/include/linux/atomic-fallback.h
+++ b/include/linux/atomic-fallback.h
@@ -1180,9 +1180,6 @@ atomic_dec_if_positive(atomic_t *v)
 #define atomic_dec_if_positive atomic_dec_if_positive
 #endif
 
-#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif
@@ -2290,8 +2287,5 @@ atomic64_dec_if_positive(atomic64_t *v)
 #define atomic64_dec_if_positive atomic64_dec_if_positive
 #endif
 
-#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// baaf45f4c24ed88ceae58baca39d7fd80bb8101b
+// 1fac0941c79bf0ae100723cc2ac9b94061f0b67a
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -25,6 +25,12 @@
  * See Documentation/memory-barriers.txt for ACQUIRE/RELEASE definitions.
  */
 
+#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
+#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
+
+#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
+#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
+
 /*
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
@@ -71,7 +77,12 @@
 	__ret;								\
 })
 
+#ifdef ARCH_ATOMIC
+#include <linux/atomic-arch-fallback.h>
+#include <asm-generic/atomic-instrumented.h>
+#else
 #include <linux/atomic-fallback.h>
+#endif
 
 #include <asm-generic/atomic-long.h>
 
--- a/scripts/atomic/fallbacks/acquire
+++ b/scripts/atomic/fallbacks/acquire
@@ -1,8 +1,8 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}${name}${sfx}_acquire(${params})
+${arch}${atomic}_${pfx}${name}${sfx}_acquire(${params})
 {
-	${ret} ret = ${atomic}_${pfx}${name}${sfx}_relaxed(${args});
+	${ret} ret = ${arch}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
 	__atomic_acquire_fence();
 	return ret;
 }
--- a/scripts/atomic/fallbacks/add_negative
+++ b/scripts/atomic/fallbacks/add_negative
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_add_negative - add and test if negative
+ * ${arch}${atomic}_add_negative - add and test if negative
  * @i: integer value to add
  * @v: pointer of type ${atomic}_t
  *
@@ -9,8 +9,8 @@ cat <<EOF
  * result is greater than or equal to zero.
  */
 static __always_inline bool
-${atomic}_add_negative(${int} i, ${atomic}_t *v)
+${arch}${atomic}_add_negative(${int} i, ${atomic}_t *v)
 {
-	return ${atomic}_add_return(i, v) < 0;
+	return ${arch}${atomic}_add_return(i, v) < 0;
 }
 EOF
--- a/scripts/atomic/fallbacks/add_unless
+++ b/scripts/atomic/fallbacks/add_unless
@@ -1,6 +1,6 @@
 cat << EOF
 /**
- * ${atomic}_add_unless - add unless the number is already a given value
+ * ${arch}${atomic}_add_unless - add unless the number is already a given value
  * @v: pointer of type ${atomic}_t
  * @a: the amount to add to v...
  * @u: ...unless v is equal to u.
@@ -9,8 +9,8 @@ cat << EOF
  * Returns true if the addition was done.
  */
 static __always_inline bool
-${atomic}_add_unless(${atomic}_t *v, ${int} a, ${int} u)
+${arch}${atomic}_add_unless(${atomic}_t *v, ${int} a, ${int} u)
 {
-	return ${atomic}_fetch_add_unless(v, a, u) != u;
+	return ${arch}${atomic}_fetch_add_unless(v, a, u) != u;
 }
 EOF
--- a/scripts/atomic/fallbacks/andnot
+++ b/scripts/atomic/fallbacks/andnot
@@ -1,7 +1,7 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}andnot${sfx}${order}(${int} i, ${atomic}_t *v)
+${arch}${atomic}_${pfx}andnot${sfx}${order}(${int} i, ${atomic}_t *v)
 {
-	${retstmt}${atomic}_${pfx}and${sfx}${order}(~i, v);
+	${retstmt}${arch}${atomic}_${pfx}and${sfx}${order}(~i, v);
 }
 EOF
--- a/scripts/atomic/fallbacks/dec
+++ b/scripts/atomic/fallbacks/dec
@@ -1,7 +1,7 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}dec${sfx}${order}(${atomic}_t *v)
+${arch}${atomic}_${pfx}dec${sfx}${order}(${atomic}_t *v)
 {
-	${retstmt}${atomic}_${pfx}sub${sfx}${order}(1, v);
+	${retstmt}${arch}${atomic}_${pfx}sub${sfx}${order}(1, v);
 }
 EOF
--- a/scripts/atomic/fallbacks/dec_and_test
+++ b/scripts/atomic/fallbacks/dec_and_test
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_dec_and_test - decrement and test
+ * ${arch}${atomic}_dec_and_test - decrement and test
  * @v: pointer of type ${atomic}_t
  *
  * Atomically decrements @v by 1 and
@@ -8,8 +8,8 @@ cat <<EOF
  * cases.
  */
 static __always_inline bool
-${atomic}_dec_and_test(${atomic}_t *v)
+${arch}${atomic}_dec_and_test(${atomic}_t *v)
 {
-	return ${atomic}_dec_return(v) == 0;
+	return ${arch}${atomic}_dec_return(v) == 0;
 }
 EOF
--- a/scripts/atomic/fallbacks/dec_if_positive
+++ b/scripts/atomic/fallbacks/dec_if_positive
@@ -1,14 +1,14 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_dec_if_positive(${atomic}_t *v)
+${arch}${atomic}_dec_if_positive(${atomic}_t *v)
 {
-	${int} dec, c = ${atomic}_read(v);
+	${int} dec, c = ${arch}${atomic}_read(v);
 
 	do {
 		dec = c - 1;
 		if (unlikely(dec < 0))
 			break;
-	} while (!${atomic}_try_cmpxchg(v, &c, dec));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, dec));
 
 	return dec;
 }
--- a/scripts/atomic/fallbacks/dec_unless_positive
+++ b/scripts/atomic/fallbacks/dec_unless_positive
@@ -1,13 +1,13 @@
 cat <<EOF
 static __always_inline bool
-${atomic}_dec_unless_positive(${atomic}_t *v)
+${arch}${atomic}_dec_unless_positive(${atomic}_t *v)
 {
-	${int} c = ${atomic}_read(v);
+	${int} c = ${arch}${atomic}_read(v);
 
 	do {
 		if (unlikely(c > 0))
 			return false;
-	} while (!${atomic}_try_cmpxchg(v, &c, c - 1));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, c - 1));
 
 	return true;
 }
--- a/scripts/atomic/fallbacks/fence
+++ b/scripts/atomic/fallbacks/fence
@@ -1,10 +1,10 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}${name}${sfx}(${params})
+${arch}${atomic}_${pfx}${name}${sfx}(${params})
 {
 	${ret} ret;
 	__atomic_pre_full_fence();
-	ret = ${atomic}_${pfx}${name}${sfx}_relaxed(${args});
+	ret = ${arch}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
 	__atomic_post_full_fence();
 	return ret;
 }
--- a/scripts/atomic/fallbacks/fetch_add_unless
+++ b/scripts/atomic/fallbacks/fetch_add_unless
@@ -1,6 +1,6 @@
 cat << EOF
 /**
- * ${atomic}_fetch_add_unless - add unless the number is already a given value
+ * ${arch}${atomic}_fetch_add_unless - add unless the number is already a given value
  * @v: pointer of type ${atomic}_t
  * @a: the amount to add to v...
  * @u: ...unless v is equal to u.
@@ -9,14 +9,14 @@ cat << EOF
  * Returns original value of @v
  */
 static __always_inline ${int}
-${atomic}_fetch_add_unless(${atomic}_t *v, ${int} a, ${int} u)
+${arch}${atomic}_fetch_add_unless(${atomic}_t *v, ${int} a, ${int} u)
 {
-	${int} c = ${atomic}_read(v);
+	${int} c = ${arch}${atomic}_read(v);
 
 	do {
 		if (unlikely(c == u))
 			break;
-	} while (!${atomic}_try_cmpxchg(v, &c, c + a));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, c + a));
 
 	return c;
 }
--- a/scripts/atomic/fallbacks/inc
+++ b/scripts/atomic/fallbacks/inc
@@ -1,7 +1,7 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}inc${sfx}${order}(${atomic}_t *v)
+${arch}${atomic}_${pfx}inc${sfx}${order}(${atomic}_t *v)
 {
-	${retstmt}${atomic}_${pfx}add${sfx}${order}(1, v);
+	${retstmt}${arch}${atomic}_${pfx}add${sfx}${order}(1, v);
 }
 EOF
--- a/scripts/atomic/fallbacks/inc_and_test
+++ b/scripts/atomic/fallbacks/inc_and_test
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_inc_and_test - increment and test
+ * ${arch}${atomic}_inc_and_test - increment and test
  * @v: pointer of type ${atomic}_t
  *
  * Atomically increments @v by 1
@@ -8,8 +8,8 @@ cat <<EOF
  * other cases.
  */
 static __always_inline bool
-${atomic}_inc_and_test(${atomic}_t *v)
+${arch}${atomic}_inc_and_test(${atomic}_t *v)
 {
-	return ${atomic}_inc_return(v) == 0;
+	return ${arch}${atomic}_inc_return(v) == 0;
 }
 EOF
--- a/scripts/atomic/fallbacks/inc_not_zero
+++ b/scripts/atomic/fallbacks/inc_not_zero
@@ -1,14 +1,14 @@
 cat <<EOF
 /**
- * ${atomic}_inc_not_zero - increment unless the number is zero
+ * ${arch}${atomic}_inc_not_zero - increment unless the number is zero
  * @v: pointer of type ${atomic}_t
  *
  * Atomically increments @v by 1, if @v is non-zero.
  * Returns true if the increment was done.
  */
 static __always_inline bool
-${atomic}_inc_not_zero(${atomic}_t *v)
+${arch}${atomic}_inc_not_zero(${atomic}_t *v)
 {
-	return ${atomic}_add_unless(v, 1, 0);
+	return ${arch}${atomic}_add_unless(v, 1, 0);
 }
 EOF
--- a/scripts/atomic/fallbacks/inc_unless_negative
+++ b/scripts/atomic/fallbacks/inc_unless_negative
@@ -1,13 +1,13 @@
 cat <<EOF
 static __always_inline bool
-${atomic}_inc_unless_negative(${atomic}_t *v)
+${arch}${atomic}_inc_unless_negative(${atomic}_t *v)
 {
-	${int} c = ${atomic}_read(v);
+	${int} c = ${arch}${atomic}_read(v);
 
 	do {
 		if (unlikely(c < 0))
 			return false;
-	} while (!${atomic}_try_cmpxchg(v, &c, c + 1));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, c + 1));
 
 	return true;
 }
--- a/scripts/atomic/fallbacks/read_acquire
+++ b/scripts/atomic/fallbacks/read_acquire
@@ -1,6 +1,6 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_read_acquire(const ${atomic}_t *v)
+${arch}${atomic}_read_acquire(const ${atomic}_t *v)
 {
 	return smp_load_acquire(&(v)->counter);
 }
--- a/scripts/atomic/fallbacks/release
+++ b/scripts/atomic/fallbacks/release
@@ -1,8 +1,8 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}${name}${sfx}_release(${params})
+${arch}${atomic}_${pfx}${name}${sfx}_release(${params})
 {
 	__atomic_release_fence();
-	${retstmt}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
+	${retstmt}${arch}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
 }
 EOF
--- a/scripts/atomic/fallbacks/set_release
+++ b/scripts/atomic/fallbacks/set_release
@@ -1,6 +1,6 @@
 cat <<EOF
 static __always_inline void
-${atomic}_set_release(${atomic}_t *v, ${int} i)
+${arch}${atomic}_set_release(${atomic}_t *v, ${int} i)
 {
 	smp_store_release(&(v)->counter, i);
 }
--- a/scripts/atomic/fallbacks/sub_and_test
+++ b/scripts/atomic/fallbacks/sub_and_test
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_sub_and_test - subtract value from variable and test result
+ * ${arch}${atomic}_sub_and_test - subtract value from variable and test result
  * @i: integer value to subtract
  * @v: pointer of type ${atomic}_t
  *
@@ -9,8 +9,8 @@ cat <<EOF
  * other cases.
  */
 static __always_inline bool
-${atomic}_sub_and_test(${int} i, ${atomic}_t *v)
+${arch}${atomic}_sub_and_test(${int} i, ${atomic}_t *v)
 {
-	return ${atomic}_sub_return(i, v) == 0;
+	return ${arch}${atomic}_sub_return(i, v) == 0;
 }
 EOF
--- a/scripts/atomic/fallbacks/try_cmpxchg
+++ b/scripts/atomic/fallbacks/try_cmpxchg
@@ -1,9 +1,9 @@
 cat <<EOF
 static __always_inline bool
-${atomic}_try_cmpxchg${order}(${atomic}_t *v, ${int} *old, ${int} new)
+${arch}${atomic}_try_cmpxchg${order}(${atomic}_t *v, ${int} *old, ${int} new)
 {
 	${int} r, o = *old;
-	r = ${atomic}_cmpxchg${order}(v, o, new);
+	r = ${arch}${atomic}_cmpxchg${order}(v, o, new);
 	if (unlikely(r != o))
 		*old = r;
 	return likely(r == o);
--- a/scripts/atomic/gen-atomic-fallback.sh
+++ b/scripts/atomic/gen-atomic-fallback.sh
@@ -2,10 +2,11 @@
 # SPDX-License-Identifier: GPL-2.0
 
 ATOMICDIR=$(dirname $0)
+ARCH=$2
 
 . ${ATOMICDIR}/atomic-tbl.sh
 
-#gen_template_fallback(template, meta, pfx, name, sfx, order, atomic, int, args...)
+#gen_template_fallback(template, meta, pfx, name, sfx, order, arch, atomic, int, args...)
 gen_template_fallback()
 {
 	local template="$1"; shift
@@ -14,10 +15,11 @@ gen_template_fallback()
 	local name="$1"; shift
 	local sfx="$1"; shift
 	local order="$1"; shift
+	local arch="$1"; shift
 	local atomic="$1"; shift
 	local int="$1"; shift
 
-	local atomicname="${atomic}_${pfx}${name}${sfx}${order}"
+	local atomicname="${arch}${atomic}_${pfx}${name}${sfx}${order}"
 
 	local ret="$(gen_ret_type "${meta}" "${int}")"
 	local retstmt="$(gen_ret_stmt "${meta}")"
@@ -32,7 +34,7 @@ gen_template_fallback()
 	fi
 }
 
-#gen_proto_fallback(meta, pfx, name, sfx, order, atomic, int, args...)
+#gen_proto_fallback(meta, pfx, name, sfx, order, arch, atomic, int, args...)
 gen_proto_fallback()
 {
 	local meta="$1"; shift
@@ -56,16 +58,17 @@ cat << EOF
 EOF
 }
 
-#gen_proto_order_variants(meta, pfx, name, sfx, atomic, int, args...)
+#gen_proto_order_variants(meta, pfx, name, sfx, arch, atomic, int, args...)
 gen_proto_order_variants()
 {
 	local meta="$1"; shift
 	local pfx="$1"; shift
 	local name="$1"; shift
 	local sfx="$1"; shift
-	local atomic="$1"
+	local arch="$1"
+	local atomic="$2"
 
-	local basename="${atomic}_${pfx}${name}${sfx}"
+	local basename="${arch}${atomic}_${pfx}${name}${sfx}"
 
 	local template="$(find_fallback_template "${pfx}" "${name}" "${sfx}" "${order}")"
 
@@ -94,7 +97,7 @@ gen_proto_order_variants()
 	gen_basic_fallbacks "${basename}"
 
 	if [ ! -z "${template}" ]; then
-		printf "#endif /* ${atomic}_${pfx}${name}${sfx} */\n\n"
+		printf "#endif /* ${arch}${atomic}_${pfx}${name}${sfx} */\n\n"
 		gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "" "$@"
 		gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_acquire" "$@"
 		gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_release" "$@"
@@ -153,18 +156,15 @@ cat << EOF
 
 EOF
 
-for xchg in "xchg" "cmpxchg" "cmpxchg64"; do
+for xchg in "${ARCH}xchg" "${ARCH}cmpxchg" "${ARCH}cmpxchg64"; do
 	gen_xchg_fallbacks "${xchg}"
 done
 
 grep '^[a-z]' "$1" | while read name meta args; do
-	gen_proto "${meta}" "${name}" "atomic" "int" ${args}
+	gen_proto "${meta}" "${name}" "${ARCH}" "atomic" "int" ${args}
 done
 
 cat <<EOF
-#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif
@@ -172,12 +172,9 @@ cat <<EOF
 EOF
 
 grep '^[a-z]' "$1" | while read name meta args; do
-	gen_proto "${meta}" "${name}" "atomic64" "s64" ${args}
+	gen_proto "${meta}" "${name}" "${ARCH}" "atomic64" "s64" ${args}
 done
 
 cat <<EOF
-#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
 EOF
--- a/scripts/atomic/gen-atomics.sh
+++ b/scripts/atomic/gen-atomics.sh
@@ -10,10 +10,11 @@ LINUXDIR=${ATOMICDIR}/../..
 cat <<EOF |
 gen-atomic-instrumented.sh      asm-generic/atomic-instrumented.h
 gen-atomic-long.sh              asm-generic/atomic-long.h
+gen-atomic-fallback.sh          linux/atomic-arch-fallback.h		arch_
 gen-atomic-fallback.sh          linux/atomic-fallback.h
 EOF
-while read script header; do
-	/bin/sh ${ATOMICDIR}/${script} ${ATOMICTBL} > ${LINUXDIR}/include/${header}
+while read script header args; do
+	/bin/sh ${ATOMICDIR}/${script} ${ATOMICTBL} ${args} > ${LINUXDIR}/include/${header}
 	HASH="$(sha1sum ${LINUXDIR}/include/${header})"
 	HASH="${HASH%% *}"
 	printf "// %s\n" "${HASH}" >> ${LINUXDIR}/include/${header}


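For reference, gen-atomic-fallback.sh now takes the optional prefix as its
second argument, so the arch_ variant of the header could be regenerated by
hand roughly like this (a sketch; run from the source tree, and the table
path is assumed to match the ATOMICTBL default used by gen-atomics.sh):

	$ /bin/sh scripts/atomic/gen-atomic-fallback.sh \
		scripts/atomic/atomics.tbl arch_ > include/linux/atomic-arch-fallback.h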

* [patch V4 part 1 08/36] x86/doublefault: Remove memmove() call
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (6 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 13:47   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 09/36] x86/entry/64: Avoid pointless code when CONTEXT_TRACKING=n Thomas Gleixner
                   ` (28 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Borislav Petkov,
	Peter Zijlstra (Intel)

Use of memmove() in the #DF handler is problematic considering tracing and
other instrumentation.

Remove the memmove() call and simply write out what needs doing; this
even clarifies the code, win-win! The code copies from the espfix64
stack to the normal task stack; there is no possible way for those to
overlap.

Survives selftests/x86, specifically sigreturn_64.
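
For reference (not part of the patch), the selftest can be built and run
from the kernel tree in the usual kselftest way, assuming the standard
selftests layout:

	$ make -C tools/testing/selftests TARGETS=x86
	$ ./tools/testing/selftests/x86/sigreturn_64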

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200220121727.GB507@zn.tnic
---
 arch/x86/kernel/traps.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -278,6 +278,7 @@ dotraplinkage void do_double_fault(struc
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
 		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
+		unsigned long *p = (unsigned long *)regs->sp;
 
 		/*
 		 * regs->sp points to the failing IRET frame on the
@@ -285,7 +286,11 @@ dotraplinkage void do_double_fault(struc
 		 * in gpregs->ss through gpregs->ip.
 		 *
 		 */
-		memmove(&gpregs->ip, (void *)regs->sp, 5*8);
+		gpregs->ip	= p[0];
+		gpregs->cs	= p[1];
+		gpregs->flags	= p[2];
+		gpregs->sp	= p[3];
+		gpregs->ss	= p[4];
 		gpregs->orig_ax = 0;  /* Missing (lost) #GP error code */
 
 		/*



* [patch V4 part 1 09/36] x86/entry/64: Avoid pointless code when CONTEXT_TRACKING=n
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (7 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 08/36] x86/doublefault: Remove memmove() call Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 10/36] x86/entry: Remove the unused LOCKDEP_SYSEXIT cruft Thomas Gleixner
                   ` (27 subsequent siblings)
  36 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Thomas Gleixner <tglx@linutronix.de>

GAS cannot optimize out the test and conditional jump when context tracking
is disabled and CALL_enter_from_user_mode is an empty macro.

Wrap it in #ifdeffery. This will go away once all of this is moved to C.
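
For illustration (not part of the patch): with CONFIG_CONTEXT_TRACKING=n
the macro body is empty, but the assembler would still emit the now
pointless user mode check, roughly:

	testb	$3, CS(%rsp)		/* interrupted user mode? */
	jz	.Lfrom_kernel_no_context_tracking_\@
	/* CALL_enter_from_user_mode expands to nothing here */
.Lfrom_kernel_no_context_tracking_\@:

Wrapping the whole sequence in #ifdef CONFIG_CONTEXT_TRACKING removes the
test and the branch entirely.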

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>

---
 arch/x86/entry/entry_64.S |    2 ++
 1 file changed, 2 insertions(+)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -888,12 +888,14 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work
 	TRACE_IRQS_OFF
 	.endif
 
+#ifdef CONFIG_CONTEXT_TRACKING
 	.if \paranoid == 0
 	testb	$3, CS(%rsp)
 	jz	.Lfrom_kernel_no_context_tracking_\@
 	CALL_enter_from_user_mode
 .Lfrom_kernel_no_context_tracking_\@:
 	.endif
+#endif
 
 	movq	%rsp, %rdi			/* pt_regs pointer */
 



* [patch V4 part 1 10/36] x86/entry: Remove the unused LOCKDEP_SYSEXIT cruft
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (8 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 09/36] x86/entry/64: Avoid pointless code when CONTEXT_TRACKING=n Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 13:52   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
                   ` (26 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

No users left since commit 21d375b6b34f ("x86/entry/64: Remove the
SYSCALL64 fast path") two years ago.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/thunk_64.S       |    5 -----
 arch/x86/include/asm/irqflags.h |   24 ------------------------
 2 files changed, 29 deletions(-)

--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -42,10 +42,6 @@ SYM_FUNC_END(\name)
 	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
 #endif
 
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	THUNK lockdep_sys_exit_thunk,lockdep_sys_exit
-#endif
-
 #ifdef CONFIG_PREEMPTION
 	THUNK preempt_schedule_thunk, preempt_schedule
 	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
@@ -54,7 +50,6 @@ SYM_FUNC_END(\name)
 #endif
 
 #if defined(CONFIG_TRACE_IRQFLAGS) \
- || defined(CONFIG_DEBUG_LOCK_ALLOC) \
  || defined(CONFIG_PREEMPTION)
 SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
 	popq %r11
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -180,30 +180,6 @@ static inline int arch_irqs_disabled(voi
 #  define TRACE_IRQS_ON
 #  define TRACE_IRQS_OFF
 #endif
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-#  ifdef CONFIG_X86_64
-#    define LOCKDEP_SYS_EXIT		call lockdep_sys_exit_thunk
-#    define LOCKDEP_SYS_EXIT_IRQ \
-	TRACE_IRQS_ON; \
-	sti; \
-	call lockdep_sys_exit_thunk; \
-	cli; \
-	TRACE_IRQS_OFF;
-#  else
-#    define LOCKDEP_SYS_EXIT \
-	pushl %eax;				\
-	pushl %ecx;				\
-	pushl %edx;				\
-	call lockdep_sys_exit;			\
-	popl %edx;				\
-	popl %ecx;				\
-	popl %eax;
-#    define LOCKDEP_SYS_EXIT_IRQ
-#  endif
-#else
-#  define LOCKDEP_SYS_EXIT
-#  define LOCKDEP_SYS_EXIT_IRQ
-#endif
 #endif /* __ASSEMBLY__ */
 
 #endif



* [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (9 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 10/36] x86/entry: Remove the unused LOCKDEP_SYSEXIT cruft Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06  7:00   ` Paolo Bonzini
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
                   ` (25 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Andy Lutomirski <luto@kernel.org>

KVM overloads #PF to indicate two types of not-actually-page-fault
events.  Right now, the KVM guest code intercepts them by modifying
the IDT and hooking the #PF vector.  This makes the already fragile
fault code even harder to understand, and it also pollutes call
traces with async_page_fault and do_async_page_fault for normal page
faults.

Clean it up by moving the logic into do_page_fault() using a static
branch.  This gets rid of the platform trap_init override mechanism
completely.
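
The static branch means the check in do_page_fault() costs next to nothing
when async PF is not in use. A minimal generic sketch of the pattern, with
hypothetical names (the real code follows in the diff below):

	DEFINE_STATIC_KEY_FALSE(feature_enabled);

	static __always_inline bool maybe_handle(u32 token)
	{
		/* Compiles to a NOP branch until the key is switched on */
		if (static_branch_unlikely(&feature_enabled))
			return __really_handle(token);
		return false;
	}

	static void __init feature_init(void)
	{
		if (feature_detected())	/* hypothetical detection hook */
			static_branch_enable(&feature_enabled);
	}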

[ tglx: Fixed up 32bit, removed error code from the async functions and
  	massaged coding style ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/entry/entry_32.S       |    8 --------
 arch/x86/entry/entry_64.S       |    4 ----
 arch/x86/include/asm/kvm_para.h |   19 +++++++++++++++++--
 arch/x86/include/asm/x86_init.h |    2 --
 arch/x86/kernel/kvm.c           |   39 +++++++++++++++++++++------------------
 arch/x86/kernel/traps.c         |    2 --
 arch/x86/kernel/x86_init.c      |    1 -
 arch/x86/mm/fault.c             |   19 +++++++++++++++++++
 8 files changed, 57 insertions(+), 37 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1693,14 +1693,6 @@ SYM_CODE_START(general_protection)
 	jmp	common_exception
 SYM_CODE_END(general_protection)
 
-#ifdef CONFIG_KVM_GUEST
-SYM_CODE_START(async_page_fault)
-	ASM_CLAC
-	pushl	$do_async_page_fault
-	jmp	common_exception_read_cr2
-SYM_CODE_END(async_page_fault)
-#endif
-
 SYM_CODE_START(rewind_stack_do_exit)
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1204,10 +1204,6 @@ idtentry xendebug		do_debug		has_error_c
 idtentry general_protection	do_general_protection	has_error_code=1
 idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1
 
-#ifdef CONFIG_KVM_GUEST
-idtentry async_page_fault	do_async_page_fault	has_error_code=1	read_cr2=1
-#endif
-
 #ifdef CONFIG_X86_MCE
 idtentry machine_check		do_mce			has_error_code=0	paranoid=1
 #endif
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -91,8 +91,18 @@ unsigned int kvm_arch_para_hints(void);
 void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
-extern void kvm_disable_steal_time(void);
-void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
+void kvm_disable_steal_time(void);
+bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
+
+DECLARE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+
+static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
+{
+	if (static_branch_unlikely(&kvm_async_pf_enabled))
+		return __kvm_handle_async_pf(regs, token);
+	else
+		return false;
+}
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
@@ -130,6 +140,11 @@ static inline void kvm_disable_steal_tim
 {
 	return;
 }
+
+static inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
+{
+	return false;
+}
 #endif
 
 #endif /* _ASM_X86_KVM_PARA_H */
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -50,14 +50,12 @@ struct x86_init_resources {
  * @pre_vector_init:		init code to run before interrupt vectors
  *				are set up.
  * @intr_init:			interrupt init code
- * @trap_init:			platform specific trap setup
  * @intr_mode_select:		interrupt delivery mode selection
  * @intr_mode_init:		interrupt delivery mode setup
  */
 struct x86_init_irqs {
 	void (*pre_vector_init)(void);
 	void (*intr_init)(void);
-	void (*trap_init)(void);
 	void (*intr_mode_select)(void);
 	void (*intr_mode_init)(void);
 };
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -35,6 +35,8 @@
 #include <asm/tlb.h>
 #include <asm/cpuidle_haltpoll.h>
 
+DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+
 static int kvmapf = 1;
 
 static int __init parse_no_kvmapf(char *arg)
@@ -242,25 +244,27 @@ u32 kvm_read_and_reset_pf_reason(void)
 EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
 NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
 
-dotraplinkage void
-do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
+	/*
+	 * If we get a page fault right here, the pf_reason seems likely
+	 * to be clobbered.  Bummer.
+	 */
 	switch (kvm_read_and_reset_pf_reason()) {
 	default:
-		do_page_fault(regs, error_code, address);
-		break;
+		return false;
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
-		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
-		break;
+		kvm_async_pf_task_wait(token, !user_mode(regs));
+		return true;
 	case KVM_PV_REASON_PAGE_READY:
 		rcu_irq_enter();
-		kvm_async_pf_task_wake((u32)address);
+		kvm_async_pf_task_wake(token);
 		rcu_irq_exit();
-		break;
+		return true;
 	}
 }
-NOKPROBE_SYMBOL(do_async_page_fault);
+NOKPROBE_SYMBOL(__kvm_handle_async_pf);
 
 static void __init paravirt_ops_setup(void)
 {
@@ -306,7 +310,11 @@ static notrace void kvm_guest_apic_eoi_w
 static void kvm_guest_cpu_init(void)
 {
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
+		u64 pa;
+
+		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
+
+		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
 
 #ifdef CONFIG_PREEMPTION
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -592,12 +600,6 @@ static int kvm_cpu_down_prepare(unsigned
 }
 #endif
 
-static void __init kvm_apf_trap_init(void)
-{
-	update_intr_gate(X86_TRAP_PF, async_page_fault);
-}
-
-
 static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
@@ -632,8 +634,6 @@ static void __init kvm_guest_init(void)
 	register_reboot_notifier(&kvm_pv_reboot_nb);
 	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
 		raw_spin_lock_init(&async_pf_sleepers[i].lock);
-	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
-		x86_init.irqs.trap_init = kvm_apf_trap_init;
 
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
@@ -649,6 +649,9 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf)
+		static_branch_enable(&kvm_async_pf_enabled);
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -988,7 +988,5 @@ void __init trap_init(void)
 
 	idt_setup_ist_traps();
 
-	x86_init.irqs.trap_init();
-
 	idt_setup_debugidt_traps();
 }
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -79,7 +79,6 @@ struct x86_init_ops x86_init __initdata
 	.irqs = {
 		.pre_vector_init	= init_ISA_irqs,
 		.intr_init		= native_init_IRQ,
-		.trap_init		= x86_init_noop,
 		.intr_mode_select	= apic_intr_mode_select,
 		.intr_mode_init		= apic_intr_mode_init
 	},
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -30,6 +30,7 @@
 #include <asm/desc.h>			/* store_idt(), ...		*/
 #include <asm/cpu_entry_area.h>		/* exception stack		*/
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
+#include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1523,6 +1524,24 @@ do_page_fault(struct pt_regs *regs, unsi
 		unsigned long address)
 {
 	prefetchw(&current->mm->mmap_sem);
+	/*
+	 * KVM has two types of events that are, logically, interrupts, but
+	 * are unfortunately delivered using the #PF vector.  These events are
+	 * "you just accessed valid memory, but the host doesn't have it right
+	 * now, so I'll put you to sleep if you continue" and "that memory
+	 * you tried to access earlier is available now."
+	 *
+	 * We are relying on the interrupted context being sane (valid RSP,
+	 * relevant locks not held, etc.), which is fine as long as the
+	 * interrupted context had IF=1.  We are also relying on the KVM
+	 * async pf type field and CR2 being read consistently instead of
+	 * getting values from real and async page faults mixed up.
+	 *
+	 * Fingers crossed.
+	 */
+	if (kvm_handle_async_pf(regs, (u32)address))
+		return;
+
 	trace_page_fault_entries(regs, hw_error_code, address);
 
 	if (unlikely(kmmio_fault(regs, address)))



* [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (10 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-05 17:54   ` Paul E. McKenney
                     ` (3 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space Thomas Gleixner
                   ` (24 subsequent siblings)
  36 siblings, 4 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Thomas Gleixner <tglx@linutronix.de>

While working on the entry consolidation I stumbled over the KVM async page
fault handler and kvm_async_pf_task_wait() in particular. It took me a
while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
invocations are just cargo cult programming. Several patches "fixed" RCU
splats by curing the symptoms without noticing that the code is flawed 
from a design perspective.

The main problem is that this async injection is not based on a proper
handshake mechanism and only respects the minimal requirement, i.e. the
guest is not in a state where it has interrupts disabled.

Aside from that, the actual code is a convoluted one-size-fits-all Swiss army
knife. It is invoked from different places with different RCU constraints:

  1) Host side:

     vcpu_enter_guest()
       kvm_x86_ops->handle_exit()
         kvm_handle_page_fault()
           kvm_async_pf_task_wait()

     The invocation happens from fully preemptible context.

  2) Guest side:

     The async page fault interrupted:

         a) user space

	 b) preemptible kernel code which is not in a RCU read side
	    critical section

     	 c) non-preemptible kernel code or an RCU read side critical section
	    or kernel code with CONFIG_PREEMPTION=n, which makes it unnecessary
	    to differentiate between #2b and #2c.

RCU is watching for:

  #1  The vCPU exited and current is definitely not the idle task

  #2a The #PF entry code on the guest went through enter_from_user_mode()
      which reactivates RCU

  #2b There is no preemptible, interrupts enabled code in the kernel
      which can run with RCU looking away. (The idle task is always
      non-preemptible).

I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
voodoo at all.

In #2c RCU may not be watching, but as that state cannot schedule anyway
there is no point in worrying about it, so it has to invoke
rcu_irq_enter() before running that code. This can be optimized, but it
will be done as an extra step in the course of the entry code consolidation
work.

So the proper solution for this is to:

  - Split kvm_async_pf_task_wait() into schedule and halt based waiting
    interfaces which share the enqueueing code.

  - Add comments (a condensed form of this changelog) to spare others the
    wasted time and pain of reverse engineering all of this with the help of
    incomprehensible changelogs and code history.

  - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
    user mode and schedulable kernel side async page faults (#1, #2a, #2b)

  - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
    case (#2c).

    For this case also remove the rcu_irq_exit()/enter() pair around the
    halt as it is just a pointless exercise:

       - vCPUs can VMEXIT at any random point and can be scheduled out for
         an arbitrary amount of time by the host and this is not any
         different except that it voluntary triggers the exit via halt.

       - The interrupted context could have RCU watching already. So the
	 rcu_irq_exit() before the halt is not gaining anything aside from
	 confusing the reader. Claiming that this might prevent RCU stalls
	 is just an illusion.
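
Condensed, the resulting dispatch looks roughly like this (a sketch; the
real code is in the patch below):

	if (can_schedule) {
		/* #1, #2a, #2b: RCU is watching, wait via schedule() */
		kvm_async_pf_task_wait_schedule(token);
	} else {
		/* #2c: cannot schedule, RCU is possibly not watching */
		rcu_irq_enter();
		kvm_async_pf_task_wait_halt(token);
		rcu_irq_exit();
	}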

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Panic if async #PF is injected into an interrupt disabled region.
---
 arch/x86/include/asm/kvm_para.h |    4 
 arch/x86/kernel/kvm.c           |  201 ++++++++++++++++++++++++++++------------
 arch/x86/kvm/mmu/mmu.c          |    2 
 3 files changed, 144 insertions(+), 63 deletions(-)

--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsign
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 unsigned int kvm_arch_para_hints(void);
-void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
+void kvm_async_pf_task_wait_schedule(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 void kvm_disable_steal_time(void);
@@ -113,7 +113,7 @@ static inline void kvm_spinlock_init(voi
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
 #else /* CONFIG_KVM_GUEST */
-#define kvm_async_pf_task_wait(T, I) do {} while(0)
+#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
 
 static inline bool kvm_para_available(void)
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,7 +75,7 @@ struct kvm_task_sleep_node {
 	struct swait_queue_head wq;
 	u32 token;
 	int cpu;
-	bool halted;
+	bool use_halt;
 };
 
 static struct kvm_task_sleep_head {
@@ -98,75 +98,145 @@ static struct kvm_task_sleep_node *_find
 	return NULL;
 }
 
-/*
- * @interrupt_kernel: Is this called from a routine which interrupts the kernel
- * 		      (other than user space)?
- */
-void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
+static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
+				    struct kvm_task_sleep_node *n)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
-	struct kvm_task_sleep_node n, *e;
-	DECLARE_SWAITQUEUE(wait);
-
-	rcu_irq_enter();
+	struct kvm_task_sleep_node *e;
 
 	raw_spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
 	if (e) {
 		/* dummy entry exist -> wake up was delivered ahead of PF */
 		hlist_del(&e->link);
-		kfree(e);
 		raw_spin_unlock(&b->lock);
+		kfree(e);
+		return false;
+	}
 
-		rcu_irq_exit();
+	n->token = token;
+	n->cpu = smp_processor_id();
+	n->use_halt = use_halt;
+	init_swait_queue_head(&n->wq);
+	hlist_add_head(&n->link, &b->list);
+	raw_spin_unlock(&b->lock);
+	return true;
+}
+
+/*
+ * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
+ * @token:	Token to identify the sleep node entry
+ *
+ * Invoked from the async pagefault handling code or from the VM exit page
+ * fault handler. In both cases RCU is watching.
+ */
+void kvm_async_pf_task_wait_schedule(u32 token)
+{
+	struct kvm_task_sleep_node n;
+	DECLARE_SWAITQUEUE(wait);
+
+	lockdep_assert_irqs_disabled();
+
+	if (!kvm_async_pf_queue_task(token, false, &n))
 		return;
+
+	for (;;) {
+		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+
+		local_irq_enable();
+		schedule();
+		local_irq_disable();
 	}
+	finish_swait(&n.wq, &wait);
+}
+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
 
-	n.token = token;
-	n.cpu = smp_processor_id();
-	n.halted = is_idle_task(current) ||
-		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
-		    ? preempt_count() > 1 || rcu_preempt_depth()
-		    : interrupt_kernel);
-	init_swait_queue_head(&n.wq);
-	hlist_add_head(&n.link, &b->list);
-	raw_spin_unlock(&b->lock);
+/*
+ * Invoked from the async page fault handler.
+ */
+static void kvm_async_pf_task_wait_halt(u32 token)
+{
+	struct kvm_task_sleep_node n;
+
+	if (!kvm_async_pf_queue_task(token, true, &n))
+		return;
 
 	for (;;) {
-		if (!n.halted)
-			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
 		if (hlist_unhashed(&n.link))
 			break;
+		/*
+		 * No point in doing anything about RCU here. Any RCU read
+		 * side critical section or RCU watching section can be
+		 * interrupted by VMEXITs and the host is free to keep the
+		 * vCPU scheduled out as long as it sees fit. This is not
+		 * any different just because of the halt induced voluntary
+		 * VMEXIT.
+		 *
+		 * Also the async page fault could have interrupted any RCU
+		 * watching context, so invoking rcu_irq_exit()/enter()
+		 * around this is not gaining anything.
+		 */
+		native_safe_halt();
+		local_irq_disable();
+	}
+}
 
-		rcu_irq_exit();
+/* Invoked from the async page fault handler */
+static void kvm_async_pf_task_wait(u32 token, bool usermode)
+{
+	bool can_schedule;
 
-		if (!n.halted) {
-			local_irq_enable();
-			schedule();
-			local_irq_disable();
-		} else {
-			/*
-			 * We cannot reschedule. So halt.
-			 */
-			native_safe_halt();
-			local_irq_disable();
-		}
+	/*
+	 * No need to check whether interrupts were disabled because the
+	 * host will (hopefully) only inject an async page fault into
+	 * interrupt enabled regions.
+	 *
+	 * If CONFIG_PREEMPTION is enabled then check whether the code
+	 * which triggered the page fault is preemptible. This covers user
+	 * mode as well because preempt_count() is obviously 0 there.
+	 *
+	 * The check for rcu_preempt_depth() is also required because
+	 * voluntary scheduling inside a rcu read locked section is not
+	 * allowed.
+	 *
+	 * The idle task is already covered by this because idle always
+	 * has a preempt count > 0.
+	 *
+	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
+	 * coming from user mode as there is no indication whether the
+	 * context which triggered the page fault could schedule or not.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPTION))
+		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
+	else
+		can_schedule = usermode;
 
+	/*
+	 * If the kernel context is allowed to schedule then RCU is
+	 * watching because no preemptible code in the kernel is inside RCU
+	 * idle state. So it can be treated like user mode. User mode is
+	 * safe because the #PF entry invoked enter_from_user_mode().
+	 *
+	 * For the non schedulable case invoke rcu_irq_enter() for
+	 * now. This will be moved out to the pagefault entry code later
+	 * and only invoked when really needed.
+	 */
+	if (can_schedule) {
+		kvm_async_pf_task_wait_schedule(token);
+	} else {
 		rcu_irq_enter();
+		kvm_async_pf_task_wait_halt(token);
+		rcu_irq_exit();
 	}
-	if (!n.halted)
-		finish_swait(&n.wq, &wait);
-
-	rcu_irq_exit();
-	return;
 }
-EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
 
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
 	hlist_del_init(&n->link);
-	if (n->halted)
+	if (n->use_halt)
 		smp_send_reschedule(n->cpu);
 	else if (swq_has_sleeper(&n->wq))
 		swake_up_one(&n->wq);
@@ -177,12 +247,13 @@ static void apf_task_wake_all(void)
 	int i;
 
 	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
-		struct hlist_node *p, *next;
 		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+		struct kvm_task_sleep_node *n;
+		struct hlist_node *p, *next;
+
 		raw_spin_lock(&b->lock);
 		hlist_for_each_safe(p, next, &b->list) {
-			struct kvm_task_sleep_node *n =
-				hlist_entry(p, typeof(*n), link);
+			n = hlist_entry(p, typeof(*n), link);
 			if (n->cpu == smp_processor_id())
 				apf_task_wake_one(n);
 		}
@@ -223,8 +294,9 @@ void kvm_async_pf_task_wake(u32 token)
 		n->cpu = smp_processor_id();
 		init_swait_queue_head(&n->wq);
 		hlist_add_head(&n->link, &b->list);
-	} else
+	} else {
 		apf_task_wake_one(n);
+	}
 	raw_spin_unlock(&b->lock);
 	return;
 }
@@ -246,23 +318,33 @@ NOKPROBE_SYMBOL(kvm_read_and_reset_pf_re
 
 bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
-	/*
-	 * If we get a page fault right here, the pf_reason seems likely
-	 * to be clobbered.  Bummer.
-	 */
-	switch (kvm_read_and_reset_pf_reason()) {
+	u32 reason = kvm_read_and_reset_pf_reason();
+
+	switch (reason) {
+	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+	case KVM_PV_REASON_PAGE_READY:
+		break;
 	default:
 		return false;
-	case KVM_PV_REASON_PAGE_NOT_PRESENT:
+	}
+
+	/*
+	 * If the host managed to inject an async #PF into an interrupt
+	 * disabled region, then die hard as this is not going to end well
+	 * and the host side is seriously broken.
+	 */
+	if (unlikely(!(regs->flags & X86_EFLAGS_IF)))
+		panic("Host injected async #PF in interrupt disabled region\n");
+
+	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
 		/* page is swapped out by the host. */
-		kvm_async_pf_task_wait(token, !user_mode(regs));
-		return true;
-	case KVM_PV_REASON_PAGE_READY:
+		kvm_async_pf_task_wait(token, user_mode(regs));
+	} else {
 		rcu_irq_enter();
 		kvm_async_pf_task_wake(token);
 		rcu_irq_exit();
-		return true;
 	}
+	return true;
 }
 NOKPROBE_SYMBOL(__kvm_handle_async_pf);
 
@@ -326,12 +408,12 @@ static void kvm_guest_cpu_init(void)
 
 		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
 		__this_cpu_write(apf_reason.enabled, 1);
-		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
-		       smp_processor_id());
+		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
 		unsigned long pa;
+
 		/* Size alignment is implied but just to make it explicit. */
 		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
 		__this_cpu_write(kvm_apic_eoi, 0);
@@ -352,8 +434,7 @@ static void kvm_pv_disable_apf(void)
 	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
 	__this_cpu_write(apf_reason.enabled, 0);
 
-	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
-	       smp_processor_id());
+	pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
 }
 
 static void kvm_pv_guest_cpu_reboot(void *unused)
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4198,7 +4198,7 @@ int kvm_handle_page_fault(struct kvm_vcp
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		vcpu->arch.apf.host_apf_reason = 0;
 		local_irq_disable();
-		kvm_async_pf_task_wait(fault_address, 0);
+		kvm_async_pf_task_wait_schedule(fault_address);
 		local_irq_enable();
 		break;
 	case KVM_PV_REASON_PAGE_READY:



* [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (11 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06  7:00   ` Paolo Bonzini
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic() Thomas Gleixner
                   ` (23 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

The async page fault injection into kernel space creates more problems than
it solves. The host has absolutely no knowledge about the state of the
guest if the fault happens in CPL0. The only restriction for the host is
interrupt disabled state. If interrupts are enabled in the guest then the
exception can hit arbitrary code. The HALT based wait in non-preemptible
code is a hacky replacement for a proper hypercall.

For the ongoing work to restrict instrumentation and make the RCU idle
interaction well defined, the extra work required to support async
page faults in CPL0 is just not justified and creates complexity for a
dubious benefit.

The CPL3 injection is well defined and does not cause any issues as it is
more or less the same as a regular page fault from CPL3.
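
Condensed, the guest side handling after this change boils down to (a
sketch; the real code is in the diff below):

	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
		if (!user_mode(regs))
			panic("Host injected async #PF in kernel mode\n");
		/* CPL3: behaves like a regular page fault, just wait */
		kvm_async_pf_task_wait_schedule(token);
	}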

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/kvm.c |  100 +++-----------------------------------------------
 1 file changed, 7 insertions(+), 93 deletions(-)

--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,7 +75,6 @@ struct kvm_task_sleep_node {
 	struct swait_queue_head wq;
 	u32 token;
 	int cpu;
-	bool use_halt;
 };
 
 static struct kvm_task_sleep_head {
@@ -98,8 +97,7 @@ static struct kvm_task_sleep_node *_find
 	return NULL;
 }
 
-static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
-				    struct kvm_task_sleep_node *n)
+static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
@@ -117,7 +115,6 @@ static bool kvm_async_pf_queue_task(u32
 
 	n->token = token;
 	n->cpu = smp_processor_id();
-	n->use_halt = use_halt;
 	init_swait_queue_head(&n->wq);
 	hlist_add_head(&n->link, &b->list);
 	raw_spin_unlock(&b->lock);
@@ -138,7 +135,7 @@ void kvm_async_pf_task_wait_schedule(u32
 
 	lockdep_assert_irqs_disabled();
 
-	if (!kvm_async_pf_queue_task(token, false, &n))
+	if (!kvm_async_pf_queue_task(token, &n))
 		return;
 
 	for (;;) {
@@ -154,91 +151,10 @@ void kvm_async_pf_task_wait_schedule(u32
 }
 EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
 
-/*
- * Invoked from the async page fault handler.
- */
-static void kvm_async_pf_task_wait_halt(u32 token)
-{
-	struct kvm_task_sleep_node n;
-
-	if (!kvm_async_pf_queue_task(token, true, &n))
-		return;
-
-	for (;;) {
-		if (hlist_unhashed(&n.link))
-			break;
-		/*
-		 * No point in doing anything about RCU here. Any RCU read
-		 * side critical section or RCU watching section can be
-		 * interrupted by VMEXITs and the host is free to keep the
-		 * vCPU scheduled out as long as it sees fit. This is not
-		 * any different just because of the halt induced voluntary
-		 * VMEXIT.
-		 *
-		 * Also the async page fault could have interrupted any RCU
-		 * watching context, so invoking rcu_irq_exit()/enter()
-		 * around this is not gaining anything.
-		 */
-		native_safe_halt();
-		local_irq_disable();
-	}
-}
-
-/* Invoked from the async page fault handler */
-static void kvm_async_pf_task_wait(u32 token, bool usermode)
-{
-	bool can_schedule;
-
-	/*
-	 * No need to check whether interrupts were disabled because the
-	 * host will (hopefully) only inject an async page fault into
-	 * interrupt enabled regions.
-	 *
-	 * If CONFIG_PREEMPTION is enabled then check whether the code
-	 * which triggered the page fault is preemptible. This covers user
-	 * mode as well because preempt_count() is obviously 0 there.
-	 *
-	 * The check for rcu_preempt_depth() is also required because
-	 * voluntary scheduling inside a rcu read locked section is not
-	 * allowed.
-	 *
-	 * The idle task is already covered by this because idle always
-	 * has a preempt count > 0.
-	 *
-	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
-	 * coming from user mode as there is no indication whether the
-	 * context which triggered the page fault could schedule or not.
-	 */
-	if (IS_ENABLED(CONFIG_PREEMPTION))
-		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
-	else
-		can_schedule = usermode;
-
-	/*
-	 * If the kernel context is allowed to schedule then RCU is
-	 * watching because no preemptible code in the kernel is inside RCU
-	 * idle state. So it can be treated like user mode. User mode is
-	 * safe because the #PF entry invoked enter_from_user_mode().
-	 *
-	 * For the non schedulable case invoke rcu_irq_enter() for
-	 * now. This will be moved out to the pagefault entry code later
-	 * and only invoked when really needed.
-	 */
-	if (can_schedule) {
-		kvm_async_pf_task_wait_schedule(token);
-	} else {
-		rcu_irq_enter();
-		kvm_async_pf_task_wait_halt(token);
-		rcu_irq_exit();
-	}
-}
-
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
 	hlist_del_init(&n->link);
-	if (n->use_halt)
-		smp_send_reschedule(n->cpu);
-	else if (swq_has_sleeper(&n->wq))
+	if (swq_has_sleeper(&n->wq))
 		swake_up_one(&n->wq);
 }
 
@@ -337,8 +253,10 @@ bool __kvm_handle_async_pf(struct pt_reg
 		panic("Host injected async #PF in interrupt disabled region\n");
 
 	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
-		/* page is swapped out by the host. */
-		kvm_async_pf_task_wait(token, user_mode(regs));
+		if (unlikely(!(user_mode(regs))))
+			panic("Host injected async #PF in kernel mode\n");
+		/* Page is swapped out by the host. */
+		kvm_async_pf_task_wait_schedule(token);
 	} else {
 		rcu_irq_enter();
 		kvm_async_pf_task_wake(token);
@@ -397,10 +315,6 @@ static void kvm_guest_cpu_init(void)
 		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
 
 		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
-
-#ifdef CONFIG_PREEMPTION
-		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
-#endif
 		pa |= KVM_ASYNC_PF_ENABLED;
 
 		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))



* [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (12 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 15:34   ` Alexandre Chartre
                     ` (3 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 15/36] kprobes: Lock kprobe_mutex while showing kprobe_blacklist Thomas Gleixner
                   ` (22 subsequent siblings)
  36 siblings, 4 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

This is completely overengineered and definitely not an interface which
should be made available to anything other than this particular MCE case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/traps.h   |    2 --
 arch/x86/kernel/cpu/mce/core.c |    6 ++++--
 arch/x86/kernel/traps.c        |   37 -------------------------------------
 3 files changed, 4 insertions(+), 41 deletions(-)

--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -120,8 +120,6 @@ asmlinkage void smp_irq_move_cleanup_int
 
 extern void ist_enter(struct pt_regs *regs);
 extern void ist_exit(struct pt_regs *regs);
-extern void ist_begin_non_atomic(struct pt_regs *regs);
-extern void ist_end_non_atomic(void);
 
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(const char *message,
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1352,13 +1352,15 @@ void notrace do_machine_check(struct pt_
 
 	/* Fault was in user mode and we need to take some action */
 	if ((m.cs & 3) == 3) {
-		ist_begin_non_atomic(regs);
+		/* If this triggers there is no way to recover. Die hard. */
+		BUG_ON(!on_thread_stack() || !user_mode(regs));
 		local_irq_enable();
+		preempt_enable();
 
 		if (kill_it || do_memory_failure(&m))
 			force_sig(SIGBUS);
+		preempt_disable();
 		local_irq_disable();
-		ist_end_non_atomic();
 	} else {
 		if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
 			mce_panic("Failed kernel mode recovery", &m, msg);
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -117,43 +117,6 @@ void ist_exit(struct pt_regs *regs)
 		rcu_nmi_exit();
 }
 
-/**
- * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
- * @regs:	regs passed to the IST exception handler
- *
- * IST exception handlers normally cannot schedule.  As a special
- * exception, if the exception interrupted userspace code (i.e.
- * user_mode(regs) would return true) and the exception was not
- * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
- * begins a non-atomic section within an ist_enter()/ist_exit() region.
- * Callers are responsible for enabling interrupts themselves inside
- * the non-atomic section, and callers must call ist_end_non_atomic()
- * before ist_exit().
- */
-void ist_begin_non_atomic(struct pt_regs *regs)
-{
-	BUG_ON(!user_mode(regs));
-
-	/*
-	 * Sanity check: we need to be on the normal thread stack.  This
-	 * will catch asm bugs and any attempt to use ist_preempt_enable
-	 * from double_fault.
-	 */
-	BUG_ON(!on_thread_stack());
-
-	preempt_enable_no_resched();
-}
-
-/**
- * ist_end_non_atomic() - begin a non-atomic section in an IST exception
- *
- * Ends a non-atomic section started with ist_begin_non_atomic().
- */
-void ist_end_non_atomic(void)
-{
-	preempt_disable();
-}
-
 int is_valid_bugaddr(unsigned long addr)
 {
 	unsigned short ud;



* [patch V4 part 1 15/36] kprobes: Lock kprobe_mutex while showing kprobe_blacklist
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (13 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 15:38   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  2020-05-05 13:16 ` [patch V4 part 1 16/36] kprobes: Support __kprobes blacklist in modules Thomas Gleixner
                   ` (21 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Masami Hiramatsu <mhiramat@kernel.org>

Lock kprobe_mutex while showing kprobe_blacklist to prevent the
kprobe_blacklist from being updated concurrently.
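
For reference, the file whose traversal is being serialized here is the
kprobes blacklist in debugfs; assuming debugfs is mounted at the usual
location it can be read with:

	# cat /sys/kernel/debug/kprobes/blacklist

Each line shows a blacklisted address range and the symbol it belongs to.
With kprobe_mutex held from ->start() to ->stop(), the list can no longer
change underneath a reader in the middle of a traversal.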

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/158523417665.24735.10253198878535635600.stgit@devnote2

---
 kernel/kprobes.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2426,6 +2426,7 @@ static const struct file_operations debu
 /* kprobes/blacklist -- shows which functions can not be probed */
 static void *kprobe_blacklist_seq_start(struct seq_file *m, loff_t *pos)
 {
+	mutex_lock(&kprobe_mutex);
 	return seq_list_start(&kprobe_blacklist, *pos);
 }
 
@@ -2452,10 +2453,15 @@ static int kprobe_blacklist_seq_show(str
 	return 0;
 }
 
+static void kprobe_blacklist_seq_stop(struct seq_file *f, void *v)
+{
+	mutex_unlock(&kprobe_mutex);
+}
+
 static const struct seq_operations kprobe_blacklist_seq_ops = {
 	.start = kprobe_blacklist_seq_start,
 	.next  = kprobe_blacklist_seq_next,
-	.stop  = kprobe_seq_stop,	/* Reuse void function */
+	.stop  = kprobe_blacklist_seq_stop,
 	.show  = kprobe_blacklist_seq_show,
 };
 



* [patch V4 part 1 16/36] kprobes: Support __kprobes blacklist in modules
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (14 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 15/36] kprobes: Lock kprobe_mutex while showing kprobe_blacklist Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 15:47   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  2020-05-05 13:16 ` [patch V4 part 1 17/36] kprobes: Support NOKPROBE_SYMBOL() " Thomas Gleixner
                   ` (20 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Masami Hiramatsu <mhiramat@kernel.org>

Support the __kprobes attribute for blacklisting functions in modules.
Functions marked with the __kprobes attribute are placed in the module's
.kprobes.text section.
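
For illustration (hypothetical module code, not part of the patch), a
function marked __kprobes in a module ends up in the module's
.kprobes.text section and is therefore covered by the blacklist range
added below when the module is loaded:

	#include <linux/kprobes.h>

	/* Must never be probed; placed in .kprobes.text by the attribute */
	static int __kprobes my_fragile_helper(int val)
	{
		return val + 1;
	}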

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/158523418821.24735.15379873028352411526.stgit@devnote2

---
 include/linux/module.h |    4 ++++
 kernel/kprobes.c       |   42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/module.c        |    4 ++++
 3 files changed, 50 insertions(+)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -489,6 +489,10 @@ struct module {
 	unsigned int num_ftrace_callsites;
 	unsigned long *ftrace_callsites;
 #endif
+#ifdef CONFIG_KPROBES
+	void *kprobes_text_start;
+	unsigned int kprobes_text_size;
+#endif
 
 #ifdef CONFIG_LIVEPATCH
 	bool klp; /* Is this a livepatch module? */
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2179,6 +2179,19 @@ int kprobe_add_area_blacklist(unsigned l
 	return 0;
 }
 
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+	struct kprobe_blacklist_entry *ent, *n;
+
+	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+		if (ent->start_addr < start || ent->start_addr >= end)
+			continue;
+		list_del(&ent->list);
+		kfree(ent);
+	}
+}
+
 int __init __weak arch_populate_kprobe_blacklist(void)
 {
 	return 0;
@@ -2215,6 +2228,28 @@ static int __init populate_kprobe_blackl
 	return ret ? : arch_populate_kprobe_blacklist();
 }
 
+static void add_module_kprobe_blacklist(struct module *mod)
+{
+	unsigned long start, end;
+
+	start = (unsigned long)mod->kprobes_text_start;
+	if (start) {
+		end = start + mod->kprobes_text_size;
+		kprobe_add_area_blacklist(start, end);
+	}
+}
+
+static void remove_module_kprobe_blacklist(struct module *mod)
+{
+	unsigned long start, end;
+
+	start = (unsigned long)mod->kprobes_text_start;
+	if (start) {
+		end = start + mod->kprobes_text_size;
+		kprobe_remove_area_blacklist(start, end);
+	}
+}
+
 /* Module notifier call back, checking kprobes on the module */
 static int kprobes_module_callback(struct notifier_block *nb,
 				   unsigned long val, void *data)
@@ -2225,6 +2260,11 @@ static int kprobes_module_callback(struc
 	unsigned int i;
 	int checkcore = (val == MODULE_STATE_GOING);
 
+	if (val == MODULE_STATE_COMING) {
+		mutex_lock(&kprobe_mutex);
+		add_module_kprobe_blacklist(mod);
+		mutex_unlock(&kprobe_mutex);
+	}
 	if (val != MODULE_STATE_GOING && val != MODULE_STATE_LIVE)
 		return NOTIFY_DONE;
 
@@ -2255,6 +2295,8 @@ static int kprobes_module_callback(struc
 				kill_kprobe(p);
 			}
 	}
+	if (val == MODULE_STATE_GOING)
+		remove_module_kprobe_blacklist(mod);
 	mutex_unlock(&kprobe_mutex);
 	return NOTIFY_DONE;
 }
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3194,6 +3194,10 @@ static int find_module_sections(struct m
 					    sizeof(*mod->ei_funcs),
 					    &mod->num_ei_funcs);
 #endif
+#ifdef CONFIG_KPROBES
+	mod->kprobes_text_start = section_objs(info, ".kprobes.text", 1,
+						&mod->kprobes_text_size);
+#endif
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
 



* [patch V4 part 1 17/36] kprobes: Support NOKPROBE_SYMBOL() in modules
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (15 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 16/36] kprobes: Support __kprobes blacklist in modules Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 15:54   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  2020-05-05 13:16 ` [patch V4 part 1 18/36] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers Thomas Gleixner
                   ` (19 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Masami Hiramatsu <mhiramat@kernel.org>

Support NOKPROBE_SYMBOL() in modules. NOKPROBE_SYMBOL() records only the
symbol address in the "_kprobe_blacklist" section of the module.
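
For illustration (hypothetical module code, not part of the patch), the
module-side usage looks like this; the macro emits the function's address
into the module's "_kprobe_blacklist" section, which find_module_sections()
below picks up:

	#include <linux/kprobes.h>

	static int my_fragile_helper(int val)
	{
		return val + 1;
	}
	NOKPROBE_SYMBOL(my_fragile_helper);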

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/158523419989.24735.6665260504057165207.stgit@devnote2

---
 include/linux/module.h |    2 ++
 kernel/kprobes.c       |   17 +++++++++++++++++
 kernel/module.c        |    3 +++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/module.h b/include/linux/module.h
index 369c354f9207..1192097c9a69 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -492,6 +492,8 @@ struct module {
 #ifdef CONFIG_KPROBES
 	void *kprobes_text_start;
 	unsigned int kprobes_text_size;
+	unsigned long *kprobe_blacklist;
+	unsigned int num_kprobe_blacklist;
 #endif
 
 #ifdef CONFIG_LIVEPATCH
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index b7549992b9bd..9eb5acf0a9f3 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2192,6 +2192,11 @@ static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
 	}
 }
 
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+	kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 int __init __weak arch_populate_kprobe_blacklist(void)
 {
 	return 0;
@@ -2231,6 +2236,12 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 static void add_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
+	int i;
+
+	if (mod->kprobe_blacklist) {
+		for (i = 0; i < mod->num_kprobe_blacklist; i++)
+			kprobe_add_ksym_blacklist(mod->kprobe_blacklist[i]);
+	}
 
 	start = (unsigned long)mod->kprobes_text_start;
 	if (start) {
@@ -2242,6 +2253,12 @@ static void add_module_kprobe_blacklist(struct module *mod)
 static void remove_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
+	int i;
+
+	if (mod->kprobe_blacklist) {
+		for (i = 0; i < mod->num_kprobe_blacklist; i++)
+			kprobe_remove_ksym_blacklist(mod->kprobe_blacklist[i]);
+	}
 
 	start = (unsigned long)mod->kprobes_text_start;
 	if (start) {
diff --git a/kernel/module.c b/kernel/module.c
index f7712a923d63..7be011dacd8a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3197,6 +3197,9 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 #ifdef CONFIG_KPROBES
 	mod->kprobes_text_start = section_objs(info, ".kprobes.text", 1,
 						&mod->kprobes_text_size);
+	mod->kprobe_blacklist = section_objs(info, "_kprobe_blacklist",
+						sizeof(unsigned long),
+						&mod->num_kprobe_blacklist);
 #endif
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);



^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 18/36] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers.
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (16 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 17/36] kprobes: Support NOKPROBE_SYMBOL() " Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 15:57   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  2020-05-05 13:16 ` [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing Thomas Gleixner
                   ` (18 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

From: Masami Hiramatsu <mhiramat@kernel.org>

Add __kprobes and NOKPROBE_SYMBOL() for sample kprobe handlers.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/158523421177.24735.16273975317343670204.stgit@devnote2

---
 samples/kprobes/kprobe_example.c    |    6 ++++--
 samples/kprobes/kretprobe_example.c |    2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

--- a/samples/kprobes/kprobe_example.c
+++ b/samples/kprobes/kprobe_example.c
@@ -25,7 +25,7 @@ static struct kprobe kp = {
 };
 
 /* kprobe pre_handler: called just before the probed instruction is executed */
-static int handler_pre(struct kprobe *p, struct pt_regs *regs)
+static int __kprobes handler_pre(struct kprobe *p, struct pt_regs *regs)
 {
 #ifdef CONFIG_X86
 	pr_info("<%s> pre_handler: p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
@@ -54,7 +54,7 @@ static int handler_pre(struct kprobe *p,
 }
 
 /* kprobe post_handler: called after the probed instruction is executed */
-static void handler_post(struct kprobe *p, struct pt_regs *regs,
+static void __kprobes handler_post(struct kprobe *p, struct pt_regs *regs,
 				unsigned long flags)
 {
 #ifdef CONFIG_X86
@@ -90,6 +90,8 @@ static int handler_fault(struct kprobe *
 	/* Return 0 because we don't handle the fault. */
 	return 0;
 }
+/* NOKPROBE_SYMBOL() is also available */
+NOKPROBE_SYMBOL(handler_fault);
 
 static int __init kprobe_init(void)
 {
--- a/samples/kprobes/kretprobe_example.c
+++ b/samples/kprobes/kretprobe_example.c
@@ -48,6 +48,7 @@ static int entry_handler(struct kretprob
 	data->entry_stamp = ktime_get();
 	return 0;
 }
+NOKPROBE_SYMBOL(entry_handler);
 
 /*
  * Return-probe handler: Log the return value and duration. Duration may turn
@@ -67,6 +68,7 @@ static int ret_handler(struct kretprobe_
 			func_name, retval, (long long)delta);
 	return 0;
 }
+NOKPROBE_SYMBOL(ret_handler);
 
 static struct kretprobe my_kretprobe = {
 	.handler		= ret_handler,


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (17 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 18/36] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-05 20:39   ` Brian Gerst
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation Thomas Gleixner
                   ` (17 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

The sanitizers are not really applicable to the fragile low level entry
code. Entry code needs to carefully set up a normal 'runtime'
environment.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/Makefile |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -3,6 +3,14 @@
 # Makefile for the x86 low level entry code
 #
 
+KASAN_SANITIZE := n
+UBSAN_SANITIZE := n
+KCOV_INSTRUMENT := n
+
+CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
+CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
+CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
+
 OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
 
 CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (18 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-06 16:08   ` Sean Christopherson
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 21/36] kprobes: Prevent probes in .noinstr.text section Thomas Gleixner
                   ` (16 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Some code paths, especially the low level entry code, must be protected
against instrumentation for various reasons:

 - Low level entry code can be a fragile beast, especially on x86.

 - With NO_HZ_FULL RCU state needs to be established before using it.

Having a dedicated section for such code allows tooling to validate that
no unsafe functions are invoked.

Add the .noinstr.text section and the noinstr attribute to mark
functions. noinstr implies notrace. Kprobes will gain a section check
later.

Also provide a set of markers: instr_begin()/end().

These are used to mark code inside a noinstr function which calls
into regular instrumentable text section as safe.
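As an illustrative sketch (not taken from this patch), a protected function
would then look roughly like this; the function and the instrumentable helper
are invented for the example:

extern void example_trace_entry(void);	/* invented instrumentable helper */

/* Hypothetical entry helper placed in .noinstr.text. */
noinstr void example_enter_from_user(void)
{
	/* Fragile part: no tracing, no probing, RCU state not established. */

	instr_begin();
	/*
	 * Calls into regular instrumentable text inside this region are
	 * considered safe, so objtool will not flag them.
	 */
	example_trace_entry();
	instr_end();

	/* Back to the fragile part. */
}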

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Drop noinstr_call_begin()/end()
---
 include/asm-generic/sections.h    |    3 +++
 include/asm-generic/vmlinux.lds.h |    4 ++++
 include/linux/compiler.h          |   17 +++++++++++++++++
 include/linux/compiler_types.h    |    4 ++++
 scripts/mod/modpost.c             |    2 +-
 5 files changed, 29 insertions(+), 1 deletion(-)

--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -53,6 +53,9 @@ extern char __ctors_start[], __ctors_end
 /* Start and end of .opd section - used for function descriptors. */
 extern char __start_opd[], __end_opd[];
 
+/* Start and end of instrumentation protected text section */
+extern char __noinstr_text_start[], __noinstr_text_end[];
+
 extern __visible const void __nosave_begin, __nosave_end;
 
 /* Function descriptor handling (if any).  Override in asm/sections.h */
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -551,6 +551,10 @@
 #define TEXT_TEXT							\
 		ALIGN_FUNCTION();					\
 		*(.text.hot TEXT_MAIN .text.fixup .text.unlikely)	\
+		ALIGN_FUNCTION();					\
+		__noinstr_text_start = .;				\
+		*(.noinstr.text)					\
+		__noinstr_text_end = .;					\
 		*(.text..refcount)					\
 		*(.ref.text)						\
 	MEM_KEEP(init.text*)						\
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -120,10 +120,27 @@ void ftrace_likely_update(struct ftrace_
 /* Annotate a C jump table to allow objtool to follow the code flow */
 #define __annotate_jump_table __section(.rodata..c_jump_table)
 
+/* Begin/end of an instrumentation safe region */
+#define instr_begin() ({						\
+	asm volatile("%c0:\n\t"						\
+		     ".pushsection .discard.instr_begin\n\t"		\
+		     ".long %c0b - .\n\t"				\
+		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+})
+
+#define instr_end() ({							\
+	asm volatile("%c0:\n\t"						\
+		     ".pushsection .discard.instr_end\n\t"		\
+		     ".long %c0b - .\n\t"				\
+		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+})
+
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
 #define __annotate_jump_table
+#define instr_begin()		do { } while(0)
+#define instr_end()		do { } while(0)
 #endif
 
 #ifndef ASM_UNREACHABLE
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -118,6 +118,10 @@ struct ftrace_likely_data {
 #define notrace			__attribute__((__no_instrument_function__))
 #endif
 
+/* Section for code which can't be instrumented at all */
+#define noinstr								\
+	noinline notrace __attribute((__section__(".noinstr.text")))
+
 /*
  * it doesn't make sense on ARM (currently the only user of __naked)
  * to trace naked functions because then mcount is called without
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -948,7 +948,7 @@ static void check_section(const char *mo
 
 #define DATA_SECTIONS ".data", ".data.rel"
 #define TEXT_SECTIONS ".text", ".text.unlikely", ".sched.text", \
-		".kprobes.text", ".cpuidle.text"
+		".kprobes.text", ".cpuidle.text", ".noinstr.text"
 #define OTHER_TEXT_SECTIONS ".ref.text", ".head.text", ".spinlock.text", \
 		".fixup", ".entry.text", ".exception.text", ".text.*", \
 		".coldtext"


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 21/36] kprobes: Prevent probes in .noinstr.text section
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (19 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-08  6:30   ` Masami Hiramatsu
  2020-05-19 19:52   ` [tip: core/kprobes] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 22/36] tracing: Provide lockdep less trace_hardirqs_on/off() variants Thomas Gleixner
                   ` (15 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Instrumentation is forbidden in the .noinstr.text section. Make kprobes
respect this.
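For illustration (not part of the patch): once a symbol lives in
.noinstr.text, trying to attach a probe to it is rejected by the blacklist
check. The module boilerplate below is invented and assumes rcu_nmi_enter()
has already been made noinstr by a later patch in this series:

#include <linux/module.h>
#include <linux/kprobes.h>

static struct kprobe example_kp = {
	.symbol_name = "rcu_nmi_enter",
};

static int __init example_init(void)
{
	/* Fails with -EINVAL because the address is blacklisted now. */
	return register_kprobe(&example_kp);
}
module_init(example_init);
MODULE_LICENSE("GPL");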

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/module.h |    2 ++
 kernel/kprobes.c       |   18 ++++++++++++++++++
 kernel/module.c        |    3 +++
 3 files changed, 23 insertions(+)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -458,6 +458,8 @@ struct module {
 	void __percpu *percpu;
 	unsigned int percpu_size;
 #endif
+	void *noinstr_text_start;
+	unsigned int noinstr_text_size;
 
 #ifdef CONFIG_TRACEPOINTS
 	unsigned int num_tracepoints;
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2229,6 +2229,12 @@ static int __init populate_kprobe_blackl
 	/* Symbols in __kprobes_text are blacklisted */
 	ret = kprobe_add_area_blacklist((unsigned long)__kprobes_text_start,
 					(unsigned long)__kprobes_text_end);
+	if (ret)
+		return ret;
+
+	/* Symbols in noinstr section are blacklisted */
+	ret = kprobe_add_area_blacklist((unsigned long)__noinstr_text_start,
+					(unsigned long)__noinstr_text_end);
 
 	return ret ? : arch_populate_kprobe_blacklist();
 }
@@ -2248,6 +2254,12 @@ static void add_module_kprobe_blacklist(
 		end = start + mod->kprobes_text_size;
 		kprobe_add_area_blacklist(start, end);
 	}
+
+	start = (unsigned long)mod->noinstr_text_start;
+	if (start) {
+		end = start + mod->noinstr_text_size;
+		kprobe_add_area_blacklist(start, end);
+	}
 }
 
 static void remove_module_kprobe_blacklist(struct module *mod)
@@ -2265,6 +2277,12 @@ static void remove_module_kprobe_blackli
 		end = start + mod->kprobes_text_size;
 		kprobe_remove_area_blacklist(start, end);
 	}
+
+	start = (unsigned long)mod->noinstr_text_start;
+	if (start) {
+		end = start + mod->noinstr_text_size;
+		kprobe_remove_area_blacklist(start, end);
+	}
 }
 
 /* Module notifier call back, checking kprobes on the module */
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3150,6 +3150,9 @@ static int find_module_sections(struct m
 	}
 #endif
 
+	mod->noinstr_text_start = section_objs(info, ".noinstr.text", 1,
+						&mod->noinstr_text_size);
+
 #ifdef CONFIG_TRACEPOINTS
 	mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
 					     sizeof(*mod->tracepoints_ptrs),


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 22/36] tracing: Provide lockdep less trace_hardirqs_on/off() variants
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (20 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 21/36] kprobes: Prevent probes in .noinstr.text section Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-07 17:55   ` Andy Lutomirski
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 23/36] bug: Annotate WARN/BUG/stackfail as noinstr safe Thomas Gleixner
                   ` (14 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer
core itself is safe, but the resulting tracepoints can be utilized by,
e.g., BPF, which is unsafe.

Provide variants which do not contain the lockdep invocation so the lockdep
and tracer invocations can be split at the call site and placed properly.

The new variants also do not use rcuidle as they are going to be called
from entry code after/before context tracking.

Name them so they match the lockdep counterparts.
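As a rough sketch of the intended use (the surrounding helper is invented and
the final placement is only settled later in the series):

static void example_irqs_on_tracking(void)
{
	/* Tracer/tracepoint part, placed while RCU is still watching. */
	trace_hardirqs_on_prepare();

	/* ... context tracking / RCU transition happens in between ... */

	/* Lockdep part, placed separately by the call site. */
	lockdep_hardirqs_on(CALLER_ADDR0);
}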

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3: Renamed to trace_hardirqs_on/off_prepare().
V2: New patch
---
 include/linux/irqflags.h        |    4 ++++
 kernel/trace/trace_preemptirq.c |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -29,6 +29,8 @@
 #endif
 
 #ifdef CONFIG_TRACE_IRQFLAGS
+  extern void trace_hardirqs_on_prepare(void);
+  extern void trace_hardirqs_off_prepare(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
 # define lockdep_hardirq_context(p)	((p)->hardirq_context)
@@ -96,6 +98,8 @@ do {						\
 	  } while (0)
 
 #else
+# define trace_hardirqs_on_prepare()		do { } while (0)
+# define trace_hardirqs_off_prepare()		do { } while (0)
 # define trace_hardirqs_on()		do { } while (0)
 # define trace_hardirqs_off()		do { } while (0)
 # define lockdep_hardirq_context(p)	0
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -19,6 +19,24 @@
 /* Per-cpu variable to prevent redundant calls when IRQs already off */
 static DEFINE_PER_CPU(int, tracing_irq_cpu);
 
+/*
+ * Like trace_hardirqs_on() but without the lockdep invocation. This is
+ * used in the low level entry code where the ordering vs. RCU is important
+ * and lockdep uses a staged approach which splits the lockdep hardirq
+ * tracking into a RCU on and a RCU off section.
+ */
+void trace_hardirqs_on_prepare(void)
+{
+	if (this_cpu_read(tracing_irq_cpu)) {
+		if (!in_nmi())
+			trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1);
+		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
+		this_cpu_write(tracing_irq_cpu, 0);
+	}
+}
+EXPORT_SYMBOL(trace_hardirqs_on_prepare);
+NOKPROBE_SYMBOL(trace_hardirqs_on_prepare);
+
 void trace_hardirqs_on(void)
 {
 	if (this_cpu_read(tracing_irq_cpu)) {
@@ -33,6 +51,25 @@ void trace_hardirqs_on(void)
 EXPORT_SYMBOL(trace_hardirqs_on);
 NOKPROBE_SYMBOL(trace_hardirqs_on);
 
+/*
+ * Like trace_hardirqs_off() but without the lockdep invocation. This is
+ * used in the low level entry code where the ordering vs. RCU is important
+ * and lockdep uses a staged approach which splits the lockdep hardirq
+ * tracking into a RCU on and a RCU off section.
+ */
+void trace_hardirqs_off_prepare(void)
+{
+	if (!this_cpu_read(tracing_irq_cpu)) {
+		this_cpu_write(tracing_irq_cpu, 1);
+		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
+		if (!in_nmi())
+			trace_irq_disable(CALLER_ADDR0, CALLER_ADDR1);
+	}
+
+}
+EXPORT_SYMBOL(trace_hardirqs_off_prepare);
+NOKPROBE_SYMBOL(trace_hardirqs_off_prepare);
+
 void trace_hardirqs_off(void)
 {
 	if (!this_cpu_read(tracing_irq_cpu)) {


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 23/36] bug: Annotate WARN/BUG/stackfail as noinstr safe
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (21 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 22/36] tracing: Provide lockdep less trace_hardirqs_on/off() variants Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-13 23:12   ` Mathieu Desnoyers
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-05 13:16 ` [patch V4 part 1 24/36] lockdep: Prepare for noinstr sections Thomas Gleixner
                   ` (13 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

Warnings, bugs and stack protection failures from noinstr sections, e.g. low
level and early entry code, are likely to be fatal.

Mark them as "safe" to be invoked from noinstr protected code to avoid
annotating all usage sites. Getting the information out is important.
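A tiny sketch of what this allows (the function is invented):

/*
 * Hypothetical noinstr helper: WARN_ON_ONCE() expands to __WARN_FLAGS()
 * resp. __WARN_printf(), which now wrap their instrumentable parts in
 * instr_begin()/instr_end(), so no annotation is needed at this site.
 */
noinstr void example_entry_sanity_check(unsigned long state)
{
	WARN_ON_ONCE(state != 0);
}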

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/bug.h |    3 +++
 include/asm-generic/bug.h  |    9 +++++++--
 kernel/panic.c             |    4 +++-
 3 files changed, 13 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -70,13 +70,16 @@ do {									\
 #define HAVE_ARCH_BUG
 #define BUG()							\
 do {								\
+	instr_begin();						\
 	_BUG_FLAGS(ASM_UD2, 0);					\
 	unreachable();						\
 } while (0)
 
 #define __WARN_FLAGS(flags)					\
 do {								\
+	instr_begin();						\
 	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
+	instr_end();						\
 	annotate_reachable();					\
 } while (0)
 
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -83,14 +83,19 @@ extern __printf(4, 5)
 void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
 		       const char *fmt, ...);
 #define __WARN()		__WARN_printf(TAINT_WARN, NULL)
-#define __WARN_printf(taint, arg...)					\
-	warn_slowpath_fmt(__FILE__, __LINE__, taint, arg)
+#define __WARN_printf(taint, arg...) do {				\
+		instr_begin();						\
+		warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);	\
+		instr_end();						\
+	} while (0)
 #else
 extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()		__WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {				\
+		instr_begin();						\
 		__warn_printk(arg);					\
 		__WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
+		instr_end();						\
 	} while (0)
 #define WARN_ON_ONCE(condition) ({				\
 	int __ret_warn_on = !!(condition);			\
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -662,10 +662,12 @@ device_initcall(register_warn_debugfs);
  * Called when gcc's -fstack-protector feature is used, and
  * gcc detects corruption of the on-stack canary value
  */
-__visible void __stack_chk_fail(void)
+__visible noinstr void __stack_chk_fail(void)
 {
+	instr_begin();
 	panic("stack-protector: Kernel stack is corrupted in: %pB",
 		__builtin_return_address(0));
+	instr_end();
 }
 EXPORT_SYMBOL(__stack_chk_fail);
 


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 24/36] lockdep: Prepare for noinstr sections
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (22 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 23/36] bug: Annotate WARN/BUG/stackfail as noinstr safe Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-08  0:23   ` Steven Rostedt
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr Thomas Gleixner
                   ` (12 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra

From: Peter Zijlstra <peterz@infradead.org>

Force inlining and prevent instrumentation of all sorts by marking the
functions which are invoked from low level entry code with 'noinstr'.

Split the irqflags tracking into two parts. One which does the heavy
lifting while RCU is watching and the final one which can be invoked after
RCU is turned off.
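For illustration only, the staged calls end up being used along these lines;
the helper name and the exact place of the context tracking call are
assumptions, not part of this patch:

static void example_exit_to_user_mode(void)
{
	/* Heavy lifting (marking held locks) while RCU is still watching. */
	lockdep_hardirqs_on_prepare(CALLER_ADDR0);

	user_enter_irqoff();	/* context tracking, RCU stops watching */

	/* Final noinstr-safe state flip, fine after RCU is off. */
	lockdep_hardirqs_on(CALLER_ADDR0);
}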

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irqflags.h        |    2 +
 include/linux/sched.h           |    1 
 kernel/locking/lockdep.c        |   70 ++++++++++++++++++++++++++++++----------
 kernel/trace/trace_preemptirq.c |    2 +
 lib/debug_locks.c               |    2 -
 5 files changed, 59 insertions(+), 18 deletions(-)

--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -19,11 +19,13 @@
 #ifdef CONFIG_PROVE_LOCKING
   extern void lockdep_softirqs_on(unsigned long ip);
   extern void lockdep_softirqs_off(unsigned long ip);
+  extern void lockdep_hardirqs_on_prepare(unsigned long ip);
   extern void lockdep_hardirqs_on(unsigned long ip);
   extern void lockdep_hardirqs_off(unsigned long ip);
 #else
   static inline void lockdep_softirqs_on(unsigned long ip) { }
   static inline void lockdep_softirqs_off(unsigned long ip) { }
+  static inline void lockdep_hardirqs_on_prepare(unsigned long ip) { }
   static inline void lockdep_hardirqs_on(unsigned long ip) { }
   static inline void lockdep_hardirqs_off(unsigned long ip) { }
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -983,6 +983,7 @@ struct task_struct {
 	unsigned int			hardirq_disable_event;
 	int				hardirqs_enabled;
 	int				hardirq_context;
+	u64				hardirq_chain_key;
 	unsigned long			softirq_disable_ip;
 	unsigned long			softirq_enable_ip;
 	unsigned int			softirq_disable_event;
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3639,9 +3639,6 @@ static void __trace_hardirqs_on_caller(u
 {
 	struct task_struct *curr = current;
 
-	/* we'll do an OFF -> ON transition: */
-	curr->hardirqs_enabled = 1;
-
 	/*
 	 * We are going to turn hardirqs on, so set the
 	 * usage bit for all held locks:
@@ -3653,16 +3650,13 @@ static void __trace_hardirqs_on_caller(u
 	 * bit for all held locks. (disabled hardirqs prevented
 	 * this bit from being set before)
 	 */
-	if (curr->softirqs_enabled)
+	if (curr->softirqs_enabled) {
 		if (!mark_held_locks(curr, LOCK_ENABLED_SOFTIRQ))
 			return;
-
-	curr->hardirq_enable_ip = ip;
-	curr->hardirq_enable_event = ++curr->irq_events;
-	debug_atomic_inc(hardirqs_on_events);
+	}
 }
 
-void lockdep_hardirqs_on(unsigned long ip)
+void lockdep_hardirqs_on_prepare(unsigned long ip)
 {
 	if (unlikely(!debug_locks || current->lockdep_recursion))
 		return;
@@ -3698,20 +3692,62 @@ void lockdep_hardirqs_on(unsigned long i
 	if (DEBUG_LOCKS_WARN_ON(current->hardirq_context))
 		return;
 
+	current->hardirq_chain_key = current->curr_chain_key;
+
 	current->lockdep_recursion++;
 	__trace_hardirqs_on_caller(ip);
 	lockdep_recursion_finish();
 }
-NOKPROBE_SYMBOL(lockdep_hardirqs_on);
+EXPORT_SYMBOL_GPL(lockdep_hardirqs_on_prepare);
+
+void noinstr lockdep_hardirqs_on(unsigned long ip)
+{
+	struct task_struct *curr = current;
+
+	if (unlikely(!debug_locks || curr->lockdep_recursion))
+		return;
+
+	if (curr->hardirqs_enabled) {
+		/*
+		 * Neither irq nor preemption are disabled here
+		 * so this is racy by nature but losing one hit
+		 * in a stat is not a big deal.
+		 */
+		__debug_atomic_inc(redundant_hardirqs_on);
+		return;
+	}
+
+	/*
+	 * We're enabling irqs and according to our state above irqs weren't
+	 * already enabled, yet we find the hardware thinks they are in fact
+	 * enabled.. someone messed up their IRQ state tracing.
+	 */
+	if (DEBUG_LOCKS_WARN_ON(!irqs_disabled()))
+		return;
+
+	/*
+	 * Ensure the lock stack remained unchanged between
+	 * lockdep_hardirqs_on_prepare() and lockdep_hardirqs_on().
+	 */
+	DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key !=
+			    current->curr_chain_key);
+
+	/* we'll do an OFF -> ON transition: */
+	curr->hardirqs_enabled = 1;
+	curr->hardirq_enable_ip = ip;
+	curr->hardirq_enable_event = ++curr->irq_events;
+	debug_atomic_inc(hardirqs_on_events);
+}
+EXPORT_SYMBOL_GPL(lockdep_hardirqs_on);
 
 /*
  * Hardirqs were disabled:
  */
-void lockdep_hardirqs_off(unsigned long ip)
+void noinstr lockdep_hardirqs_off(unsigned long ip)
 {
 	struct task_struct *curr = current;
 
-	if (unlikely(!debug_locks || current->lockdep_recursion))
+	if (unlikely(!debug_locks || curr->lockdep_recursion))
 		return;
 
 	/*
@@ -3732,7 +3768,7 @@ void lockdep_hardirqs_off(unsigned long
 	} else
 		debug_atomic_inc(redundant_hardirqs_off);
 }
-NOKPROBE_SYMBOL(lockdep_hardirqs_off);
+EXPORT_SYMBOL_GPL(lockdep_hardirqs_off);
 
 /*
  * Softirqs will be enabled:
@@ -4408,8 +4444,8 @@ static void print_unlock_imbalance_bug(s
 	dump_stack();
 }
 
-static int match_held_lock(const struct held_lock *hlock,
-					const struct lockdep_map *lock)
+static noinstr int match_held_lock(const struct held_lock *hlock,
+				   const struct lockdep_map *lock)
 {
 	if (hlock->instance == lock)
 		return 1;
@@ -4696,7 +4732,7 @@ static int
 	return 0;
 }
 
-static nokprobe_inline
+static __always_inline
 int __lock_is_held(const struct lockdep_map *lock, int read)
 {
 	struct task_struct *curr = current;
@@ -4956,7 +4992,7 @@ void lock_release(struct lockdep_map *lo
 }
 EXPORT_SYMBOL_GPL(lock_release);
 
-int lock_is_held_type(const struct lockdep_map *lock, int read)
+noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
 {
 	unsigned long flags;
 	int ret = 0;
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -46,6 +46,7 @@ void trace_hardirqs_on(void)
 		this_cpu_write(tracing_irq_cpu, 0);
 	}
 
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 EXPORT_SYMBOL(trace_hardirqs_on);
@@ -93,6 +94,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
 		this_cpu_write(tracing_irq_cpu, 0);
 	}
 
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 EXPORT_SYMBOL(trace_hardirqs_on_caller);
--- a/lib/debug_locks.c
+++ b/lib/debug_locks.c
@@ -36,7 +36,7 @@ EXPORT_SYMBOL_GPL(debug_locks_silent);
 /*
  * Generic 'turn off all lock debugging' function:
  */
-int debug_locks_off(void)
+noinstr int debug_locks_off(void)
 {
 	if (debug_locks && __debug_locks_off()) {
 		if (!debug_locks_silent) {


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (23 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 24/36] lockdep: Prepare for noinstr sections Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-05 18:07   ` Paul E. McKenney
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 26/36] printk: Prepare for nested printk_nmi_enter() Thomas Gleixner
                   ` (11 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

These functions are invoked from context tracking and other places in the
low level entry code. Move them into the .noinstr.text section to exclude
them from instrumentation.

Mark the places where it is safe to invoke traceable functions with
instr_begin/end() so objtool won't complain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rcu/tree.c        |   85 +++++++++++++++++++++++++----------------------
 kernel/rcu/tree_plugin.h |    4 +-
 kernel/rcu/update.c      |    7 +--
 3 files changed, 52 insertions(+), 44 deletions(-)

--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -75,9 +75,6 @@
  */
 #define RCU_DYNTICK_CTRL_MASK 0x1
 #define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
-#ifndef rcu_eqs_special_exit
-#define rcu_eqs_special_exit() do { } while (0)
-#endif
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
 	.dynticks_nesting = 1,
@@ -229,7 +226,7 @@ void rcu_softirq_qs(void)
  * RCU is watching prior to the call to this function and is no longer
  * watching upon return.
  */
-static void rcu_dynticks_eqs_enter(void)
+static noinstr void rcu_dynticks_eqs_enter(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int seq;
@@ -253,7 +250,7 @@ static void rcu_dynticks_eqs_enter(void)
  * called from an extended quiescent state, that is, RCU is not watching
  * prior to the call to this function and is watching upon return.
  */
-static void rcu_dynticks_eqs_exit(void)
+static noinstr void rcu_dynticks_eqs_exit(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int seq;
@@ -270,8 +267,6 @@ static void rcu_dynticks_eqs_exit(void)
 	if (seq & RCU_DYNTICK_CTRL_MASK) {
 		atomic_andnot(RCU_DYNTICK_CTRL_MASK, &rdp->dynticks);
 		smp_mb__after_atomic(); /* _exit after clearing mask. */
-		/* Prefer duplicate flushes to losing a flush. */
-		rcu_eqs_special_exit();
 	}
 }
 
@@ -299,7 +294,7 @@ static void rcu_dynticks_eqs_online(void
  *
  * No ordering, as we are sampling CPU-local information.
  */
-static bool rcu_dynticks_curr_cpu_in_eqs(void)
+static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
@@ -566,7 +561,7 @@ EXPORT_SYMBOL_GPL(rcutorture_get_gp_data
  * the possibility of usermode upcalls having messed up our count
  * of interrupt nesting level during the prior busy period.
  */
-static void rcu_eqs_enter(bool user)
+static noinstr void rcu_eqs_enter(bool user)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
@@ -581,12 +576,14 @@ static void rcu_eqs_enter(bool user)
 	}
 
 	lockdep_assert_irqs_disabled();
+	instr_begin();
 	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	rdp = this_cpu_ptr(&rcu_data);
 	do_nocb_deferred_wakeup(rdp);
 	rcu_prepare_for_idle();
 	rcu_preempt_deferred_qs(current);
+	instr_end();
 	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
 	// RCU is watching here ...
 	rcu_dynticks_eqs_enter();
@@ -623,7 +620,7 @@ void rcu_idle_enter(void)
  * If you add or remove a call to rcu_user_enter(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_user_enter(void)
+noinstr void rcu_user_enter(void)
 {
 	lockdep_assert_irqs_disabled();
 	rcu_eqs_enter(true);
@@ -656,19 +653,23 @@ static __always_inline void rcu_nmi_exit
 	 * leave it in non-RCU-idle state.
 	 */
 	if (rdp->dynticks_nmi_nesting != 1) {
+		instr_begin();
 		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
 				  atomic_read(&rdp->dynticks));
 		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
 			   rdp->dynticks_nmi_nesting - 2);
+		instr_end();
 		return;
 	}
 
+	instr_begin();
 	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
 	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
 
 	if (irq)
 		rcu_prepare_for_idle();
+	instr_end();
 
 	// RCU is watching here ...
 	rcu_dynticks_eqs_enter();
@@ -684,7 +685,7 @@ static __always_inline void rcu_nmi_exit
  * If you add or remove a call to rcu_nmi_exit(), be sure to test
  * with CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_nmi_exit(void)
+void noinstr rcu_nmi_exit(void)
 {
 	rcu_nmi_exit_common(false);
 }
@@ -708,7 +709,7 @@ void rcu_nmi_exit(void)
  * If you add or remove a call to rcu_irq_exit(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_irq_exit(void)
+void noinstr rcu_irq_exit(void)
 {
 	lockdep_assert_irqs_disabled();
 	rcu_nmi_exit_common(true);
@@ -737,7 +738,7 @@ void rcu_irq_exit_irqson(void)
  * allow for the possibility of usermode upcalls messing up our count of
  * interrupt nesting level during the busy period that is just now starting.
  */
-static void rcu_eqs_exit(bool user)
+static void noinstr rcu_eqs_exit(bool user)
 {
 	struct rcu_data *rdp;
 	long oldval;
@@ -755,12 +756,14 @@ static void rcu_eqs_exit(bool user)
 	// RCU is not watching here ...
 	rcu_dynticks_eqs_exit();
 	// ... but is watching here.
+	instr_begin();
 	rcu_cleanup_after_idle();
 	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	WRITE_ONCE(rdp->dynticks_nesting, 1);
 	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
+	instr_end();
 }
 
 /**
@@ -791,7 +794,7 @@ void rcu_idle_exit(void)
  * If you add or remove a call to rcu_user_exit(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_user_exit(void)
+void noinstr rcu_user_exit(void)
 {
 	rcu_eqs_exit(1);
 }
@@ -839,28 +842,35 @@ static __always_inline void rcu_nmi_ente
 			rcu_cleanup_after_idle();
 
 		incby = 1;
-	} else if (irq && tick_nohz_full_cpu(rdp->cpu) &&
-		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
-		   READ_ONCE(rdp->rcu_urgent_qs) &&
-		   !READ_ONCE(rdp->rcu_forced_tick)) {
-		// We get here only if we had already exited the extended
-		// quiescent state and this was an interrupt (not an NMI).
-		// Therefore, (1) RCU is already watching and (2) The fact
-		// that we are in an interrupt handler and that the rcu_node
-		// lock is an irq-disabled lock prevents self-deadlock.
-		// So we can safely recheck under the lock.
-		raw_spin_lock_rcu_node(rdp->mynode);
-		if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
-			// A nohz_full CPU is in the kernel and RCU
-			// needs a quiescent state.  Turn on the tick!
-			WRITE_ONCE(rdp->rcu_forced_tick, true);
-			tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+	} else if (irq) {
+		instr_begin();
+		if (tick_nohz_full_cpu(rdp->cpu) &&
+		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
+		    READ_ONCE(rdp->rcu_urgent_qs) &&
+		    !READ_ONCE(rdp->rcu_forced_tick)) {
+			// We get here only if we had already exited the
+			// extended quiescent state and this was an
+			// interrupt (not an NMI).  Therefore, (1) RCU is
+			// already watching and (2) The fact that we are in
+			// an interrupt handler and that the rcu_node lock
+			// is an irq-disabled lock prevents self-deadlock.
+			// So we can safely recheck under the lock.
+			raw_spin_lock_rcu_node(rdp->mynode);
+			if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+				// A nohz_full CPU is in the kernel and RCU
+				// needs a quiescent state.  Turn on the tick!
+				WRITE_ONCE(rdp->rcu_forced_tick, true);
+				tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+			}
+			raw_spin_unlock_rcu_node(rdp->mynode);
 		}
-		raw_spin_unlock_rcu_node(rdp->mynode);
+		instr_end();
 	}
+	instr_begin();
 	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
 			  rdp->dynticks_nmi_nesting,
 			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
+	instr_end();
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
 		   rdp->dynticks_nmi_nesting + incby);
 	barrier();
@@ -869,11 +879,10 @@ static __always_inline void rcu_nmi_ente
 /**
  * rcu_nmi_enter - inform RCU of entry to NMI context
  */
-void rcu_nmi_enter(void)
+noinstr void rcu_nmi_enter(void)
 {
 	rcu_nmi_enter_common(false);
 }
-NOKPROBE_SYMBOL(rcu_nmi_enter);
 
 /**
  * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
@@ -897,7 +906,7 @@ NOKPROBE_SYMBOL(rcu_nmi_enter);
  * If you add or remove a call to rcu_irq_enter(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_irq_enter(void)
+noinstr void rcu_irq_enter(void)
 {
 	lockdep_assert_irqs_disabled();
 	rcu_nmi_enter_common(true);
@@ -942,7 +951,7 @@ static void rcu_disable_urgency_upon_qs(
  * if the current CPU is not in its idle loop or is in an interrupt or
  * NMI handler, return true.
  */
-bool notrace rcu_is_watching(void)
+noinstr bool rcu_is_watching(void)
 {
 	bool ret;
 
@@ -986,7 +995,7 @@ void rcu_request_urgent_qs_task(struct t
  * RCU on an offline processor during initial boot, hence the check for
  * rcu_scheduler_fully_active.
  */
-bool rcu_lockdep_current_cpu_online(void)
+noinstr bool rcu_lockdep_current_cpu_online(void)
 {
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
@@ -994,12 +1003,12 @@ bool rcu_lockdep_current_cpu_online(void
 
 	if (in_nmi() || !rcu_scheduler_fully_active)
 		return true;
-	preempt_disable();
+	preempt_disable_notrace();
 	rdp = this_cpu_ptr(&rcu_data);
 	rnp = rdp->mynode;
 	if (rdp->grpmask & rcu_rnp_online_cpus(rnp))
 		ret = true;
-	preempt_enable();
+	preempt_enable_notrace();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2553,7 +2553,7 @@ static void rcu_bind_gp_kthread(void)
 }
 
 /* Record the current task on dyntick-idle entry. */
-static void rcu_dynticks_task_enter(void)
+static void noinstr rcu_dynticks_task_enter(void)
 {
 #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
 	WRITE_ONCE(current->rcu_tasks_idle_cpu, smp_processor_id());
@@ -2561,7 +2561,7 @@ static void rcu_dynticks_task_enter(void
 }
 
 /* Record no current task on dyntick-idle exit. */
-static void rcu_dynticks_task_exit(void)
+static void noinstr rcu_dynticks_task_exit(void)
 {
 #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
 	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -95,7 +95,7 @@ module_param(rcu_normal_after_boot, int,
  * Similarly, we avoid claiming an RCU read lock held if the current
  * CPU is offline.
  */
-static bool rcu_read_lock_held_common(bool *ret)
+static noinstr bool rcu_read_lock_held_common(bool *ret)
 {
 	if (!debug_lockdep_rcu_enabled()) {
 		*ret = 1;
@@ -112,7 +112,7 @@ static bool rcu_read_lock_held_common(bo
 	return false;
 }
 
-int rcu_read_lock_sched_held(void)
+noinstr int rcu_read_lock_sched_held(void)
 {
 	bool ret;
 
@@ -270,13 +270,12 @@ struct lockdep_map rcu_callback_map =
 	STATIC_LOCKDEP_MAP_INIT("rcu_callback", &rcu_callback_key);
 EXPORT_SYMBOL_GPL(rcu_callback_map);
 
-int notrace debug_lockdep_rcu_enabled(void)
+noinstr int notrace debug_lockdep_rcu_enabled(void)
 {
 	return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
 	       current->lockdep_recursion == 0;
 }
 EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
-NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled);
 
 /**
  * rcu_read_lock_held() - might we be in RCU read-side critical section?


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 26/36] printk: Prepare for nested printk_nmi_enter()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (24 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Petr Mladek
  2020-05-05 13:16 ` [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion Thomas Gleixner
                   ` (10 subsequent siblings)
  36 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Petr Mladek <pmladek@suse.com>

There is plenty of space in the printk_context variable. Reserve one byte
there for the NMI context to be on the safe side.

It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter()
will trigger much earlier.
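Purely as an illustration of the difference (this is not code from the patch
and printk_context is local to kernel/printk/printk_safe.c):

static void printk_nmi_nesting_example(void)
{
	/* Old scheme: a single bit, so a nested exit wiped the outer context. */
	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);		/* outer enter */
	this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);	/* nested exit: bit gone */

	/* New scheme: a byte-wide counter, so nesting is preserved. */
	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);	/* outer enter  -> count 1 */
	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);	/* nested enter -> count 2 */
	this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);	/* nested exit  -> count 1 */
}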

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200224121331.tnkvp6mmjchm4s2i@pathway.suse.cz
---
 kernel/printk/internal.h    |    8 +++++---
 kernel/printk/printk_safe.c |    4 ++--
 2 files changed, 7 insertions(+), 5 deletions(-)

--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -6,9 +6,11 @@
 
 #ifdef CONFIG_PRINTK
 
-#define PRINTK_SAFE_CONTEXT_MASK	 0x3fffffff
-#define PRINTK_NMI_DIRECT_CONTEXT_MASK	 0x40000000
-#define PRINTK_NMI_CONTEXT_MASK		 0x80000000
+#define PRINTK_SAFE_CONTEXT_MASK	0x007ffffff
+#define PRINTK_NMI_DIRECT_CONTEXT_MASK	0x008000000
+#define PRINTK_NMI_CONTEXT_MASK		0xff0000000
+
+#define PRINTK_NMI_CONTEXT_OFFSET	0x010000000
 
 extern raw_spinlock_t logbuf_lock;
 
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -295,12 +295,12 @@ static __printf(1, 0) int vprintk_nmi(co
 
 void notrace printk_nmi_enter(void)
 {
-	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
+	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }
 
 void notrace printk_nmi_exit(void)
 {
-	this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);
+	this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }
 
 /*


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (25 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 26/36] printk: Prepare for nested printk_nmi_enter() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-13 23:28   ` Mathieu Desnoyers
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 28/36] hardirq/nmi: Allow nested nmi_enter() Thomas Gleixner
                   ` (9 subsequent siblings)
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel),
	Catalin Marinas

From: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/include/asm/hardirq.h |   78 +++++++++++++++++++++++++++++----------
 1 file changed, 59 insertions(+), 19 deletions(-)

--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -32,30 +32,70 @@ u64 smp_irq_stat_cpu(unsigned int cpu);
 
 struct nmi_ctx {
 	u64 hcr;
+	unsigned int cnt;
 };
 
 DECLARE_PER_CPU(struct nmi_ctx, nmi_contexts);
 
-#define arch_nmi_enter()							\
-	do {									\
-		if (is_kernel_in_hyp_mode()) {					\
-			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
-			nmi_ctx->hcr = read_sysreg(hcr_el2);			\
-			if (!(nmi_ctx->hcr & HCR_TGE)) {			\
-				write_sysreg(nmi_ctx->hcr | HCR_TGE, hcr_el2);	\
-				isb();						\
-			}							\
-		}								\
-	} while (0)
+#define arch_nmi_enter()						\
+do {									\
+	struct nmi_ctx *___ctx;						\
+	u64 ___hcr;							\
+									\
+	if (!is_kernel_in_hyp_mode())					\
+		break;							\
+									\
+	___ctx = this_cpu_ptr(&nmi_contexts);				\
+	if (___ctx->cnt) {						\
+		___ctx->cnt++;						\
+		break;							\
+	}								\
+									\
+	___hcr = read_sysreg(hcr_el2);					\
+	if (!(___hcr & HCR_TGE)) {					\
+		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
+		isb();							\
+	}								\
+	/*								\
+	 * Make sure the sysreg write is performed before ___ctx->cnt	\
+	 * is set to 1. NMIs that see cnt == 1 will rely on us.		\
+	 */								\
+	barrier();							\
+	___ctx->cnt = 1;                                                \
+	/*								\
+	 * Make sure ___ctx->cnt is set before we save ___hcr. We	\
+	 * don't want ___ctx->hcr to be overwritten.			\
+	 */								\
+	barrier();							\
+	___ctx->hcr = ___hcr;						\
+} while (0)
 
-#define arch_nmi_exit()								\
-	do {									\
-		if (is_kernel_in_hyp_mode()) {					\
-			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
-			if (!(nmi_ctx->hcr & HCR_TGE))				\
-				write_sysreg(nmi_ctx->hcr, hcr_el2);		\
-		}								\
-	} while (0)
+#define arch_nmi_exit()							\
+do {									\
+	struct nmi_ctx *___ctx;						\
+	u64 ___hcr;							\
+									\
+	if (!is_kernel_in_hyp_mode())					\
+		break;							\
+									\
+	___ctx = this_cpu_ptr(&nmi_contexts);				\
+	___hcr = ___ctx->hcr;						\
+	/*								\
+	 * Make sure we read ___ctx->hcr before we release		\
+	 * ___ctx->cnt as it makes ___ctx->hcr updatable again.		\
+	 */								\
+	barrier();							\
+	___ctx->cnt--;							\
+	/*								\
+	 * Make sure ___ctx->cnt release is visible before we		\
+	 * restore the sysreg. Otherwise a new NMI occurring		\
+	 * right after write_sysreg() can be fooled and think		\
+	 * we secured things for it.					\
+	 */								\
+	barrier();							\
+	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
+		write_sysreg(___hcr, hcr_el2);				\
+} while (0)
 
 static inline void ack_bad_irq(unsigned int irq)
 {


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 28/36] hardirq/nmi: Allow nested nmi_enter()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (26 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work Thomas Gleixner
                   ` (8 subsequent siblings)
  36 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel),
	Marc Zyngier, Michael Ellerman

From: Peter Zijlstra <peterz@infradead.org>

Since there are already a number of sites (ARM64, PowerPC) that effectively
nest nmi_enter(), make the primitive support this before adding even more.
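A deliberately artificial sketch of what the widened field permits (the
function is invented and only meant to show the accounting):

static void example_nested_nmi_accounting(void)
{
	int level;

	/* NMI_BITS is now 4, so the preempt_count field holds 0..15. */
	for (level = 0; level < 3; level++)
		nmi_enter();	/* each level adds NMI_OFFSET + HARDIRQ_OFFSET */

	WARN_ON(!in_nmi());	/* still accounted as NMI context */

	for (level = 0; level < 3; level++)
		nmi_exit();
}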

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/arm64/kernel/sdei.c    |   14 ++------------
 arch/arm64/kernel/traps.c   |    8 ++------
 arch/powerpc/kernel/traps.c |   22 ++++++----------------
 include/linux/hardirq.h     |    5 ++++-
 include/linux/preempt.h     |    4 ++--
 5 files changed, 16 insertions(+), 37 deletions(-)

--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -251,22 +251,12 @@ asmlinkage __kprobes notrace unsigned lo
 __sdei_handler(struct pt_regs *regs, struct sdei_registered_event *arg)
 {
 	unsigned long ret;
-	bool do_nmi_exit = false;
 
-	/*
-	 * nmi_enter() deals with printk() re-entrance and use of RCU when
-	 * RCU believed this CPU was idle. Because critical events can
-	 * interrupt normal events, we may already be in_nmi().
-	 */
-	if (!in_nmi()) {
-		nmi_enter();
-		do_nmi_exit = true;
-	}
+	nmi_enter();
 
 	ret = _sdei_handler(regs, arg);
 
-	if (do_nmi_exit)
-		nmi_exit();
+	nmi_exit();
 
 	return ret;
 }
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -906,17 +906,13 @@ bool arm64_is_fatal_ras_serror(struct pt
 
 asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr)
 {
-	const bool was_in_nmi = in_nmi();
-
-	if (!was_in_nmi)
-		nmi_enter();
+	nmi_enter();
 
 	/* non-RAS errors are not containable */
 	if (!arm64_is_ras_serror(esr) || arm64_is_fatal_ras_serror(regs, esr))
 		arm64_serror_panic(regs, esr);
 
-	if (!was_in_nmi)
-		nmi_exit();
+	nmi_exit();
 }
 
 asmlinkage void enter_from_user_mode(void)
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -441,15 +441,9 @@ void hv_nmi_check_nonrecoverable(struct
 void system_reset_exception(struct pt_regs *regs)
 {
 	unsigned long hsrr0, hsrr1;
-	bool nested = in_nmi();
 	bool saved_hsrrs = false;
 
-	/*
-	 * Avoid crashes in case of nested NMI exceptions. Recoverability
-	 * is determined by RI and in_nmi
-	 */
-	if (!nested)
-		nmi_enter();
+	nmi_enter();
 
 	/*
 	 * System reset can interrupt code where HSRRs are live and MSR[RI]=1.
@@ -521,8 +515,7 @@ void system_reset_exception(struct pt_re
 		mtspr(SPRN_HSRR1, hsrr1);
 	}
 
-	if (!nested)
-		nmi_exit();
+	nmi_exit();
 
 	/* What should we do here? We could issue a shutdown or hard reset. */
 }
@@ -823,9 +816,8 @@ int machine_check_generic(struct pt_regs
 void machine_check_exception(struct pt_regs *regs)
 {
 	int recover = 0;
-	bool nested = in_nmi();
-	if (!nested)
-		nmi_enter();
+
+	nmi_enter();
 
 	__this_cpu_inc(irq_stat.mce_exceptions);
 
@@ -851,8 +843,7 @@ void machine_check_exception(struct pt_r
 	if (check_io_access(regs))
 		goto bail;
 
-	if (!nested)
-		nmi_exit();
+	nmi_exit();
 
 	die("Machine check", regs, SIGBUS);
 
@@ -863,8 +854,7 @@ void machine_check_exception(struct pt_r
 	return;
 
 bail:
-	if (!nested)
-		nmi_exit();
+	nmi_exit();
 }
 
 void SMIException(struct pt_regs *regs)
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -65,13 +65,16 @@ extern void irq_exit(void);
 #define arch_nmi_exit()		do { } while (0)
 #endif
 
+/*
+ * nmi_enter() can nest up to 15 times; see NMI_BITS.
+ */
 #define nmi_enter()						\
 	do {							\
 		arch_nmi_enter();				\
 		printk_nmi_enter();				\
 		lockdep_off();					\
 		ftrace_nmi_enter();				\
-		BUG_ON(in_nmi());				\
+		BUG_ON(in_nmi() == NMI_MASK);			\
 		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -26,13 +26,13 @@
  *         PREEMPT_MASK:	0x000000ff
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
- *             NMI_MASK:	0x00100000
+ *             NMI_MASK:	0x00f00000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
-#define NMI_BITS	1
+#define NMI_BITS	4
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (27 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 28/36] hardirq/nmi: Allow nested nmi_enter() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-07 18:02   ` Andy Lutomirski
                     ` (3 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 30/36] lockdep: Always inline lockdep_{off,on}() Thomas Gleixner
                   ` (7 subsequent siblings)
  36 siblings, 4 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Peter Zijlstra <peterz@infradead.org>

Convert #MC over to using task_work_add(); it will run the same code
slightly later, on the return to user path of the same exception.
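
In outline, the pattern this introduces (condensed from the hunks below;
error handling omitted):

	/* in the #MC handler, still in exception context */
	current->mce_addr = m.addr;
	current->mce_status = m.mcgstatus;
	current->mce_kill_me.func = kill_me_maybe;	/* or kill_me_now */
	task_work_add(current, &current->mce_kill_me, true);

The queued callback (kill_me_maybe() above) then runs on the return-to-user
path in ordinary process context, where memory_failure() and force_sig() are
safe to call.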

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/x86/kernel/cpu/mce/core.c |   56 ++++++++++++++++++++++-------------------
 include/linux/sched.h          |    6 ++++
 2 files changed, 37 insertions(+), 25 deletions(-)

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -42,6 +42,7 @@
 #include <linux/export.h>
 #include <linux/jump_label.h>
 #include <linux/set_memory.h>
+#include <linux/task_work.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -1086,23 +1087,6 @@ static void mce_clear_state(unsigned lon
 	}
 }
 
-static int do_memory_failure(struct mce *m)
-{
-	int flags = MF_ACTION_REQUIRED;
-	int ret;
-
-	pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
-	if (!(m->mcgstatus & MCG_STATUS_RIPV))
-		flags |= MF_MUST_KILL;
-	ret = memory_failure(m->addr >> PAGE_SHIFT, flags);
-	if (ret)
-		pr_err("Memory error not recovered");
-	else
-		set_mce_nospec(m->addr >> PAGE_SHIFT);
-	return ret;
-}
-
-
 /*
  * Cases where we avoid rendezvous handler timeout:
  * 1) If this CPU is offline.
@@ -1204,6 +1188,29 @@ static void __mc_scan_banks(struct mce *
 	*m = *final;
 }
 
+static void kill_me_now(struct callback_head *ch)
+{
+	force_sig(SIGBUS);
+}
+
+static void kill_me_maybe(struct callback_head *cb)
+{
+	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
+	int flags = MF_ACTION_REQUIRED;
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	if (!(p->mce_status & MCG_STATUS_RIPV))
+		flags |= MF_MUST_KILL;
+
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+		return;
+	}
+
+	pr_err("Memory error not recovered");
+	kill_me_now(cb);
+}
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
@@ -1222,7 +1229,7 @@ static void __mc_scan_banks(struct mce *
  * backing the user stack, tracing that reads the user stack will cause
  * potentially infinite recursion.
  */
-void notrace do_machine_check(struct pt_regs *regs, long error_code)
+void noinstr do_machine_check(struct pt_regs *regs, long error_code)
 {
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
@@ -1354,13 +1361,13 @@ void notrace do_machine_check(struct pt_
 	if ((m.cs & 3) == 3) {
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
-		local_irq_enable();
-		preempt_enable();
 
-		if (kill_it || do_memory_failure(&m))
-			force_sig(SIGBUS);
-		preempt_disable();
-		local_irq_disable();
+		current->mce_addr = m.addr;
+		current->mce_status = m.mcgstatus;
+		current->mce_kill_me.func = kill_me_maybe;
+		if (kill_it)
+			current->mce_kill_me.func = kill_me_now;
+		task_work_add(current, &current->mce_kill_me, true);
 	} else {
 		if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
 			mce_panic("Failed kernel mode recovery", &m, msg);
@@ -1370,7 +1377,6 @@ void notrace do_machine_check(struct pt_
 	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
-NOKPROBE_SYMBOL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int flags)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1290,6 +1290,12 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_X86_MCE
+	u64				mce_addr;
+	u64				mce_status;
+	struct callback_head		mce_kill_me;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 30/36] lockdep: Always inline lockdep_{off,on}()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (28 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-13 23:46   ` Mathieu Desnoyers
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 31/36] printk: Disallow instrumenting printk_nmi_enter() Thomas Gleixner
                   ` (6 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Peter Zijlstra <peterz@infradead.org>

These functions are called {early,late} in nmi_{enter,exit} and should
not be traced or probed. They are also puny, so 'inline' them.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/lockdep.h  |   23 +++++++++++++++++++++--
 kernel/locking/lockdep.c |   19 -------------------
 2 files changed, 21 insertions(+), 21 deletions(-)

--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -308,8 +308,27 @@ extern void lockdep_set_selftest_task(st
 
 extern void lockdep_init_task(struct task_struct *task);
 
-extern void lockdep_off(void);
-extern void lockdep_on(void);
+/*
+ * Split the recrursion counter in two to readily detect 'off' vs recursion.
+ */
+#define LOCKDEP_RECURSION_BITS	16
+#define LOCKDEP_OFF		(1U << LOCKDEP_RECURSION_BITS)
+#define LOCKDEP_RECURSION_MASK	(LOCKDEP_OFF - 1)
+
+/*
+ * lockdep_{off,on}() are macros to avoid tracing and kprobes; not inlines due
+ * to header dependencies.
+ */
+
+#define lockdep_off()					\
+do {							\
+	current->lockdep_recursion += LOCKDEP_OFF;	\
+} while (0)
+
+#define lockdep_on()					\
+do {							\
+	current->lockdep_recursion -= LOCKDEP_OFF;	\
+} while (0)
 
 extern void lockdep_register_key(struct lock_class_key *key);
 extern void lockdep_unregister_key(struct lock_class_key *key);
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -393,25 +393,6 @@ void lockdep_init_task(struct task_struc
 	task->lockdep_recursion = 0;
 }
 
-/*
- * Split the recrursion counter in two to readily detect 'off' vs recursion.
- */
-#define LOCKDEP_RECURSION_BITS	16
-#define LOCKDEP_OFF		(1U << LOCKDEP_RECURSION_BITS)
-#define LOCKDEP_RECURSION_MASK	(LOCKDEP_OFF - 1)
-
-void lockdep_off(void)
-{
-	current->lockdep_recursion += LOCKDEP_OFF;
-}
-EXPORT_SYMBOL(lockdep_off);
-
-void lockdep_on(void)
-{
-	current->lockdep_recursion -= LOCKDEP_OFF;
-}
-EXPORT_SYMBOL(lockdep_on);
-
 static inline void lockdep_recursion_finish(void)
 {
 	if (WARN_ON_ONCE(--current->lockdep_recursion))


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 31/36] printk: Disallow instrumenting printk_nmi_enter()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (29 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 30/36] lockdep: Always inline lockdep_{off,on}() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 32/36] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception Thomas Gleixner
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Peter Zijlstra <peterz@infradead.org>

It happens early in nmi_enter(); no tracing, probing or other funnies are
allowed there. This matters in particular because nmi_enter() will be used in
do_debug(), which would cause recursive exceptions when kprobed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/printk/printk_safe.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -10,6 +10,7 @@
 #include <linux/cpumask.h>
 #include <linux/irq_work.h>
 #include <linux/printk.h>
+#include <linux/kprobes.h>
 
 #include "internal.h"
 
@@ -293,12 +294,12 @@ static __printf(1, 0) int vprintk_nmi(co
 	return printk_safe_log_store(s, fmt, args);
 }
 
-void notrace printk_nmi_enter(void)
+void noinstr printk_nmi_enter(void)
 {
 	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }
 
-void notrace printk_nmi_exit(void)
+void noinstr printk_nmi_exit(void)
 {
 	this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 32/36] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (30 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 31/36] printk: Disallow instrumenting printk_nmi_enter() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-08  0:34   ` Steven Rostedt
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 33/36] x86,tracing: Robustify ftrace_nmi_enter() Thomas Gleixner
                   ` (4 subsequent siblings)
  36 siblings, 2 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel),
	Rich Felker, Yoshinori Sato

From: Peter Zijlstra <peterz@infradead.org>

SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}();
remove it from the generic code and move it into the SuperH code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
---
 Documentation/trace/ftrace-design.rst |    8 --------
 arch/sh/Kconfig                       |    1 -
 arch/sh/kernel/traps.c                |   12 ++++++++++++
 include/linux/ftrace_irq.h            |   11 -----------
 kernel/trace/Kconfig                  |   10 ----------
 5 files changed, 12 insertions(+), 30 deletions(-)

--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -229,14 +229,6 @@ Adding support for it is easy: just defi
 pass the return address pointer as the 'retp' argument to
 ftrace_push_return_trace().
 
-HAVE_FTRACE_NMI_ENTER
----------------------
-
-If you can't trace NMI functions, then skip this option.
-
-<details to be filled>
-
-
 HAVE_SYSCALL_TRACEPOINTS
 ------------------------
 
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -71,7 +71,6 @@ config SUPERH32
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_DYNAMIC_FTRACE
-	select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_ARCH_KGDB
--- a/arch/sh/kernel/traps.c
+++ b/arch/sh/kernel/traps.c
@@ -170,11 +170,21 @@ BUILD_TRAP_HANDLER(bug)
 	force_sig(SIGTRAP);
 }
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+extern void arch_ftrace_nmi_enter(void);
+extern void arch_ftrace_nmi_exit(void);
+#else
+static inline void arch_ftrace_nmi_enter(void) { }
+static inline void arch_ftrace_nmi_exit(void) { }
+#endif
+
 BUILD_TRAP_HANDLER(nmi)
 {
 	unsigned int cpu = smp_processor_id();
 	TRAP_HANDLER_DECL;
 
+	arch_ftrace_nmi_enter();
+
 	nmi_enter();
 	nmi_count(cpu)++;
 
@@ -190,4 +200,6 @@ BUILD_TRAP_HANDLER(nmi)
 	}
 
 	nmi_exit();
+
+	arch_ftrace_nmi_exit();
 }
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -2,15 +2,6 @@
 #ifndef _LINUX_FTRACE_IRQ_H
 #define _LINUX_FTRACE_IRQ_H
 
-
-#ifdef CONFIG_FTRACE_NMI_ENTER
-extern void arch_ftrace_nmi_enter(void);
-extern void arch_ftrace_nmi_exit(void);
-#else
-static inline void arch_ftrace_nmi_enter(void) { }
-static inline void arch_ftrace_nmi_exit(void) { }
-#endif
-
 #ifdef CONFIG_HWLAT_TRACER
 extern bool trace_hwlat_callback_enabled;
 extern void trace_hwlat_callback(bool enter);
@@ -22,12 +13,10 @@ static inline void ftrace_nmi_enter(void
 	if (trace_hwlat_callback_enabled)
 		trace_hwlat_callback(true);
 #endif
-	arch_ftrace_nmi_enter();
 }
 
 static inline void ftrace_nmi_exit(void)
 {
-	arch_ftrace_nmi_exit();
 #ifdef CONFIG_HWLAT_TRACER
 	if (trace_hwlat_callback_enabled)
 		trace_hwlat_callback(false);
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -10,11 +10,6 @@ config USER_STACKTRACE_SUPPORT
 config NOP_TRACER
 	bool
 
-config HAVE_FTRACE_NMI_ENTER
-	bool
-	help
-	  See Documentation/trace/ftrace-design.rst
-
 config HAVE_FUNCTION_TRACER
 	bool
 	help
@@ -72,11 +67,6 @@ config RING_BUFFER
 	select TRACE_CLOCK
 	select IRQ_WORK
 
-config FTRACE_NMI_ENTER
-       bool
-       depends on HAVE_FTRACE_NMI_ENTER
-       default y
-
 config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	select GLOB


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 33/36] x86,tracing: Robustify ftrace_nmi_enter()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (31 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 32/36] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-08  6:19   ` Masami Hiramatsu
  2020-05-05 13:16 ` [patch V4 part 1 34/36] sched,rcu,tracing: Avoid tracing before in_nmi() is correct Thomas Gleixner
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Peter Zijlstra <peterz@infradead.org>

  ftrace_nmi_enter()
     trace_hwlat_callback()
       trace_clock_local()
         sched_clock()
           paravirt_sched_clock()
           native_sched_clock()

All of these must not be traced or kprobed; they will be called from
do_debug() before the kprobe handler.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 arch/x86/include/asm/paravirt.h |    2 +-
 arch/x86/kernel/tsc.c           |    4 ++--
 include/linux/ftrace_irq.h      |    4 ++--
 kernel/trace/trace_clock.c      |    3 ++-
 kernel/trace/trace_hwlat.c      |    2 +-
 5 files changed, 8 insertions(+), 7 deletions(-)

--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -17,7 +17,7 @@
 #include <linux/cpumask.h>
 #include <asm/frame.h>
 
-static inline unsigned long long paravirt_sched_clock(void)
+static __always_inline unsigned long long paravirt_sched_clock(void)
 {
 	return PVOP_CALL0(unsigned long long, time.sched_clock);
 }
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -207,7 +207,7 @@ static void __init cyc2ns_init_secondary
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
-u64 native_sched_clock(void)
+noinstr u64 native_sched_clock(void)
 {
 	if (static_branch_likely(&__use_tsc)) {
 		u64 tsc_now = rdtsc();
@@ -240,7 +240,7 @@ u64 native_sched_clock_from_tsc(u64 tsc)
 /* We need to define a real function for sched_clock, to override the
    weak default version */
 #ifdef CONFIG_PARAVIRT
-unsigned long long sched_clock(void)
+noinstr unsigned long long sched_clock(void)
 {
 	return paravirt_sched_clock();
 }
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -7,7 +7,7 @@ extern bool trace_hwlat_callback_enabled
 extern void trace_hwlat_callback(bool enter);
 #endif
 
-static inline void ftrace_nmi_enter(void)
+static __always_inline void ftrace_nmi_enter(void)
 {
 #ifdef CONFIG_HWLAT_TRACER
 	if (trace_hwlat_callback_enabled)
@@ -15,7 +15,7 @@ static inline void ftrace_nmi_enter(void
 #endif
 }
 
-static inline void ftrace_nmi_exit(void)
+static __always_inline void ftrace_nmi_exit(void)
 {
 #ifdef CONFIG_HWLAT_TRACER
 	if (trace_hwlat_callback_enabled)
--- a/kernel/trace/trace_clock.c
+++ b/kernel/trace/trace_clock.c
@@ -22,6 +22,7 @@
 #include <linux/sched/clock.h>
 #include <linux/ktime.h>
 #include <linux/trace_clock.h>
+#include <linux/kprobes.h>
 
 /*
  * trace_clock_local(): the simplest and least coherent tracing clock.
@@ -29,7 +30,7 @@
  * Useful for tracing that does not cross to other CPUs nor
  * does it go through idle events.
  */
-u64 notrace trace_clock_local(void)
+u64 noinstr trace_clock_local(void)
 {
 	u64 clock;
 
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -139,7 +139,7 @@ static void trace_hwlat_sample(struct hw
 #define init_time(a, b)	(a = b)
 #define time_u64(a)	a
 
-void trace_hwlat_callback(bool enter)
+noinstr void trace_hwlat_callback(bool enter)
 {
 	if (smp_processor_id() != nmi_cpu)
 		return;


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 34/36] sched,rcu,tracing: Avoid tracing before in_nmi() is correct
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (32 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 33/36] x86,tracing: Robustify ftrace_nmi_enter() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2020-05-05 13:16 ` [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter() Thomas Gleixner
                   ` (2 subsequent siblings)
  36 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Steven Rostedt (VMware),
	Peter Zijlstra (Intel)

From: Peter Zijlstra <peterz@infradead.org>

If a tracer is invoked before in_nmi() becomes true, the tracer cannot
detect that it is called from NMI context and thus cannot behave correctly.

Therefore change nmi_{enter,exit}() to use __preempt_count_{add,sub}()
as the normal preempt_count_{add,sub}() have a (desired) function
trace entry.

This fixes a potential issue with the current code: when the function tracer
has stack-tracing enabled, __trace_stack() will malfunction when it hits the
preempt_count_add() function entry from NMI context.
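
A condensed sketch of the distinction, assuming the generic definitions in
asm-generic/preempt.h and kernel/sched/core.c (details vary per architecture
and configuration):

	/* raw inline update, invisible to the function tracer */
	static __always_inline void __preempt_count_add(int val)
	{
		*preempt_count_ptr() += val;
	}

	/* out-of-line in some configs, hence a (normally desired) trace entry */
	void preempt_count_add(int val)
	{
		__preempt_count_add(val);
	}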

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 include/linux/hardirq.h |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -66,6 +66,15 @@ extern void irq_exit(void);
 #endif
 
 /*
+ * NMI vs Tracing
+ * --------------
+ *
+ * We must not land in a tracer until (or after) we've changed preempt_count
+ * such that in_nmi() becomes true. To that effect all NMI C entry points must
+ * be marked 'notrace' and call nmi_enter() as soon as possible.
+ */
+
+/*
  * nmi_enter() can nest up to 15 times; see NMI_BITS.
  */
 #define nmi_enter()						\
@@ -75,7 +84,7 @@ extern void irq_exit(void);
 		lockdep_off();					\
 		ftrace_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
-		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
+		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
 	} while (0)
@@ -85,7 +94,7 @@ extern void irq_exit(void);
 		lockdep_hardirq_exit();				\
 		rcu_nmi_exit();					\
 		BUG_ON(!in_nmi());				\
-		preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
+		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		ftrace_nmi_exit();				\
 		lockdep_on();					\
 		printk_nmi_exit();				\


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (33 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 34/36] sched,rcu,tracing: Avoid tracing before in_nmi() is correct Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-07 18:04   ` Andy Lutomirski
                     ` (2 more replies)
  2020-05-05 13:16 ` [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi() Thomas Gleixner
  2020-05-07 18:05 ` [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Andy Lutomirski
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Peter Zijlstra <peterz@infradead.org>

A few exceptions (like #DB and #BP) can happen at any location in the code;
this means that tracers should treat events from these exceptions as
NMI-like. The interrupted context could be holding locks with interrupts
disabled, for instance.

Similarly, #MC is an actual NMI-like exception.

All of them use ist_enter(), which only concerns itself with RCU but does
not do any of the other setup that NMIs need. This means things like:

	printk()
	  raw_spin_lock_irq(&logbuf_lock);
	  <#DB/#BP/#MC>
	     printk()
	       raw_spin_lock_irq(&logbuf_lock);

are entirely possible (well, not really since printk tries hard to
play nice, but the concept stands).

So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter()
caller must be both notrace and NOKPROBE, or in the noinstr text section.
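
For illustration, the shape such a caller ends up with (do_example_trap() is
a made-up handler, not part of this series):

	dotraplinkage void notrace do_example_trap(struct pt_regs *regs, long error_code)
	{
		nmi_enter();
		/* handler body; nothing traceable/probeable may run before nmi_enter() */
		nmi_exit();
	}
	NOKPROBE_SYMBOL(do_example_trap);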

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/traps.h      |    3 -
 arch/x86/kernel/cpu/mce/core.c    |    5 +-
 arch/x86/kernel/cpu/mce/p5.c      |    5 +-
 arch/x86/kernel/cpu/mce/winchip.c |    5 +-
 arch/x86/kernel/traps.c           |   71 ++++++++------------------------------
 5 files changed, 24 insertions(+), 65 deletions(-)

--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -79,9 +79,6 @@ void smp_spurious_interrupt(struct pt_re
 void smp_error_interrupt(struct pt_regs *regs);
 asmlinkage void smp_irq_move_cleanup_interrupt(void);
 
-extern void ist_enter(struct pt_regs *regs);
-extern void ist_exit(struct pt_regs *regs);
-
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(const char *message,
 				      struct pt_regs *regs,
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -43,6 +43,7 @@
 #include <linux/jump_label.h>
 #include <linux/set_memory.h>
 #include <linux/task_work.h>
+#include <linux/hardirq.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -1266,7 +1267,7 @@ void noinstr do_machine_check(struct pt_
 	if (__mc_check_crashing_cpu(cpu))
 		return;
 
-	ist_enter(regs);
+	nmi_enter();
 
 	this_cpu_inc(mce_exception_count);
 
@@ -1374,7 +1375,7 @@ void noinstr do_machine_check(struct pt_
 	}
 
 out_ist:
-	ist_exit(regs);
+	nmi_exit();
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
--- a/arch/x86/kernel/cpu/mce/p5.c
+++ b/arch/x86/kernel/cpu/mce/p5.c
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/smp.h>
+#include <linux/hardirq.h>
 
 #include <asm/processor.h>
 #include <asm/traps.h>
@@ -24,7 +25,7 @@ static void pentium_machine_check(struct
 {
 	u32 loaddr, hi, lotype;
 
-	ist_enter(regs);
+	nmi_enter();
 
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
 	rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
@@ -39,7 +40,7 @@ static void pentium_machine_check(struct
 
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
-	ist_exit(regs);
+	nmi_exit();
 }
 
 /* Set up machine check reporting for processors with Intel style MCE: */
--- a/arch/x86/kernel/cpu/mce/winchip.c
+++ b/arch/x86/kernel/cpu/mce/winchip.c
@@ -6,6 +6,7 @@
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
+#include <linux/hardirq.h>
 
 #include <asm/processor.h>
 #include <asm/traps.h>
@@ -18,12 +19,12 @@
 /* Machine check handler for WinChip C6: */
 static void winchip_machine_check(struct pt_regs *regs, long error_code)
 {
-	ist_enter(regs);
+	nmi_enter();
 
 	pr_emerg("CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
-	ist_exit(regs);
+	nmi_exit();
 }
 
 /* Set up machine check reporting on the Winchip C6 series */
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -37,10 +37,12 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/hardirq.h>
+#include <linux/atomic.h>
+
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
 #include <asm/debugreg.h>
-#include <linux/atomic.h>
 #include <asm/text-patching.h>
 #include <asm/ftrace.h>
 #include <asm/traps.h>
@@ -82,41 +84,6 @@ static inline void cond_local_irq_disabl
 		local_irq_disable();
 }
 
-/*
- * In IST context, we explicitly disable preemption.  This serves two
- * purposes: it makes it much less likely that we would accidentally
- * schedule in IST context and it will force a warning if we somehow
- * manage to schedule by accident.
- */
-void ist_enter(struct pt_regs *regs)
-{
-	if (user_mode(regs)) {
-		RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
-	} else {
-		/*
-		 * We might have interrupted pretty much anything.  In
-		 * fact, if we're a machine check, we can even interrupt
-		 * NMI processing.  We don't want in_nmi() to return true,
-		 * but we need to notify RCU.
-		 */
-		rcu_nmi_enter();
-	}
-
-	preempt_disable();
-
-	/* This code is a bit fragile.  Test it. */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
-}
-NOKPROBE_SYMBOL(ist_enter);
-
-void ist_exit(struct pt_regs *regs)
-{
-	preempt_enable_no_resched();
-
-	if (!user_mode(regs))
-		rcu_nmi_exit();
-}
-
 int is_valid_bugaddr(unsigned long addr)
 {
 	unsigned short ud;
@@ -366,7 +333,7 @@ dotraplinkage void do_double_fault(struc
 	 * The net result is that our #GP handler will think that we
 	 * entered from usermode with the bad user context.
 	 *
-	 * No need for ist_enter here because we don't use RCU.
+	 * No need for nmi_enter() here because we don't use RCU.
 	 */
 	if (((long)regs->sp >> P4D_SHIFT) == ESPFIX_PGD_ENTRY &&
 		regs->cs == __KERNEL_CS &&
@@ -406,7 +373,7 @@ dotraplinkage void do_double_fault(struc
 	}
 #endif
 
-	ist_enter(regs);
+	nmi_enter();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
 	tsk->thread.error_code = error_code;
@@ -603,19 +570,13 @@ dotraplinkage void notrace do_int3(struc
 		return;
 
 	/*
-	 * Unlike any other non-IST entry, we can be called from a kprobe in
-	 * non-CONTEXT_KERNEL kernel mode or even during context tracking
-	 * state changes.  Make sure that we wake up RCU even if we're coming
-	 * from kernel code.
-	 *
-	 * This means that we can't schedule even if we came from a
-	 * preemptible kernel context.  That's okay.
+	 * Unlike any other non-IST entry, we can be called from pretty much
+	 * any location in the kernel through kprobes -- text_poke() will most
+	 * likely be handled by poke_int3_handler() above. This means this
+	 * handler is effectively NMI-like.
 	 */
-	if (!user_mode(regs)) {
-		rcu_nmi_enter();
-		preempt_disable();
-	}
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+	if (!user_mode(regs))
+		nmi_enter();
 
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -637,10 +598,8 @@ dotraplinkage void notrace do_int3(struc
 	cond_local_irq_disable(regs);
 
 exit:
-	if (!user_mode(regs)) {
-		preempt_enable_no_resched();
-		rcu_nmi_exit();
-	}
+	if (!user_mode(regs))
+		nmi_exit();
 }
 NOKPROBE_SYMBOL(do_int3);
 
@@ -745,7 +704,7 @@ dotraplinkage void do_debug(struct pt_re
 	unsigned long dr6;
 	int si_code;
 
-	ist_enter(regs);
+	nmi_enter();
 
 	get_debugreg(dr6, 6);
 	/*
@@ -838,7 +797,7 @@ dotraplinkage void do_debug(struct pt_re
 	debug_stack_usage_dec();
 
 exit:
-	ist_exit(regs);
+	nmi_exit();
 }
 NOKPROBE_SYMBOL(do_debug);
 


^ permalink raw reply	[flat|nested] 178+ messages in thread

* [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (34 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter() Thomas Gleixner
@ 2020-05-05 13:16 ` Thomas Gleixner
  2020-05-05 18:13   ` Paul E. McKenney
                     ` (2 more replies)
  2020-05-07 18:05 ` [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Andy Lutomirski
  36 siblings, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 13:16 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

From: Paul E. McKenney <paulmck@kernel.org>

The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
"irq" parameter that indicates whether these functions are invoked from
an irq handler (irq==true) or an NMI handler (irq==false).  However,
recent changes have applied notrace to a few critical functions such
that rcu_nmi_enter_common() and rcu_nmi_exit_common() may now rely
on in_nmi().  Note that in_nmi() works no differently than before,
but rather that tracing is now prohibited in code regions where in_nmi()
would incorrectly report NMI state.

Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
respectively.
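
Condensed from the diff below, the resulting pattern inside the shared code
is simply:

	if (!in_nmi())			/* previously: if (irq) */
		rcu_prepare_for_idle();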

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rcu/tree.c |   47 +++++++++++++++--------------------------------
 1 file changed, 15 insertions(+), 32 deletions(-)

--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -627,16 +627,18 @@ noinstr void rcu_user_enter(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
-/*
+/**
+ * rcu_nmi_exit - inform RCU of exit from NMI context
+ *
  * If we are returning from the outermost NMI handler that interrupted an
  * RCU-idle period, update rdp->dynticks and rdp->dynticks_nmi_nesting
  * to let the RCU grace-period handling know that the CPU is back to
  * being RCU-idle.
  *
- * If you add or remove a call to rcu_nmi_exit_common(), be sure to test
+ * If you add or remove a call to rcu_nmi_exit(), be sure to test
  * with CONFIG_RCU_EQS_DEBUG=y.
  */
-static __always_inline void rcu_nmi_exit_common(bool irq)
+noinstr void rcu_nmi_exit(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
@@ -667,7 +669,7 @@ static __always_inline void rcu_nmi_exit
 	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
 
-	if (irq)
+	if (!in_nmi())
 		rcu_prepare_for_idle();
 	instr_end();
 
@@ -675,22 +677,11 @@ static __always_inline void rcu_nmi_exit
 	rcu_dynticks_eqs_enter();
 	// ... but is no longer watching here.
 
-	if (irq)
+	if (!in_nmi())
 		rcu_dynticks_task_enter();
 }
 
 /**
- * rcu_nmi_exit - inform RCU of exit from NMI context
- *
- * If you add or remove a call to rcu_nmi_exit(), be sure to test
- * with CONFIG_RCU_EQS_DEBUG=y.
- */
-void noinstr rcu_nmi_exit(void)
-{
-	rcu_nmi_exit_common(false);
-}
-
-/**
  * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
  *
  * Exit from an interrupt handler, which might possibly result in entering
@@ -712,7 +703,7 @@ void noinstr rcu_nmi_exit(void)
 void noinstr rcu_irq_exit(void)
 {
 	lockdep_assert_irqs_disabled();
-	rcu_nmi_exit_common(true);
+	rcu_nmi_exit();
 }
 
 /*
@@ -801,7 +792,7 @@ void noinstr rcu_user_exit(void)
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
- * rcu_nmi_enter_common - inform RCU of entry to NMI context
+ * rcu_nmi_enter - inform RCU of entry to NMI context
  * @irq: Is this call from rcu_irq_enter?
  *
  * If the CPU was idle from RCU's viewpoint, update rdp->dynticks and
@@ -810,10 +801,10 @@ void noinstr rcu_user_exit(void)
  * long as the nesting level does not overflow an int.  (You will probably
  * run out of stack space first.)
  *
- * If you add or remove a call to rcu_nmi_enter_common(), be sure to test
+ * If you add or remove a call to rcu_nmi_enter(), be sure to test
  * with CONFIG_RCU_EQS_DEBUG=y.
  */
-static __always_inline void rcu_nmi_enter_common(bool irq)
+noinstr void rcu_nmi_enter(void)
 {
 	long incby = 2;
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
@@ -831,18 +822,18 @@ static __always_inline void rcu_nmi_ente
 	 */
 	if (rcu_dynticks_curr_cpu_in_eqs()) {
 
-		if (irq)
+		if (!in_nmi())
 			rcu_dynticks_task_exit();
 
 		// RCU is not watching here ...
 		rcu_dynticks_eqs_exit();
 		// ... but is watching here.
 
-		if (irq)
+		if (!in_nmi())
 			rcu_cleanup_after_idle();
 
 		incby = 1;
-	} else if (irq) {
+	} else if (!in_nmi()) {
 		instr_begin();
 		if (tick_nohz_full_cpu(rdp->cpu) &&
 		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
@@ -877,14 +868,6 @@ static __always_inline void rcu_nmi_ente
 }
 
 /**
- * rcu_nmi_enter - inform RCU of entry to NMI context
- */
-noinstr void rcu_nmi_enter(void)
-{
-	rcu_nmi_enter_common(false);
-}
-
-/**
  * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
  *
  * Enter an interrupt handler, which might possibly result in exiting
@@ -909,7 +892,7 @@ noinstr void rcu_nmi_enter(void)
 noinstr void rcu_irq_enter(void)
 {
 	lockdep_assert_irqs_disabled();
-	rcu_nmi_enter_common(true);
+	rcu_nmi_enter();
 }
 
 /*


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation
  2020-05-05 13:16 ` [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation Thomas Gleixner
@ 2020-05-05 16:04   ` Mark Rutland
  2020-05-07 23:41   ` Steven Rostedt
  2020-05-12 14:36   ` [tip: locking/kcsan] " tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 178+ messages in thread
From: Mark Rutland @ 2020-05-05 16:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:09PM +0200, Thomas Gleixner wrote:
> Currently instrumentation of atomic primitives is done at the
> architecture level, while composites or fallbacks are provided at the
> generic level.
> 
> The result is that there are no uninstrumented variants of the
> fallbacks. Since there is now need of such (see the next patch),
> invert this ordering.
> 
> Doing this means moving the instrumentation into the generic code as
> well as having (for now) two variants of the fallbacks.
> 
> Notes:
> 
>  - the various *cond_read* primitives are not proper fallbacks
>    and got moved into linux/atomic.c. No arch_ variants are
>    generated because the base primitives smp_cond_load*()
>    are instrumented.
> 
>  - once all architectures are moved over to arch_atomic_ we can remove
>    one of the fallback variants and reclaim some 2300 lines.
> 
>  - atomic_{read,set}*() are no longer double-instrumented
> 
> Reported-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mark Rutland <mark.rutland@arm.com>

FWIW, the scripting changes all look fine to me, so:

Acked-by: Mark Rutland <mark.rutland@arm.com>

I'm hoping that I can convert the remaining arches over to arch_atomic
atop of this, at which point we can remove the duplication, and have the
usual arch_<foo>/raw_<foo>/<foo> split.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-05-05 13:16 ` [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
@ 2020-05-05 17:54   ` Paul E. McKenney
  2020-05-05 21:50     ` Thomas Gleixner
  2020-05-06  7:00   ` Paolo Bonzini
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 178+ messages in thread
From: Paul E. McKenney @ 2020-05-05 17:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, May 05, 2020 at 03:16:14PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> While working on the entry consolidation I stumbled over the KVM async page
> fault handler and kvm_async_pf_task_wait() in particular. It took me a
> while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
> invocations are just cargo cult programming. Several patches "fixed" RCU
> splats by curing the symptoms without noticing that the code is flawed 
> from a design perspective.
> 
> The main problem is that this async injection is not based on a proper
> handshake mechanism and only respects the minimal requirement, i.e. the
> guest is not in a state where it has interrupts disabled.
> 
> Aside of that the actual code is a convoluted one fits it all swiss army
> knife. It is invoked from different places with different RCU constraints:
> 
>   1) Host side:
> 
>      vcpu_enter_guest()
>        kvm_x86_ops->handle_exit()
>          kvm_handle_page_fault()
>            kvm_async_pf_task_wait()
> 
>      The invocation happens from fully preemptible context.
> 
>   2) Guest side:
> 
>      The async page fault interrupted:
> 
>          a) user space
> 
> 	 b) preemptible kernel code which is not in a RCU read side
> 	    critical section
> 
>      	 c) non-preemtible kernel code or a RCU read side critical section
> 	    or kernel code with CONFIG_PREEMPTION=n which allows not to
> 	    differentiate between #2b and #2c.
> 
> RCU is watching for:
> 
>   #1  The vCPU exited and current is definitely not the idle task
> 
>   #2a The #PF entry code on the guest went through enter_from_user_mode()
>       which reactivates RCU

I have to double-check...  The NO_HZ_FULL case transitioning to/from
userspace is entirely non-preemptible, correct?  (After rcu_user_enter()
and before rcu_user_exit(), respectively.)

							Thanx, Paul

>   #2b There is no preemptible, interrupts enabled code in the kernel
>       which can run with RCU looking away. (The idle task is always
>       non preemptible).
> 
> I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
> voodoo at all.
> 
> In #2c RCU is eventually not watching, but as that state cannot schedule
> anyway there is no point to worry about it so it has to invoke
> rcu_irq_enter() before running that code. This can be optimized, but this
> will be done as an extra step in course of the entry code consolidation
> work.
> 
> So the proper solution for this is to:
> 
>   - Split kvm_async_pf_task_wait() into schedule and halt based waiting
>     interfaces which share the enqueueing code.
> 
>   - Add comments (condensed form of this changelog) to spare others the
>     time waste and pain of reverse engineering all of this with the help of
>     uncomprehensible changelogs and code history.
> 
>   - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
>     user mode and schedulable kernel side async page faults (#1, #2a, #2b)
> 
>   - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
>     case (#2c).
> 
>     For this case also remove the rcu_irq_exit()/enter() pair around the
>     halt as it is just a pointless exercise:
> 
>        - vCPUs can VMEXIT at any random point and can be scheduled out for
>          an arbitrary amount of time by the host and this is not any
>          different except that it voluntary triggers the exit via halt.
> 
>        - The interrupted context could have RCU watching already. So the
> 	 rcu_irq_exit() before the halt is not gaining anything aside of
> 	 confusing the reader. Claiming that this might prevent RCU stalls
> 	 is just an illusion.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Panic if async #PF is injected into an interrupt disabled region.
> ---
>  arch/x86/include/asm/kvm_para.h |    4 
>  arch/x86/kernel/kvm.c           |  201 ++++++++++++++++++++++++++++------------
>  arch/x86/kvm/mmu/mmu.c          |    2 
>  3 files changed, 144 insertions(+), 63 deletions(-)
> 
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsign
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  unsigned int kvm_arch_para_hints(void);
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
> +void kvm_async_pf_task_wait_schedule(u32 token);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  void kvm_disable_steal_time(void);
> @@ -113,7 +113,7 @@ static inline void kvm_spinlock_init(voi
>  #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>  
>  #else /* CONFIG_KVM_GUEST */
> -#define kvm_async_pf_task_wait(T, I) do {} while(0)
> +#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,7 @@ struct kvm_task_sleep_node {
>  	struct swait_queue_head wq;
>  	u32 token;
>  	int cpu;
> -	bool halted;
> +	bool use_halt;
>  };
>  
>  static struct kvm_task_sleep_head {
> @@ -98,75 +98,145 @@ static struct kvm_task_sleep_node *_find
>  	return NULL;
>  }
>  
> -/*
> - * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> - * 		      (other than user space)?
> - */
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
> +static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> +				    struct kvm_task_sleep_node *n)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> -	struct kvm_task_sleep_node n, *e;
> -	DECLARE_SWAITQUEUE(wait);
> -
> -	rcu_irq_enter();
> +	struct kvm_task_sleep_node *e;
>  
>  	raw_spin_lock(&b->lock);
>  	e = _find_apf_task(b, token);
>  	if (e) {
>  		/* dummy entry exist -> wake up was delivered ahead of PF */
>  		hlist_del(&e->link);
> -		kfree(e);
>  		raw_spin_unlock(&b->lock);
> +		kfree(e);
> +		return false;
> +	}
>  
> -		rcu_irq_exit();
> +	n->token = token;
> +	n->cpu = smp_processor_id();
> +	n->use_halt = use_halt;
> +	init_swait_queue_head(&n->wq);
> +	hlist_add_head(&n->link, &b->list);
> +	raw_spin_unlock(&b->lock);
> +	return true;
> +}
> +
> +/*
> + * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
> + * @token:	Token to identify the sleep node entry
> + *
> + * Invoked from the async pagefault handling code or from the VM exit page
> + * fault handler. In both cases RCU is watching.
> + */
> +void kvm_async_pf_task_wait_schedule(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +	DECLARE_SWAITQUEUE(wait);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (!kvm_async_pf_queue_task(token, false, &n))
>  		return;
> +
> +	for (;;) {
> +		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
> +
> +		local_irq_enable();
> +		schedule();
> +		local_irq_disable();
>  	}
> +	finish_swait(&n.wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>  
> -	n.token = token;
> -	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) ||
> -		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> -		    ? preempt_count() > 1 || rcu_preempt_depth()
> -		    : interrupt_kernel);
> -	init_swait_queue_head(&n.wq);
> -	hlist_add_head(&n.link, &b->list);
> -	raw_spin_unlock(&b->lock);
> +/*
> + * Invoked from the async page fault handler.
> + */
> +static void kvm_async_pf_task_wait_halt(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +
> +	if (!kvm_async_pf_queue_task(token, true, &n))
> +		return;
>  
>  	for (;;) {
> -		if (!n.halted)
> -			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>  		if (hlist_unhashed(&n.link))
>  			break;
> +		/*
> +		 * No point in doing anything about RCU here. Any RCU read
> +		 * side critical section or RCU watching section can be
> +		 * interrupted by VMEXITs and the host is free to keep the
> +		 * vCPU scheduled out as long as it sees fit. This is not
> +		 * any different just because of the halt induced voluntary
> +		 * VMEXIT.
> +		 *
> +		 * Also the async page fault could have interrupted any RCU
> +		 * watching context, so invoking rcu_irq_exit()/enter()
> +		 * around this is not gaining anything.
> +		 */
> +		native_safe_halt();
> +		local_irq_disable();
> +	}
> +}
>  
> -		rcu_irq_exit();
> +/* Invoked from the async page fault handler */
> +static void kvm_async_pf_task_wait(u32 token, bool usermode)
> +{
> +	bool can_schedule;
>  
> -		if (!n.halted) {
> -			local_irq_enable();
> -			schedule();
> -			local_irq_disable();
> -		} else {
> -			/*
> -			 * We cannot reschedule. So halt.
> -			 */
> -			native_safe_halt();
> -			local_irq_disable();
> -		}
> +	/*
> +	 * No need to check whether interrupts were disabled because the
> +	 * host will (hopefully) only inject an async page fault into
> +	 * interrupt enabled regions.
> +	 *
> +	 * If CONFIG_PREEMPTION is enabled then check whether the code
> +	 * which triggered the page fault is preemptible. This covers user
> +	 * mode as well because preempt_count() is obviously 0 there.
> +	 *
> +	 * The check for rcu_preempt_depth() is also required because
> +	 * voluntary scheduling inside a rcu read locked section is not
> +	 * allowed.
> +	 *
> +	 * The idle task is already covered by this because idle always
> +	 * has a preempt count > 0.
> +	 *
> +	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> +	 * coming from user mode as there is no indication whether the
> +	 * context which triggered the page fault could schedule or not.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPTION))
> +		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> +	else
> +		can_schedule = usermode;
>  
> +	/*
> +	 * If the kernel context is allowed to schedule then RCU is
> +	 * watching because no preemptible code in the kernel is inside RCU
> +	 * idle state. So it can be treated like user mode. User mode is
> +	 * safe because the #PF entry invoked enter_from_user_mode().
> +	 *
> +	 * For the non schedulable case invoke rcu_irq_enter() for
> +	 * now. This will be moved out to the pagefault entry code later
> +	 * and only invoked when really needed.
> +	 */
> +	if (can_schedule) {
> +		kvm_async_pf_task_wait_schedule(token);
> +	} else {
>  		rcu_irq_enter();
> +		kvm_async_pf_task_wait_halt(token);
> +		rcu_irq_exit();
>  	}
> -	if (!n.halted)
> -		finish_swait(&n.wq, &wait);
> -
> -	rcu_irq_exit();
> -	return;
>  }
> -EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
>  
>  static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>  {
>  	hlist_del_init(&n->link);
> -	if (n->halted)
> +	if (n->use_halt)
>  		smp_send_reschedule(n->cpu);
>  	else if (swq_has_sleeper(&n->wq))
>  		swake_up_one(&n->wq);
> @@ -177,12 +247,13 @@ static void apf_task_wake_all(void)
>  	int i;
>  
>  	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
> -		struct hlist_node *p, *next;
>  		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
> +		struct kvm_task_sleep_node *n;
> +		struct hlist_node *p, *next;
> +
>  		raw_spin_lock(&b->lock);
>  		hlist_for_each_safe(p, next, &b->list) {
> -			struct kvm_task_sleep_node *n =
> -				hlist_entry(p, typeof(*n), link);
> +			n = hlist_entry(p, typeof(*n), link);
>  			if (n->cpu == smp_processor_id())
>  				apf_task_wake_one(n);
>  		}
> @@ -223,8 +294,9 @@ void kvm_async_pf_task_wake(u32 token)
>  		n->cpu = smp_processor_id();
>  		init_swait_queue_head(&n->wq);
>  		hlist_add_head(&n->link, &b->list);
> -	} else
> +	} else {
>  		apf_task_wake_one(n);
> +	}
>  	raw_spin_unlock(&b->lock);
>  	return;
>  }
> @@ -246,23 +318,33 @@ NOKPROBE_SYMBOL(kvm_read_and_reset_pf_re
>  
>  bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>  {
> -	/*
> -	 * If we get a page fault right here, the pf_reason seems likely
> -	 * to be clobbered.  Bummer.
> -	 */
> -	switch (kvm_read_and_reset_pf_reason()) {
> +	u32 reason = kvm_read_and_reset_pf_reason();
> +
> +	switch (reason) {
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +	case KVM_PV_REASON_PAGE_READY:
> +		break;
>  	default:
>  		return false;
> -	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +	}
> +
> +	/*
> +	 * If the host managed to inject an async #PF into an interrupt
> +	 * disabled region, then die hard as this is not going to end well
> +	 * and the host side is seriously broken.
> +	 */
> +	if (unlikely(!(regs->flags & X86_EFLAGS_IF)))
> +		panic("Host injected async #PF in interrupt disabled region\n");
> +
> +	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, !user_mode(regs));
> -		return true;
> -	case KVM_PV_REASON_PAGE_READY:
> +		kvm_async_pf_task_wait(token, user_mode(regs));
> +	} else {
>  		rcu_irq_enter();
>  		kvm_async_pf_task_wake(token);
>  		rcu_irq_exit();
> -		return true;
>  	}
> +	return true;
>  }
>  NOKPROBE_SYMBOL(__kvm_handle_async_pf);
>  
> @@ -326,12 +408,12 @@ static void kvm_guest_cpu_init(void)
>  
>  		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
>  		__this_cpu_write(apf_reason.enabled, 1);
> -		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> -		       smp_processor_id());
> +		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
>  	}
>  
>  	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
>  		unsigned long pa;
> +
>  		/* Size alignment is implied but just to make it explicit. */
>  		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
>  		__this_cpu_write(kvm_apic_eoi, 0);
> @@ -352,8 +434,7 @@ static void kvm_pv_disable_apf(void)
>  	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
>  	__this_cpu_write(apf_reason.enabled, 0);
>  
> -	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
> -	       smp_processor_id());
> +	pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
>  }
>  
>  static void kvm_pv_guest_cpu_reboot(void *unused)
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4198,7 +4198,7 @@ int kvm_handle_page_fault(struct kvm_vcp
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address, 0);
> +		kvm_async_pf_task_wait_schedule(fault_address);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr
  2020-05-05 13:16 ` [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr Thomas Gleixner
@ 2020-05-05 18:07   ` Paul E. McKenney
  2020-05-19 19:48   ` Joel Fernandes
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 178+ messages in thread
From: Paul E. McKenney @ 2020-05-05 18:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, May 05, 2020 at 03:16:27PM +0200, Thomas Gleixner wrote:
> These functions are invoked from context tracking and other places in the
> low level entry code. Move them into the .noinstr.text section to exclude
> them from instrumentation.
> 
> Mark the places which are safe to invoke traceable functions with
> instr_begin/end() so objtool won't complain.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/tree.c        |   85 +++++++++++++++++++++++++----------------------
>  kernel/rcu/tree_plugin.h |    4 +-
>  kernel/rcu/update.c      |    7 +--
>  3 files changed, 52 insertions(+), 44 deletions(-)
> 
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -75,9 +75,6 @@
>   */
>  #define RCU_DYNTICK_CTRL_MASK 0x1
>  #define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
> -#ifndef rcu_eqs_special_exit
> -#define rcu_eqs_special_exit() do { } while (0)
> -#endif

Joel has a patch series doing a more thorough removal of this function,
but sometimes it is necessary to remove the function twice, just to
be sure.  ;-)

>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>  	.dynticks_nesting = 1,
> @@ -229,7 +226,7 @@ void rcu_softirq_qs(void)
>   * RCU is watching prior to the call to this function and is no longer
>   * watching upon return.
>   */
> -static void rcu_dynticks_eqs_enter(void)
> +static noinstr void rcu_dynticks_eqs_enter(void)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  	int seq;
> @@ -253,7 +250,7 @@ static void rcu_dynticks_eqs_enter(void)
>   * called from an extended quiescent state, that is, RCU is not watching
>   * prior to the call to this function and is watching upon return.
>   */
> -static void rcu_dynticks_eqs_exit(void)
> +static noinstr void rcu_dynticks_eqs_exit(void)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  	int seq;
> @@ -270,8 +267,6 @@ static void rcu_dynticks_eqs_exit(void)
>  	if (seq & RCU_DYNTICK_CTRL_MASK) {
>  		atomic_andnot(RCU_DYNTICK_CTRL_MASK, &rdp->dynticks);
>  		smp_mb__after_atomic(); /* _exit after clearing mask. */
> -		/* Prefer duplicate flushes to losing a flush. */
> -		rcu_eqs_special_exit();
>  	}
>  }
>  
> @@ -299,7 +294,7 @@ static void rcu_dynticks_eqs_online(void
>   *
>   * No ordering, as we are sampling CPU-local information.
>   */
> -static bool rcu_dynticks_curr_cpu_in_eqs(void)
> +static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  
> @@ -566,7 +561,7 @@ EXPORT_SYMBOL_GPL(rcutorture_get_gp_data
>   * the possibility of usermode upcalls having messed up our count
>   * of interrupt nesting level during the prior busy period.
>   */
> -static void rcu_eqs_enter(bool user)
> +static noinstr void rcu_eqs_enter(bool user)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  
> @@ -581,12 +576,14 @@ static void rcu_eqs_enter(bool user)
>  	}
>  
>  	lockdep_assert_irqs_disabled();
> +	instr_begin();
>  	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	rdp = this_cpu_ptr(&rcu_data);
>  	do_nocb_deferred_wakeup(rdp);
>  	rcu_prepare_for_idle();
>  	rcu_preempt_deferred_qs(current);
> +	instr_end();
>  	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
>  	// RCU is watching here ...
>  	rcu_dynticks_eqs_enter();
> @@ -623,7 +620,7 @@ void rcu_idle_enter(void)
>   * If you add or remove a call to rcu_user_enter(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_user_enter(void)
> +noinstr void rcu_user_enter(void)
>  {
>  	lockdep_assert_irqs_disabled();
>  	rcu_eqs_enter(true);
> @@ -656,19 +653,23 @@ static __always_inline void rcu_nmi_exit
>  	 * leave it in non-RCU-idle state.
>  	 */
>  	if (rdp->dynticks_nmi_nesting != 1) {
> +		instr_begin();
>  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
>  				  atomic_read(&rdp->dynticks));
>  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
>  			   rdp->dynticks_nmi_nesting - 2);
> +		instr_end();
>  		return;
>  	}
>  
> +	instr_begin();
>  	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
>  	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
>  
>  	if (irq)
>  		rcu_prepare_for_idle();
> +	instr_end();
>  
>  	// RCU is watching here ...
>  	rcu_dynticks_eqs_enter();
> @@ -684,7 +685,7 @@ static __always_inline void rcu_nmi_exit
>   * If you add or remove a call to rcu_nmi_exit(), be sure to test
>   * with CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_nmi_exit(void)
> +void noinstr rcu_nmi_exit(void)
>  {
>  	rcu_nmi_exit_common(false);
>  }
> @@ -708,7 +709,7 @@ void rcu_nmi_exit(void)
>   * If you add or remove a call to rcu_irq_exit(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_irq_exit(void)
> +void noinstr rcu_irq_exit(void)
>  {
>  	lockdep_assert_irqs_disabled();
>  	rcu_nmi_exit_common(true);
> @@ -737,7 +738,7 @@ void rcu_irq_exit_irqson(void)
>   * allow for the possibility of usermode upcalls messing up our count of
>   * interrupt nesting level during the busy period that is just now starting.
>   */
> -static void rcu_eqs_exit(bool user)
> +static void noinstr rcu_eqs_exit(bool user)
>  {
>  	struct rcu_data *rdp;
>  	long oldval;
> @@ -755,12 +756,14 @@ static void rcu_eqs_exit(bool user)
>  	// RCU is not watching here ...
>  	rcu_dynticks_eqs_exit();
>  	// ... but is watching here.
> +	instr_begin();
>  	rcu_cleanup_after_idle();
>  	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	WRITE_ONCE(rdp->dynticks_nesting, 1);
>  	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
> +	instr_end();
>  }
>  
>  /**
> @@ -791,7 +794,7 @@ void rcu_idle_exit(void)
>   * If you add or remove a call to rcu_user_exit(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_user_exit(void)
> +void noinstr rcu_user_exit(void)
>  {
>  	rcu_eqs_exit(1);
>  }
> @@ -839,28 +842,35 @@ static __always_inline void rcu_nmi_ente
>  			rcu_cleanup_after_idle();
>  
>  		incby = 1;
> -	} else if (irq && tick_nohz_full_cpu(rdp->cpu) &&
> -		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
> -		   READ_ONCE(rdp->rcu_urgent_qs) &&
> -		   !READ_ONCE(rdp->rcu_forced_tick)) {
> -		// We get here only if we had already exited the extended
> -		// quiescent state and this was an interrupt (not an NMI).
> -		// Therefore, (1) RCU is already watching and (2) The fact
> -		// that we are in an interrupt handler and that the rcu_node
> -		// lock is an irq-disabled lock prevents self-deadlock.
> -		// So we can safely recheck under the lock.
> -		raw_spin_lock_rcu_node(rdp->mynode);
> -		if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> -			// A nohz_full CPU is in the kernel and RCU
> -			// needs a quiescent state.  Turn on the tick!
> -			WRITE_ONCE(rdp->rcu_forced_tick, true);
> -			tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
> +	} else if (irq) {
> +		instr_begin();
> +		if (tick_nohz_full_cpu(rdp->cpu) &&
> +		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
> +		    READ_ONCE(rdp->rcu_urgent_qs) &&
> +		    !READ_ONCE(rdp->rcu_forced_tick)) {
> +			// We get here only if we had already exited the
> +			// extended quiescent state and this was an
> +			// interrupt (not an NMI).  Therefore, (1) RCU is
> +			// already watching and (2) The fact that we are in
> +			// an interrupt handler and that the rcu_node lock
> +			// is an irq-disabled lock prevents self-deadlock.
> +			// So we can safely recheck under the lock.
> +			raw_spin_lock_rcu_node(rdp->mynode);
> +			if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> +				// A nohz_full CPU is in the kernel and RCU
> +				// needs a quiescent state.  Turn on the tick!
> +				WRITE_ONCE(rdp->rcu_forced_tick, true);
> +				tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
> +			}
> +			raw_spin_unlock_rcu_node(rdp->mynode);
>  		}
> -		raw_spin_unlock_rcu_node(rdp->mynode);
> +		instr_end();
>  	}
> +	instr_begin();
>  	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
>  			  rdp->dynticks_nmi_nesting,
>  			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
> +	instr_end();
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
>  		   rdp->dynticks_nmi_nesting + incby);
>  	barrier();
> @@ -869,11 +879,10 @@ static __always_inline void rcu_nmi_ente
>  /**
>   * rcu_nmi_enter - inform RCU of entry to NMI context
>   */
> -void rcu_nmi_enter(void)
> +noinstr void rcu_nmi_enter(void)
>  {
>  	rcu_nmi_enter_common(false);
>  }
> -NOKPROBE_SYMBOL(rcu_nmi_enter);
>  
>  /**
>   * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
> @@ -897,7 +906,7 @@ NOKPROBE_SYMBOL(rcu_nmi_enter);
>   * If you add or remove a call to rcu_irq_enter(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_irq_enter(void)
> +noinstr void rcu_irq_enter(void)
>  {
>  	lockdep_assert_irqs_disabled();
>  	rcu_nmi_enter_common(true);
> @@ -942,7 +951,7 @@ static void rcu_disable_urgency_upon_qs(
>   * if the current CPU is not in its idle loop or is in an interrupt or
>   * NMI handler, return true.
>   */
> -bool notrace rcu_is_watching(void)
> +noinstr bool rcu_is_watching(void)
>  {
>  	bool ret;
>  
> @@ -986,7 +995,7 @@ void rcu_request_urgent_qs_task(struct t
>   * RCU on an offline processor during initial boot, hence the check for
>   * rcu_scheduler_fully_active.
>   */
> -bool rcu_lockdep_current_cpu_online(void)
> +noinstr bool rcu_lockdep_current_cpu_online(void)
>  {
>  	struct rcu_data *rdp;
>  	struct rcu_node *rnp;
> @@ -994,12 +1003,12 @@ bool rcu_lockdep_current_cpu_online(void
>  
>  	if (in_nmi() || !rcu_scheduler_fully_active)
>  		return true;
> -	preempt_disable();
> +	preempt_disable_notrace();
>  	rdp = this_cpu_ptr(&rcu_data);
>  	rnp = rdp->mynode;
>  	if (rdp->grpmask & rcu_rnp_online_cpus(rnp))
>  		ret = true;
> -	preempt_enable();
> +	preempt_enable_notrace();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2553,7 +2553,7 @@ static void rcu_bind_gp_kthread(void)
>  }
>  
>  /* Record the current task on dyntick-idle entry. */
> -static void rcu_dynticks_task_enter(void)
> +static void noinstr rcu_dynticks_task_enter(void)
>  {
>  #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
>  	WRITE_ONCE(current->rcu_tasks_idle_cpu, smp_processor_id());
> @@ -2561,7 +2561,7 @@ static void rcu_dynticks_task_enter(void
>  }
>  
>  /* Record no current task on dyntick-idle exit. */
> -static void rcu_dynticks_task_exit(void)
> +static void noinstr rcu_dynticks_task_exit(void)
>  {
>  #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
>  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -95,7 +95,7 @@ module_param(rcu_normal_after_boot, int,
>   * Similarly, we avoid claiming an RCU read lock held if the current
>   * CPU is offline.
>   */
> -static bool rcu_read_lock_held_common(bool *ret)
> +static noinstr bool rcu_read_lock_held_common(bool *ret)
>  {
>  	if (!debug_lockdep_rcu_enabled()) {
>  		*ret = 1;
> @@ -112,7 +112,7 @@ static bool rcu_read_lock_held_common(bo
>  	return false;
>  }
>  
> -int rcu_read_lock_sched_held(void)
> +noinstr int rcu_read_lock_sched_held(void)
>  {
>  	bool ret;
>  
> @@ -270,13 +270,12 @@ struct lockdep_map rcu_callback_map =
>  	STATIC_LOCKDEP_MAP_INIT("rcu_callback", &rcu_callback_key);
>  EXPORT_SYMBOL_GPL(rcu_callback_map);
>  
> -int notrace debug_lockdep_rcu_enabled(void)
> +noinstr int notrace debug_lockdep_rcu_enabled(void)
>  {
>  	return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
>  	       current->lockdep_recursion == 0;
>  }
>  EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
> -NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled);
>  
>  /**
>   * rcu_read_lock_held() - might we be in RCU read-side critical section?
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread
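For reference, the annotation pattern applied in the patch above, condensed into a stand-alone sketch. Only the noinstr marking and the instr_begin()/instr_end() pair come from this series; the function name and the tracepoint below are made up for illustration:

    static noinstr void example_eqs_enter(void)
    {
    	/* Entry code calls this with interrupts off; keep it that way. */
    	lockdep_assert_irqs_disabled();

    	instr_begin();			/* instrumentable region starts          */
    	trace_example_event();		/* tracepoints, lockdep etc. are fine here */
    	instr_end();			/* instrumentable region ends            */

    	/* From here on RCU may stop watching: no instrumentation allowed. */
    	rcu_dynticks_eqs_enter();
    }

objtool then only has to verify the code outside the instr_begin()/instr_end() brackets.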

* Re: [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
  2020-05-05 13:16 ` [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi() Thomas Gleixner
@ 2020-05-05 18:13   ` Paul E. McKenney
  2020-05-06 17:09   ` Alexandre Chartre
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Paul E. McKenney
  2 siblings, 0 replies; 178+ messages in thread
From: Paul E. McKenney @ 2020-05-05 18:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:38PM +0200, Thomas Gleixner wrote:
> From: Paul E. McKenney <paulmck@kernel.org>
> 
> The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
> "irq" parameter that indicates whether these functions are invoked from
> an irq handler (irq==true) or an NMI handler (irq==false).  However,
> recent changes have applied notrace to a few critical functions such
> that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely
> on in_nmi().  Note that in_nmi() works no differently than before,
> but rather that tracing is now prohibited in code regions where in_nmi()
> would incorrectly report NMI state.
> 
> Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
> rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
> respectively.

Not a bad job of ghostwriting, actually.  I had forgotten about this
entirely, but did find my February 13th email containing this patch.  ;-)

But why not make the commit log official?

------------------------------------------------------------------------

The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
"irq" parameter that indicates whether these functions have been invoked
from an irq handler (irq==true) or an NMI handler (irq==false).  However,
recent changes have applied notrace to a number of critical functions,
thus allowing rcu_nmi_enter_common() and rcu_nmi_exit_common() to rely
on in_nmi().  Note that in_nmi() works no differently than before.
Instead, tracing is now prohibited in code regions where in_nmi() would
previously have incorrectly reported NMI state.

This commit therefore removes the "irq" parameter and inlines
rcu_nmi_enter_common() and rcu_nmi_exit_common() into rcu_nmi_enter()
and rcu_nmi_exit(), respectively.

------------------------------------------------------------------------

With that commit log, looks good to me!

							Thanx, Paul

> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/rcu/tree.c |   47 +++++++++++++++--------------------------------
>  1 file changed, 15 insertions(+), 32 deletions(-)
> 
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -627,16 +627,18 @@ noinstr void rcu_user_enter(void)
>  }
>  #endif /* CONFIG_NO_HZ_FULL */
>  
> -/*
> +/**
> + * rcu_nmi_exit - inform RCU of exit from NMI context
> + *
>   * If we are returning from the outermost NMI handler that interrupted an
>   * RCU-idle period, update rdp->dynticks and rdp->dynticks_nmi_nesting
>   * to let the RCU grace-period handling know that the CPU is back to
>   * being RCU-idle.
>   *
> - * If you add or remove a call to rcu_nmi_exit_common(), be sure to test
> + * If you add or remove a call to rcu_nmi_exit(), be sure to test
>   * with CONFIG_RCU_EQS_DEBUG=y.
>   */
> -static __always_inline void rcu_nmi_exit_common(bool irq)
> +noinstr void rcu_nmi_exit(void)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  
> @@ -667,7 +669,7 @@ static __always_inline void rcu_nmi_exit
>  	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
>  
> -	if (irq)
> +	if (!in_nmi())
>  		rcu_prepare_for_idle();
>  	instr_end();
>  
> @@ -675,22 +677,11 @@ static __always_inline void rcu_nmi_exit
>  	rcu_dynticks_eqs_enter();
>  	// ... but is no longer watching here.
>  
> -	if (irq)
> +	if (!in_nmi())
>  		rcu_dynticks_task_enter();
>  }
>  
>  /**
> - * rcu_nmi_exit - inform RCU of exit from NMI context
> - *
> - * If you add or remove a call to rcu_nmi_exit(), be sure to test
> - * with CONFIG_RCU_EQS_DEBUG=y.
> - */
> -void noinstr rcu_nmi_exit(void)
> -{
> -	rcu_nmi_exit_common(false);
> -}
> -
> -/**
>   * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
>   *
>   * Exit from an interrupt handler, which might possibly result in entering
> @@ -712,7 +703,7 @@ void noinstr rcu_nmi_exit(void)
>  void noinstr rcu_irq_exit(void)
>  {
>  	lockdep_assert_irqs_disabled();
> -	rcu_nmi_exit_common(true);
> +	rcu_nmi_exit();
>  }
>  
>  /*
> @@ -801,7 +792,7 @@ void noinstr rcu_user_exit(void)
>  #endif /* CONFIG_NO_HZ_FULL */
>  
>  /**
> - * rcu_nmi_enter_common - inform RCU of entry to NMI context
> + * rcu_nmi_enter - inform RCU of entry to NMI context
>   * @irq: Is this call from rcu_irq_enter?
>   *
>   * If the CPU was idle from RCU's viewpoint, update rdp->dynticks and
> @@ -810,10 +801,10 @@ void noinstr rcu_user_exit(void)
>   * long as the nesting level does not overflow an int.  (You will probably
>   * run out of stack space first.)
>   *
> - * If you add or remove a call to rcu_nmi_enter_common(), be sure to test
> + * If you add or remove a call to rcu_nmi_enter(), be sure to test
>   * with CONFIG_RCU_EQS_DEBUG=y.
>   */
> -static __always_inline void rcu_nmi_enter_common(bool irq)
> +noinstr void rcu_nmi_enter(void)
>  {
>  	long incby = 2;
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> @@ -831,18 +822,18 @@ static __always_inline void rcu_nmi_ente
>  	 */
>  	if (rcu_dynticks_curr_cpu_in_eqs()) {
>  
> -		if (irq)
> +		if (!in_nmi())
>  			rcu_dynticks_task_exit();
>  
>  		// RCU is not watching here ...
>  		rcu_dynticks_eqs_exit();
>  		// ... but is watching here.
>  
> -		if (irq)
> +		if (!in_nmi())
>  			rcu_cleanup_after_idle();
>  
>  		incby = 1;
> -	} else if (irq) {
> +	} else if (!in_nmi()) {
>  		instr_begin();
>  		if (tick_nohz_full_cpu(rdp->cpu) &&
>  		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
> @@ -877,14 +868,6 @@ static __always_inline void rcu_nmi_ente
>  }
>  
>  /**
> - * rcu_nmi_enter - inform RCU of entry to NMI context
> - */
> -noinstr void rcu_nmi_enter(void)
> -{
> -	rcu_nmi_enter_common(false);
> -}
> -
> -/**
>   * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
>   *
>   * Enter an interrupt handler, which might possibly result in exiting
> @@ -909,7 +892,7 @@ noinstr void rcu_nmi_enter(void)
>  noinstr void rcu_irq_enter(void)
>  {
>  	lockdep_assert_irqs_disabled();
> -	rcu_nmi_enter_common(true);
> +	rcu_nmi_enter();
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing
  2020-05-05 13:16 ` [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing Thomas Gleixner
@ 2020-05-05 20:39   ` Brian Gerst
  2020-05-06 15:42     ` Peter Zijlstra
  2020-05-06 16:03   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 178+ messages in thread
From: Brian Gerst @ 2020-05-05 20:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Paul E. McKenney,
	Andy Lutomirski, Alexandre Chartre, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 10:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The sanitizers are not really applicable to the fragile low level entry
> code. Entry code needs to carefully set up a normal 'runtime'
> environment.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/entry/Makefile |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> --- a/arch/x86/entry/Makefile
> +++ b/arch/x86/entry/Makefile
> @@ -3,6 +3,14 @@
>  # Makefile for the x86 low level entry code
>  #
>
> +KASAN_SANITIZE := n
> +UBSAN_SANITIZE := n
> +KCOV_INSTRUMENT := n
> +
> +CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> +CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> +CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong

Is this necessary for syscall_*.o?  They just contain the syscall
tables (ie. data).

--
Brian Gerst

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-05-05 17:54   ` Paul E. McKenney
@ 2020-05-05 21:50     ` Thomas Gleixner
  0 siblings, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-05 21:50 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, x86, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, May 05, 2020 at 03:16:14PM +0200, Thomas Gleixner wrote:
>> RCU is watching for:
>> 
>>   #1  The vCPU exited and current is definitely not the idle task
>> 
>>   #2a The #PF entry code on the guest went through enter_from_user_mode()
>>       which reactivates RCU
>
> I have to double-check...  The NO_HZ_FULL case transitioning to/from
> userspace is entirely non-preemptible, correct?  (After rcu_user_enter()
> and before rcu_user_exit(), respectively.)

Yes. It runs with interrupts disabled down to the actual return (sysret,
iret).

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 178+ messages in thread
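Condensed, the guarantee confirmed above is that nothing preemptible runs between the point where RCU stops watching and the actual return to user space. Roughly, as pseudo-C with interrupts disabled throughout (all names below are illustrative except rcu_user_enter()):

    	local_irq_disable();		/* interrupts stay off from here on */
    	do_exit_to_user_work();		/* signals, task_work, ...          */
    	rcu_user_enter();		/* RCU stops watching ...           */
    	return_to_user_mode();		/* ... until the final sysret/iret  */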

* Re: [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-05-05 13:16 ` [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
  2020-05-05 17:54   ` Paul E. McKenney
@ 2020-05-06  7:00   ` Paolo Bonzini
  2020-05-06 12:53     ` Steven Rostedt
  2020-05-06 15:13   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  3 siblings, 1 reply; 178+ messages in thread
From: Paolo Bonzini @ 2020-05-06  7:00 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:16, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> While working on the entry consolidation I stumbled over the KVM async page
> fault handler and kvm_async_pf_task_wait() in particular. It took me a
> while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
> invocations are just cargo cult programming. Several patches "fixed" RCU
> splats by curing the symptoms without noticing that the code is flawed 
> from a design perspective.
> 
> The main problem is that this async injection is not based on a proper
> handshake mechanism and only respects the minimal requirement, i.e. the
> guest is not in a state where it has interrupts disabled.
> 
> Aside of that, the actual code is a convoluted one-fits-it-all swiss army
> knife. It is invoked from different places with different RCU constraints:
> 
>   1) Host side:
> 
>      vcpu_enter_guest()
>        kvm_x86_ops->handle_exit()
>          kvm_handle_page_fault()
>            kvm_async_pf_task_wait()
> 
>      The invocation happens from fully preemptible context.
> 
>   2) Guest side:
> 
>      The async page fault interrupted:
> 
>          a) user space
> 
> 	 b) preemptible kernel code which is not in a RCU read side
> 	    critical section
> 
>      	 c) non-preemptible kernel code or a RCU read side critical section
> 	    or kernel code with CONFIG_PREEMPTION=n which allows not to
> 	    differentiate between #2b and #2c.
> 
> RCU is watching for:
> 
>   #1  The vCPU exited and current is definitely not the idle task
> 
>   #2a The #PF entry code on the guest went through enter_from_user_mode()
>       which reactivates RCU
> 
>   #2b There is no preemptible, interrupts enabled code in the kernel
>       which can run with RCU looking away. (The idle task is always
>       non preemptible).
> 
> I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
> voodoo at all.
> 
> In #2c RCU is eventually not watching, but as that state cannot schedule
> anyway there is no point to worry about it so it has to invoke
> rcu_irq_enter() before running that code. This can be optimized, but this
> will be done as an extra step in the course of the entry code consolidation
> work.
> 
> So the proper solution for this is to:
> 
>   - Split kvm_async_pf_task_wait() into schedule and halt based waiting
>     interfaces which share the enqueueing code.
> 
>   - Add comments (condensed form of this changelog) to spare others the
>     time waste and pain of reverse engineering all of this with the help of
>     incomprehensible changelogs and code history.
> 
>   - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
>     user mode and schedulable kernel side async page faults (#1, #2a, #2b)
> 
>   - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
>     case (#2c).
> 
>     For this case also remove the rcu_irq_exit()/enter() pair around the
>     halt as it is just a pointless exercise:
> 
>        - vCPUs can VMEXIT at any random point and can be scheduled out for
>          an arbitrary amount of time by the host and this is not any
>          different except that it voluntary triggers the exit via halt.
> 
>        - The interrupted context could have RCU watching already. So the
> 	 rcu_irq_exit() before the halt is not gaining anything aside of
> 	 confusing the reader. Claiming that this might prevent RCU stalls
> 	 is just an illusion.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Panic if async #PF is injected into an interrupt disabled region.
> ---
>  arch/x86/include/asm/kvm_para.h |    4 
>  arch/x86/kernel/kvm.c           |  201 ++++++++++++++++++++++++++++------------
>  arch/x86/kvm/mmu/mmu.c          |    2 
>  3 files changed, 144 insertions(+), 63 deletions(-)
> 
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsign
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  unsigned int kvm_arch_para_hints(void);
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
> +void kvm_async_pf_task_wait_schedule(u32 token);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  void kvm_disable_steal_time(void);
> @@ -113,7 +113,7 @@ static inline void kvm_spinlock_init(voi
>  #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>  
>  #else /* CONFIG_KVM_GUEST */
> -#define kvm_async_pf_task_wait(T, I) do {} while(0)
> +#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,7 @@ struct kvm_task_sleep_node {
>  	struct swait_queue_head wq;
>  	u32 token;
>  	int cpu;
> -	bool halted;
> +	bool use_halt;
>  };
>  
>  static struct kvm_task_sleep_head {
> @@ -98,75 +98,145 @@ static struct kvm_task_sleep_node *_find
>  	return NULL;
>  }
>  
> -/*
> - * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> - * 		      (other than user space)?
> - */
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
> +static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> +				    struct kvm_task_sleep_node *n)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> -	struct kvm_task_sleep_node n, *e;
> -	DECLARE_SWAITQUEUE(wait);
> -
> -	rcu_irq_enter();
> +	struct kvm_task_sleep_node *e;
>  
>  	raw_spin_lock(&b->lock);
>  	e = _find_apf_task(b, token);
>  	if (e) {
>  		/* dummy entry exist -> wake up was delivered ahead of PF */
>  		hlist_del(&e->link);
> -		kfree(e);
>  		raw_spin_unlock(&b->lock);
> +		kfree(e);
> +		return false;
> +	}
>  
> -		rcu_irq_exit();
> +	n->token = token;
> +	n->cpu = smp_processor_id();
> +	n->use_halt = use_halt;
> +	init_swait_queue_head(&n->wq);
> +	hlist_add_head(&n->link, &b->list);
> +	raw_spin_unlock(&b->lock);
> +	return true;
> +}
> +
> +/*
> + * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
> + * @token:	Token to identify the sleep node entry
> + *
> + * Invoked from the async pagefault handling code or from the VM exit page
> + * fault handler. In both cases RCU is watching.
> + */
> +void kvm_async_pf_task_wait_schedule(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +	DECLARE_SWAITQUEUE(wait);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (!kvm_async_pf_queue_task(token, false, &n))
>  		return;
> +
> +	for (;;) {
> +		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
> +
> +		local_irq_enable();
> +		schedule();
> +		local_irq_disable();
>  	}
> +	finish_swait(&n.wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>  
> -	n.token = token;
> -	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) ||
> -		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> -		    ? preempt_count() > 1 || rcu_preempt_depth()
> -		    : interrupt_kernel);
> -	init_swait_queue_head(&n.wq);
> -	hlist_add_head(&n.link, &b->list);
> -	raw_spin_unlock(&b->lock);
> +/*
> + * Invoked from the async page fault handler.
> + */
> +static void kvm_async_pf_task_wait_halt(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +
> +	if (!kvm_async_pf_queue_task(token, true, &n))
> +		return;
>  
>  	for (;;) {
> -		if (!n.halted)
> -			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>  		if (hlist_unhashed(&n.link))
>  			break;
> +		/*
> +		 * No point in doing anything about RCU here. Any RCU read
> +		 * side critical section or RCU watching section can be
> +		 * interrupted by VMEXITs and the host is free to keep the
> +		 * vCPU scheduled out as long as it sees fit. This is not
> +		 * any different just because of the halt induced voluntary
> +		 * VMEXIT.
> +		 *
> +		 * Also the async page fault could have interrupted any RCU
> +		 * watching context, so invoking rcu_irq_exit()/enter()
> +		 * around this is not gaining anything.
> +		 */
> +		native_safe_halt();
> +		local_irq_disable();
> +	}
> +}
>  
> -		rcu_irq_exit();
> +/* Invoked from the async page fault handler */
> +static void kvm_async_pf_task_wait(u32 token, bool usermode)
> +{
> +	bool can_schedule;
>  
> -		if (!n.halted) {
> -			local_irq_enable();
> -			schedule();
> -			local_irq_disable();
> -		} else {
> -			/*
> -			 * We cannot reschedule. So halt.
> -			 */
> -			native_safe_halt();
> -			local_irq_disable();
> -		}
> +	/*
> +	 * No need to check whether interrupts were disabled because the
> +	 * host will (hopefully) only inject an async page fault into
> +	 * interrupt enabled regions.
> +	 *
> +	 * If CONFIG_PREEMPTION is enabled then check whether the code
> +	 * which triggered the page fault is preemptible. This covers user
> +	 * mode as well because preempt_count() is obviously 0 there.
> +	 *
> +	 * The check for rcu_preempt_depth() is also required because
> +	 * voluntary scheduling inside a rcu read locked section is not
> +	 * allowed.
> +	 *
> +	 * The idle task is already covered by this because idle always
> +	 * has a preempt count > 0.
> +	 *
> +	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> +	 * coming from user mode as there is no indication whether the
> +	 * context which triggered the page fault could schedule or not.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPTION))
> +		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> +	else
> +		can_schedule = usermode;
>  
> +	/*
> +	 * If the kernel context is allowed to schedule then RCU is
> +	 * watching because no preemptible code in the kernel is inside RCU
> +	 * idle state. So it can be treated like user mode. User mode is
> +	 * safe because the #PF entry invoked enter_from_user_mode().
> +	 *
> +	 * For the non schedulable case invoke rcu_irq_enter() for
> +	 * now. This will be moved out to the pagefault entry code later
> +	 * and only invoked when really needed.
> +	 */
> +	if (can_schedule) {
> +		kvm_async_pf_task_wait_schedule(token);
> +	} else {
>  		rcu_irq_enter();
> +		kvm_async_pf_task_wait_halt(token);
> +		rcu_irq_exit();
>  	}
> -	if (!n.halted)
> -		finish_swait(&n.wq, &wait);
> -
> -	rcu_irq_exit();
> -	return;
>  }
> -EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
>  
>  static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>  {
>  	hlist_del_init(&n->link);
> -	if (n->halted)
> +	if (n->use_halt)
>  		smp_send_reschedule(n->cpu);
>  	else if (swq_has_sleeper(&n->wq))
>  		swake_up_one(&n->wq);
> @@ -177,12 +247,13 @@ static void apf_task_wake_all(void)
>  	int i;
>  
>  	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
> -		struct hlist_node *p, *next;
>  		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
> +		struct kvm_task_sleep_node *n;
> +		struct hlist_node *p, *next;
> +
>  		raw_spin_lock(&b->lock);
>  		hlist_for_each_safe(p, next, &b->list) {
> -			struct kvm_task_sleep_node *n =
> -				hlist_entry(p, typeof(*n), link);
> +			n = hlist_entry(p, typeof(*n), link);
>  			if (n->cpu == smp_processor_id())
>  				apf_task_wake_one(n);
>  		}
> @@ -223,8 +294,9 @@ void kvm_async_pf_task_wake(u32 token)
>  		n->cpu = smp_processor_id();
>  		init_swait_queue_head(&n->wq);
>  		hlist_add_head(&n->link, &b->list);
> -	} else
> +	} else {
>  		apf_task_wake_one(n);
> +	}
>  	raw_spin_unlock(&b->lock);
>  	return;
>  }
> @@ -246,23 +318,33 @@ NOKPROBE_SYMBOL(kvm_read_and_reset_pf_re
>  
>  bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>  {
> -	/*
> -	 * If we get a page fault right here, the pf_reason seems likely
> -	 * to be clobbered.  Bummer.
> -	 */
> -	switch (kvm_read_and_reset_pf_reason()) {
> +	u32 reason = kvm_read_and_reset_pf_reason();
> +
> +	switch (reason) {
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +	case KVM_PV_REASON_PAGE_READY:
> +		break;
>  	default:
>  		return false;
> -	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +	}
> +
> +	/*
> +	 * If the host managed to inject an async #PF into an interrupt
> +	 * disabled region, then die hard as this is not going to end well
> +	 * and the host side is seriously broken.
> +	 */
> +	if (unlikely(!(regs->flags & X86_EFLAGS_IF)))
> +		panic("Host injected async #PF in interrupt disabled region\n");
> +
> +	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, !user_mode(regs));
> -		return true;
> -	case KVM_PV_REASON_PAGE_READY:
> +		kvm_async_pf_task_wait(token, user_mode(regs));
> +	} else {
>  		rcu_irq_enter();
>  		kvm_async_pf_task_wake(token);
>  		rcu_irq_exit();
> -		return true;
>  	}
> +	return true;
>  }
>  NOKPROBE_SYMBOL(__kvm_handle_async_pf);
>  
> @@ -326,12 +408,12 @@ static void kvm_guest_cpu_init(void)
>  
>  		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
>  		__this_cpu_write(apf_reason.enabled, 1);
> -		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> -		       smp_processor_id());
> +		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
>  	}
>  
>  	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
>  		unsigned long pa;
> +
>  		/* Size alignment is implied but just to make it explicit. */
>  		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
>  		__this_cpu_write(kvm_apic_eoi, 0);
> @@ -352,8 +434,7 @@ static void kvm_pv_disable_apf(void)
>  	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
>  	__this_cpu_write(apf_reason.enabled, 0);
>  
> -	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
> -	       smp_processor_id());
> +	pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
>  }
>  
>  static void kvm_pv_guest_cpu_reboot(void *unused)
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4198,7 +4198,7 @@ int kvm_handle_page_fault(struct kvm_vcp
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address, 0);
> +		kvm_async_pf_task_wait_schedule(fault_address);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space
  2020-05-05 13:16 ` [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space Thomas Gleixner
@ 2020-05-06  7:00   ` Paolo Bonzini
  2020-05-06 15:29   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 178+ messages in thread
From: Paolo Bonzini @ 2020-05-06  7:00 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:16, Thomas Gleixner wrote:
> The async page fault injection into kernel space creates more problems than
> it solves. The host has absolutely no knowledge about the state of the
> guest if the fault happens in CPL0. The only restriction for the host is
> interrupt disabled state. If interrupts are enabled in the guest then the
> exception can hit arbitrary code. The HALT based wait in non-preemptible
> code is a hacky replacement for a proper hypercall.
> 
> For the ongoing work to restrict instrumentation and make the RCU idle
> interaction well defined, the required extra work for supporting async
> pagefault in CPL0 is just not justified and creates complexity for a
> dubious benefit.
> 
> The CPL3 injection is well defined and does not cause any issues as it is
> more or less the same as a regular page fault from CPL3.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/kvm.c |  100 +++-----------------------------------------------
>  1 file changed, 7 insertions(+), 93 deletions(-)
> 
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,6 @@ struct kvm_task_sleep_node {
>  	struct swait_queue_head wq;
>  	u32 token;
>  	int cpu;
> -	bool use_halt;
>  };
>  
>  static struct kvm_task_sleep_head {
> @@ -98,8 +97,7 @@ static struct kvm_task_sleep_node *_find
>  	return NULL;
>  }
>  
> -static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> -				    struct kvm_task_sleep_node *n)
> +static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -117,7 +115,6 @@ static bool kvm_async_pf_queue_task(u32
>  
>  	n->token = token;
>  	n->cpu = smp_processor_id();
> -	n->use_halt = use_halt;
>  	init_swait_queue_head(&n->wq);
>  	hlist_add_head(&n->link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -138,7 +135,7 @@ void kvm_async_pf_task_wait_schedule(u32
>  
>  	lockdep_assert_irqs_disabled();
>  
> -	if (!kvm_async_pf_queue_task(token, false, &n))
> +	if (!kvm_async_pf_queue_task(token, &n))
>  		return;
>  
>  	for (;;) {
> @@ -154,91 +151,10 @@ void kvm_async_pf_task_wait_schedule(u32
>  }
>  EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>  
> -/*
> - * Invoked from the async page fault handler.
> - */
> -static void kvm_async_pf_task_wait_halt(u32 token)
> -{
> -	struct kvm_task_sleep_node n;
> -
> -	if (!kvm_async_pf_queue_task(token, true, &n))
> -		return;
> -
> -	for (;;) {
> -		if (hlist_unhashed(&n.link))
> -			break;
> -		/*
> -		 * No point in doing anything about RCU here. Any RCU read
> -		 * side critical section or RCU watching section can be
> -		 * interrupted by VMEXITs and the host is free to keep the
> -		 * vCPU scheduled out as long as it sees fit. This is not
> -		 * any different just because of the halt induced voluntary
> -		 * VMEXIT.
> -		 *
> -		 * Also the async page fault could have interrupted any RCU
> -		 * watching context, so invoking rcu_irq_exit()/enter()
> -		 * around this is not gaining anything.
> -		 */
> -		native_safe_halt();
> -		local_irq_disable();
> -	}
> -}
> -
> -/* Invoked from the async page fault handler */
> -static void kvm_async_pf_task_wait(u32 token, bool usermode)
> -{
> -	bool can_schedule;
> -
> -	/*
> -	 * No need to check whether interrupts were disabled because the
> -	 * host will (hopefully) only inject an async page fault into
> -	 * interrupt enabled regions.
> -	 *
> -	 * If CONFIG_PREEMPTION is enabled then check whether the code
> -	 * which triggered the page fault is preemptible. This covers user
> -	 * mode as well because preempt_count() is obviously 0 there.
> -	 *
> -	 * The check for rcu_preempt_depth() is also required because
> -	 * voluntary scheduling inside a rcu read locked section is not
> -	 * allowed.
> -	 *
> -	 * The idle task is already covered by this because idle always
> -	 * has a preempt count > 0.
> -	 *
> -	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> -	 * coming from user mode as there is no indication whether the
> -	 * context which triggered the page fault could schedule or not.
> -	 */
> -	if (IS_ENABLED(CONFIG_PREEMPTION))
> -		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> -	else
> -		can_schedule = usermode;
> -
> -	/*
> -	 * If the kernel context is allowed to schedule then RCU is
> -	 * watching because no preemptible code in the kernel is inside RCU
> -	 * idle state. So it can be treated like user mode. User mode is
> -	 * safe because the #PF entry invoked enter_from_user_mode().
> -	 *
> -	 * For the non schedulable case invoke rcu_irq_enter() for
> -	 * now. This will be moved out to the pagefault entry code later
> -	 * and only invoked when really needed.
> -	 */
> -	if (can_schedule) {
> -		kvm_async_pf_task_wait_schedule(token);
> -	} else {
> -		rcu_irq_enter();
> -		kvm_async_pf_task_wait_halt(token);
> -		rcu_irq_exit();
> -	}
> -}
> -
>  static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>  {
>  	hlist_del_init(&n->link);
> -	if (n->use_halt)
> -		smp_send_reschedule(n->cpu);
> -	else if (swq_has_sleeper(&n->wq))
> +	if (swq_has_sleeper(&n->wq))
>  		swake_up_one(&n->wq);
>  }
>  
> @@ -337,8 +253,10 @@ bool __kvm_handle_async_pf(struct pt_reg
>  		panic("Host injected async #PF in interrupt disabled region\n");
>  
>  	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
> -		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, user_mode(regs));
> +		if (unlikely(!(user_mode(regs))))
> +			panic("Host injected async #PF in kernel mode\n");
> +		/* Page is swapped out by the host. */
> +		kvm_async_pf_task_wait_schedule(token);
>  	} else {
>  		rcu_irq_enter();
>  		kvm_async_pf_task_wake(token);
> @@ -397,10 +315,6 @@ static void kvm_guest_cpu_init(void)
>  		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
>  
>  		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
> -
> -#ifdef CONFIG_PREEMPTION
> -		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> -#endif
>  		pa |= KVM_ASYNC_PF_ENABLED;
>  
>  		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault()
  2020-05-05 13:16 ` [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
@ 2020-05-06  7:00   ` Paolo Bonzini
  2020-05-06 14:05   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Andy Lutomirski
  2 siblings, 0 replies; 178+ messages in thread
From: Paolo Bonzini @ 2020-05-06  7:00 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On 05/05/20 15:16, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> KVM overloads #PF to indicate two types of not-actually-page-fault
> events.  Right now, the KVM guest code intercepts them by modifying
> the IDT and hooking the #PF vector.  This makes the already fragile
> fault code even harder to understand, and it also pollutes call
> traces with async_page_fault and do_async_page_fault for normal page
> faults.
> 
> Clean it up by moving the logic into do_page_fault() using a static
> branch.  This gets rid of the platform trap_init override mechanism
> completely.
> 
> [ tglx: Fixed up 32bit, removed error code from the async functions and
>   	massaged coding style ]
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/entry/entry_32.S       |    8 --------
>  arch/x86/entry/entry_64.S       |    4 ----
>  arch/x86/include/asm/kvm_para.h |   19 +++++++++++++++++--
>  arch/x86/include/asm/x86_init.h |    2 --
>  arch/x86/kernel/kvm.c           |   39 +++++++++++++++++++++------------------
>  arch/x86/kernel/traps.c         |    2 --
>  arch/x86/kernel/x86_init.c      |    1 -
>  arch/x86/mm/fault.c             |   19 +++++++++++++++++++
>  8 files changed, 57 insertions(+), 37 deletions(-)
> 
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -1693,14 +1693,6 @@ SYM_CODE_START(general_protection)
>  	jmp	common_exception
>  SYM_CODE_END(general_protection)
>  
> -#ifdef CONFIG_KVM_GUEST
> -SYM_CODE_START(async_page_fault)
> -	ASM_CLAC
> -	pushl	$do_async_page_fault
> -	jmp	common_exception_read_cr2
> -SYM_CODE_END(async_page_fault)
> -#endif
> -
>  SYM_CODE_START(rewind_stack_do_exit)
>  	/* Prevent any naive code from trying to unwind to our caller. */
>  	xorl	%ebp, %ebp
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1204,10 +1204,6 @@ idtentry xendebug		do_debug		has_error_c
>  idtentry general_protection	do_general_protection	has_error_code=1
>  idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1
>  
> -#ifdef CONFIG_KVM_GUEST
> -idtentry async_page_fault	do_async_page_fault	has_error_code=1	read_cr2=1
> -#endif
> -
>  #ifdef CONFIG_X86_MCE
>  idtentry machine_check		do_mce			has_error_code=0	paranoid=1
>  #endif
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -91,8 +91,18 @@ unsigned int kvm_arch_para_hints(void);
>  void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
> -extern void kvm_disable_steal_time(void);
> -void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
> +void kvm_disable_steal_time(void);
> +bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
> +
> +DECLARE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +
> +static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
> +{
> +	if (static_branch_unlikely(&kvm_async_pf_enabled))
> +		return __kvm_handle_async_pf(regs, token);
> +	else
> +		return false;
> +}
>  
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  void __init kvm_spinlock_init(void);
> @@ -130,6 +140,11 @@ static inline void kvm_disable_steal_tim
>  {
>  	return;
>  }
> +
> +static inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
> +{
> +	return false;
> +}
>  #endif
>  
>  #endif /* _ASM_X86_KVM_PARA_H */
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -50,14 +50,12 @@ struct x86_init_resources {
>   * @pre_vector_init:		init code to run before interrupt vectors
>   *				are set up.
>   * @intr_init:			interrupt init code
> - * @trap_init:			platform specific trap setup
>   * @intr_mode_select:		interrupt delivery mode selection
>   * @intr_mode_init:		interrupt delivery mode setup
>   */
>  struct x86_init_irqs {
>  	void (*pre_vector_init)(void);
>  	void (*intr_init)(void);
> -	void (*trap_init)(void);
>  	void (*intr_mode_select)(void);
>  	void (*intr_mode_init)(void);
>  };
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -35,6 +35,8 @@
>  #include <asm/tlb.h>
>  #include <asm/cpuidle_haltpoll.h>
>  
> +DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +
>  static int kvmapf = 1;
>  
>  static int __init parse_no_kvmapf(char *arg)
> @@ -242,25 +244,27 @@ u32 kvm_read_and_reset_pf_reason(void)
>  EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
>  NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
>  
> -dotraplinkage void
> -do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
> +bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>  {
> +	/*
> +	 * If we get a page fault right here, the pf_reason seems likely
> +	 * to be clobbered.  Bummer.
> +	 */
>  	switch (kvm_read_and_reset_pf_reason()) {
>  	default:
> -		do_page_fault(regs, error_code, address);
> -		break;
> +		return false;
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
> -		break;
> +		kvm_async_pf_task_wait(token, !user_mode(regs));
> +		return true;
>  	case KVM_PV_REASON_PAGE_READY:
>  		rcu_irq_enter();
> -		kvm_async_pf_task_wake((u32)address);
> +		kvm_async_pf_task_wake(token);
>  		rcu_irq_exit();
> -		break;
> +		return true;
>  	}
>  }
> -NOKPROBE_SYMBOL(do_async_page_fault);
> +NOKPROBE_SYMBOL(__kvm_handle_async_pf);
>  
>  static void __init paravirt_ops_setup(void)
>  {
> @@ -306,7 +310,11 @@ static notrace void kvm_guest_apic_eoi_w
>  static void kvm_guest_cpu_init(void)
>  {
>  	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
> -		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
> +		u64 pa;
> +
> +		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
> +
> +		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
>  
>  #ifdef CONFIG_PREEMPTION
>  		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> @@ -592,12 +600,6 @@ static int kvm_cpu_down_prepare(unsigned
>  }
>  #endif
>  
> -static void __init kvm_apf_trap_init(void)
> -{
> -	update_intr_gate(X86_TRAP_PF, async_page_fault);
> -}
> -
> -
>  static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>  			const struct flush_tlb_info *info)
>  {
> @@ -632,8 +634,6 @@ static void __init kvm_guest_init(void)
>  	register_reboot_notifier(&kvm_pv_reboot_nb);
>  	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
>  		raw_spin_lock_init(&async_pf_sleepers[i].lock);
> -	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
> -		x86_init.irqs.trap_init = kvm_apf_trap_init;
>  
>  	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
>  		has_steal_clock = 1;
> @@ -649,6 +649,9 @@ static void __init kvm_guest_init(void)
>  	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
>  		apic_set_eoi_write(kvm_guest_apic_eoi_write);
>  
> +	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf)
> +		static_branch_enable(&kvm_async_pf_enabled);
> +
>  #ifdef CONFIG_SMP
>  	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
>  	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -988,7 +988,5 @@ void __init trap_init(void)
>  
>  	idt_setup_ist_traps();
>  
> -	x86_init.irqs.trap_init();
> -
>  	idt_setup_debugidt_traps();
>  }
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -79,7 +79,6 @@ struct x86_init_ops x86_init __initdata
>  	.irqs = {
>  		.pre_vector_init	= init_ISA_irqs,
>  		.intr_init		= native_init_IRQ,
> -		.trap_init		= x86_init_noop,
>  		.intr_mode_select	= apic_intr_mode_select,
>  		.intr_mode_init		= apic_intr_mode_init
>  	},
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -30,6 +30,7 @@
>  #include <asm/desc.h>			/* store_idt(), ...		*/
>  #include <asm/cpu_entry_area.h>		/* exception stack		*/
>  #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
> +#include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
>  
>  #define CREATE_TRACE_POINTS
>  #include <asm/trace/exceptions.h>
> @@ -1523,6 +1524,24 @@ do_page_fault(struct pt_regs *regs, unsi
>  		unsigned long address)
>  {
>  	prefetchw(&current->mm->mmap_sem);
> +	/*
> +	 * KVM has two types of events that are, logically, interrupts, but
> +	 * are unfortunately delivered using the #PF vector.  These events are
> +	 * "you just accessed valid memory, but the host doesn't have it right
> +	 * now, so I'll put you to sleep if you continue" and "that memory
> +	 * you tried to access earlier is available now."
> +	 *
> +	 * We are relying on the interrupted context being sane (valid RSP,
> +	 * relevant locks not held, etc.), which is fine as long as the
> +	 * interrupted context had IF=1.  We are also relying on the KVM
> +	 * async pf type field and CR2 being read consistently instead of
> +	 * getting values from real and async page faults mixed up.
> +	 *
> +	 * Fingers crossed.
> +	 */
> +	if (kvm_handle_async_pf(regs, (u32)address))
> +		return;
> +
>  	trace_page_fault_entries(regs, hw_error_code, address);
>  
>  	if (unlikely(kmmio_fault(regs, address)))
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area
  2020-05-05 13:16 ` [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area Thomas Gleixner
@ 2020-05-06  8:14   ` Borislav Petkov
  2020-05-06 12:11   ` Alexandre Chartre
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 178+ messages in thread
From: Borislav Petkov @ 2020-05-06  8:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:04PM +0200, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> A data breakpoint near the top of an IST stack will cause unresoverable

unrecoverable

> recursion.  A data breakpoint on the GDT, IDT, or TSS is terrifying.

"terrifying" huh? Colorful. :)

> Prevent either of these from happening.
> 
> Co-developed-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/hw_breakpoint.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread
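The prevention described in the changelog boils down to a range-overlap test against the protected areas (cpu_entry_area, GDT, IDT, TSS). A generic sketch of such a check, not the actual hunk from the patch (which is not quoted in this reply):

    /* True if the breakpoint range [addr, end) intersects [area_start, area_end). */
    static inline bool bp_overlaps_protected_area(unsigned long addr, unsigned long end,
    					      unsigned long area_start, unsigned long area_end)
    {
    	return addr < area_end && end > area_start;
    }

Per the diffstat, the patch wires an equivalent check into arch/x86/kernel/hw_breakpoint.c so that such data breakpoints are rejected at setup time.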

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-05 13:16 ` [patch V4 part 1 03/36] sched: Clean up scheduler_ipi() Thomas Gleixner
@ 2020-05-06  8:32   ` Thomas Gleixner
  2020-05-06  8:40   ` Borislav Petkov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-06  8:32 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

Thomas Gleixner <tglx@linutronix.de> writes:

Bah. I managed to lose the

  From: Peterz

line somehow.


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-05 13:16 ` [patch V4 part 1 03/36] sched: Clean up scheduler_ipi() Thomas Gleixner
  2020-05-06  8:32   ` Thomas Gleixner
@ 2020-05-06  8:40   ` Borislav Petkov
  2020-05-06  9:12     ` Thomas Gleixner
  2020-05-06 12:37   ` Alexandre Chartre
  2020-05-12 15:13   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra (Intel)
  3 siblings, 1 reply; 178+ messages in thread
From: Borislav Petkov @ 2020-05-06  8:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:05PM +0200, Thomas Gleixner wrote:
> The scheduler IPI has grown weird and wonderful over the years, time
> for spring cleaning.
> 
> Move all the non-trivial stuff out of it and into a regular smp function
> call IPI. This then reduces the scheduler_ipi() to most of its former MOP

s/MOP/NOP/ after clarifying it on IRC. :)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06  8:40   ` Borislav Petkov
@ 2020-05-06  9:12     ` Thomas Gleixner
  2020-05-06 10:02       ` Borislav Petkov
  0 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-06  9:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

Borislav Petkov <bp@alien8.de> writes:

> On Tue, May 05, 2020 at 03:16:05PM +0200, Thomas Gleixner wrote:
>> The scheduler IPI has grown weird and wonderful over the years, time
>> for spring cleaning.
>> 
>> Move all the non-trivial stuff out of it and into a regular smp function
>> call IPI. This then reduces the scheduler_ipi() to most of its former MOP
>
> s/MOP/NOP/ after clarifying it on IRC. :)

It was NOP long ago. Now it's Minimal OPeration :)

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06  9:12     ` Thomas Gleixner
@ 2020-05-06 10:02       ` Borislav Petkov
  0 siblings, 0 replies; 178+ messages in thread
From: Borislav Petkov @ 2020-05-06 10:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Wed, May 06, 2020 at 11:12:35AM +0200, Thomas Gleixner wrote:
> It was NOP long ago. Now it's Minimal OPeration :)

LOL, a new instruction: MOP. I can think of a couple of x86 instructions
which can be MOPs.

:-P

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
@ 2020-05-06 11:53   ` Miroslav Benes
  2020-05-06 12:06     ` Thomas Gleixner
  2020-05-06 15:35     ` Peter Zijlstra
  2020-05-06 13:06   ` Alexandre Chartre
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 178+ messages in thread
From: Miroslav Benes @ 2020-05-06 11:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, 5 May 2020, Thomas Gleixner wrote:

> Make sure task_work runs before any kind of userspace -- very much
> including signals -- is invoked.

I might be missing something, but isn't this guaranteed by 
do_signal()->get_signal()->task_work_run() path?

Miroslav

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-06 11:53   ` Miroslav Benes
@ 2020-05-06 12:06     ` Thomas Gleixner
  2020-05-06 15:35     ` Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-06 12:06 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

Miroslav Benes <mbenes@suse.cz> writes:

> On Tue, 5 May 2020, Thomas Gleixner wrote:
>
>> Make sure task_work runs before any kind of userspace -- very much
>> including signals -- is invoked.
>
> I might be missing something, but isn't this guaranteed by 
> do_signal()->get_signal()->task_work_run() path?

The changelog is misleading. This is not about task_work, it's about
TIF_NOTIFY_RESUME. Will fix.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 178+ messages in thread
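
To make the resolved point easy to see without digging out the diff, here is
the ordering the patch establishes in the exit-to-usermode loop, condensed to
the two flags that matter. This is a readability sketch only (the helper name
is made up); the real hunk in arch/x86/entry/common.c appears further down the
thread.

/*
 * Sketch of the flag ordering after the patch: TIF_NOTIFY_RESUME work
 * (tracehook_notify_resume(), rseq fixup) is handled before signal
 * delivery. All other exit work is omitted here.
 */
static void exit_work_order_sketch(struct pt_regs *regs, u32 cached_flags)
{
	if (cached_flags & _TIF_NOTIFY_RESUME) {
		clear_thread_flag(TIF_NOTIFY_RESUME);
		tracehook_notify_resume(regs);
		rseq_handle_notify_resume(NULL, regs);
	}

	/* Only afterwards deal with pending signal delivery. */
	if (cached_flags & _TIF_SIGPENDING)
		do_signal(regs);
}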

* Re: [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area
  2020-05-05 13:16 ` [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area Thomas Gleixner
  2020-05-06  8:14   ` Borislav Petkov
@ 2020-05-06 12:11   ` Alexandre Chartre
  2020-05-09  9:00   ` Lai Jiangshan
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 12:11 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> A data breakpoint near the top of an IST stack will cause unresoverable

typo: unresoverable -> unrecoverable

> recursion.  A data breakpoint on the GDT, IDT, or TSS is terrifying.
> Prevent either of these from happening.
> 
> Co-developed-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/kernel/hw_breakpoint.c |   25 +++++++++++++++++++++++++
>   1 file changed, 25 insertions(+)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -227,10 +227,35 @@ int arch_check_bp_in_kernelspace(struct
>   	return (va >= TASK_SIZE_MAX) || ((va + len - 1) >= TASK_SIZE_MAX);
>   }
>   
> +/*
> + * Checks whether the range from addr to end, inclusive, overlaps the CPU
> + * entry area range.
> + */
> +static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
> +{
> +	return end >= CPU_ENTRY_AREA_PER_CPU &&
> +	       addr < (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOTAL_SIZE);
> +}
> +
>   static int arch_build_bp_info(struct perf_event *bp,
>   			      const struct perf_event_attr *attr,
>   			      struct arch_hw_breakpoint *hw)
>   {
> +	unsigned long bp_end;
> +
> +	bp_end = attr->bp_addr + attr->bp_len - 1;
> +	if (bp_end < attr->bp_addr)
> +		return -EINVAL;
> +
> +	/*
> +	 * Prevent any breakpoint of any type that overlaps the
> +	 * cpu_entry_area.  This protects the IST stacks and also
> +	 * reduces the chance that we ever find out what happens if
> +	 * there's a data breakpoint on the GDT, IDT, or TSS.
> +	 */
> +	if (within_cpu_entry_area(attr->bp_addr, bp_end))
> +		return -EINVAL;
> +
>   	hw->address = attr->bp_addr;
>   	hw->mask = 0;
>   
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread
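
The wrap-around and overlap arithmetic in the hunk above is simple enough to
check in isolation. A small user-space sketch, with made-up base and size
constants standing in for the real cpu_entry_area layout:

#include <stdbool.h>
#include <stdio.h>

/* Made-up stand-ins for the real cpu_entry_area base and size. */
#define CPU_ENTRY_AREA_PER_CPU		0xc0001000UL
#define CPU_ENTRY_AREA_TOTAL_SIZE	0x23000UL

/* Same predicate as within_cpu_entry_area() in the patch: the inclusive
 * range [addr, end] overlaps the area iff it ends at or after the area
 * start and begins before the area end. */
static bool within_cpu_entry_area(unsigned long addr, unsigned long end)
{
	return end >= CPU_ENTRY_AREA_PER_CPU &&
	       addr < (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOTAL_SIZE);
}

int main(void)
{
	unsigned long bp_addr = CPU_ENTRY_AREA_PER_CPU + 0x100;
	unsigned long bp_len  = 8;
	unsigned long bp_end  = bp_addr + bp_len - 1;

	/* Reject wrap-around first, then the overlap, as arch_build_bp_info() does. */
	if (bp_end < bp_addr || within_cpu_entry_area(bp_addr, bp_end))
		puts("breakpoint rejected (-EINVAL)");
	else
		puts("breakpoint allowed");
	return 0;
}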

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-05 13:16 ` [patch V4 part 1 03/36] sched: Clean up scheduler_ipi() Thomas Gleixner
  2020-05-06  8:32   ` Thomas Gleixner
  2020-05-06  8:40   ` Borislav Petkov
@ 2020-05-06 12:37   ` Alexandre Chartre
  2020-05-06 15:03     ` Thomas Gleixner
  2020-05-06 15:33     ` Peter Zijlstra
  2020-05-12 15:13   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra (Intel)
  3 siblings, 2 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 12:37 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> The scheduler IPI has grown weird and wonderful over the years, time
> for spring cleaning.
> 
> Move all the non-trivial stuff out of it and into a regular smp function
> call IPI. This then reduces the scheduler_ipi() to most of its former MOP
> glory and ensures to keep the interrupt vector lean and mean.
> 
> Aside of that avoiding the full irq_enter() in the x86 IPI implementation
> is incorrect as scheduler_ipi() can be instrumented. To work around that
> scheduler_ipi() had an irq_enter/exit() hack when heavy work was
> pending. This is gone now.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/sched/core.c  |   64 +++++++++++++++++++++++----------------------------
>   kernel/sched/fair.c  |    5 +--
>   kernel/sched/sched.h |    6 +++-
>   3 files changed, 36 insertions(+), 39 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -219,6 +219,13 @@ void update_rq_clock(struct rq *rq)
>   	update_rq_clock_task(rq, delta);
>   }
>   
> +static inline void
> +rq_csd_init(struct rq *rq, call_single_data_t *csd, smp_call_func_t func)
> +{
> +	csd->flags = 0;
> +	csd->func = func;
> +	csd->info = rq;
> +}
>   
>   #ifdef CONFIG_SCHED_HRTICK
>   /*
> @@ -314,16 +321,14 @@ void hrtick_start(struct rq *rq, u64 del
>   	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
>   		      HRTIMER_MODE_REL_PINNED_HARD);
>   }
> +
>   #endif /* CONFIG_SMP */
>   
>   static void hrtick_rq_init(struct rq *rq)
>   {
>   #ifdef CONFIG_SMP
> -	rq->hrtick_csd.flags = 0;
> -	rq->hrtick_csd.func = __hrtick_start;
> -	rq->hrtick_csd.info = rq;
> +	rq_csd_init(rq, &rq->hrtick_csd, __hrtick_start);
>   #endif
> -
>   	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
>   	rq->hrtick_timer.function = hrtick;
>   }
> @@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(vo
>   	return false;
>   }
>   
> +static void nohz_csd_func(void *info)
> +{
> +	struct rq *rq = info;
> +
> +	if (got_nohz_idle_kick()) {
> +		rq->idle_balance = 1;
> +		raise_softirq_irqoff(SCHED_SOFTIRQ);
> +	}
> +}
> +
>   #else /* CONFIG_NO_HZ_COMMON */
>   
>   static inline bool got_nohz_idle_kick(void)
> @@ -2292,6 +2307,11 @@ void sched_ttwu_pending(void)
>   	rq_unlock_irqrestore(rq, &rf);
>   }
>   
> +static void wake_csd_func(void *info)
> +{
> +	sched_ttwu_pending();
> +}
> +
>   void scheduler_ipi(void)
>   {
>   	/*
> @@ -2300,34 +2320,6 @@ void scheduler_ipi(void)
>   	 * this IPI.
>   	 */
>   	preempt_fold_need_resched();
> -
> -	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> -		return;
> -
> -	/*
> -	 * Not all reschedule IPI handlers call irq_enter/irq_exit, since
> -	 * traditionally all their work was done from the interrupt return
> -	 * path. Now that we actually do some work, we need to make sure
> -	 * we do call them.
> -	 *
> -	 * Some archs already do call them, luckily irq_enter/exit nest
> -	 * properly.
> -	 *
> -	 * Arguably we should visit all archs and update all handlers,
> -	 * however a fair share of IPIs are still resched only so this would
> -	 * somewhat pessimize the simple resched case.
> -	 */
> -	irq_enter();
> -	sched_ttwu_pending();
> -
> -	/*
> -	 * Check if someone kicked us for doing the nohz idle load balance.
> -	 */
> -	if (unlikely(got_nohz_idle_kick())) {
> -		this_rq()->idle_balance = 1;
> -		raise_softirq_irqoff(SCHED_SOFTIRQ);
> -	}
> -	irq_exit();
>   }
>   
>   static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> @@ -2336,9 +2328,9 @@ static void ttwu_queue_remote(struct tas
>   
>   	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
>   
> -	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> +	if (llist_add(&p->wake_entry, &rq->wake_list)) {
>   		if (!set_nr_if_polling(rq->idle))
> -			smp_send_reschedule(cpu);
> +			smp_call_function_single_async(cpu, &rq->wake_csd);
>   		else
>   			trace_sched_wake_idle_without_ipi(cpu);
>   	}
> @@ -6685,12 +6677,16 @@ void __init sched_init(void)
>   		rq->avg_idle = 2*sysctl_sched_migration_cost;
>   		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>   
> +		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> +
>   		INIT_LIST_HEAD(&rq->cfs_tasks);
>   
>   		rq_attach_root(rq, &def_root_domain);
>   #ifdef CONFIG_NO_HZ_COMMON
>   		rq->last_blocked_load_update_tick = jiffies;
>   		atomic_set(&rq->nohz_flags, 0);
> +
> +		rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
>   #endif
>   #endif /* CONFIG_SMP */
>   		hrtick_rq_init(rq);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10009,12 +10009,11 @@ static void kick_ilb(unsigned int flags)
>   		return;
>   
>   	/*
> -	 * Use smp_send_reschedule() instead of resched_cpu().
> -	 * This way we generate a sched IPI on the target CPU which
> +	 * This way we generate an IPI on the target CPU which
>   	 * is idle. And the softirq performing nohz idle load balance
>   	 * will be run before returning from the IPI.
>   	 */
> -	smp_send_reschedule(ilb_cpu);
> +	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->wake_csd);

This should be nohz_csd instead of wake_csd, no? I.e.:

        smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);


With that:

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.


>   }
>   
>   /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -890,9 +890,10 @@ struct rq {
>   #ifdef CONFIG_SMP
>   	unsigned long		last_blocked_load_update_tick;
>   	unsigned int		has_blocked_load;
> +	call_single_data_t	nohz_csd;
>   #endif /* CONFIG_SMP */
>   	unsigned int		nohz_tick_stopped;
> -	atomic_t nohz_flags;
> +	atomic_t		nohz_flags;
>   #endif /* CONFIG_NO_HZ_COMMON */
>   
>   	unsigned long		nr_load_updates;
> @@ -979,7 +980,7 @@ struct rq {
>   
>   	/* This is used to determine avg_idle's max value */
>   	u64			max_idle_balance_cost;
> -#endif
> +#endif /* CONFIG_SMP */
>   
>   #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>   	u64			prev_irq_time;
> @@ -1021,6 +1022,7 @@ struct rq {
>   #endif
>   
>   #ifdef CONFIG_SMP
> +	call_single_data_t	wake_csd;
>   	struct llist_head	wake_list;
>   #endif
>   
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 04/36] sched: Make scheduler_ipi inline
  2020-05-05 13:16 ` [patch V4 part 1 04/36] sched: Make scheduler_ipi inline Thomas Gleixner
@ 2020-05-06 12:42   ` Alexandre Chartre
  2020-05-12 15:13   ` [tip: sched/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 12:42 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> Now that the scheduler IPI is trivial and simple again there is no point to
> have the little function out of line. This simplifies the effort of
> constraining the instrumentation nicely.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/sched.h |   10 +++++++++-
>   kernel/sched/core.c   |   10 ----------
>   2 files changed, 9 insertions(+), 11 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1719,7 +1719,15 @@ extern char *__get_task_comm(char *to, s
>   })
>   
>   #ifdef CONFIG_SMP
> -void scheduler_ipi(void);
> +static __always_inline void scheduler_ipi(void)
> +{
> +	/*
> +	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> +	 * TIF_NEED_RESCHED remotely (for the first time) will also send
> +	 * this IPI.
> +	 */
> +	preempt_fold_need_resched();
> +}
>   extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
>   #else
>   static inline void scheduler_ipi(void) { }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2312,16 +2312,6 @@ static void wake_csd_func(void *info)
>   	sched_ttwu_pending();
>   }
>   
> -void scheduler_ipi(void)
> -{
> -	/*
> -	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> -	 * TIF_NEED_RESCHED remotely (for the first time) will also send
> -	 * this IPI.
> -	 */
> -	preempt_fold_need_resched();
> -}
> -
>   static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>   {
>   	struct rq *rq = cpu_rq(cpu);
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-05-06  7:00   ` Paolo Bonzini
@ 2020-05-06 12:53     ` Steven Rostedt
  0 siblings, 0 replies; 178+ messages in thread
From: Steven Rostedt @ 2020-05-06 12:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On Wed, 6 May 2020 09:00:09 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> >  	case KVM_PV_REASON_PAGE_READY:
> >   
> 
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Please crop your email. It is really annoying to scroll through 16 pages of
quoted text and come to this.

-- Steve

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
  2020-05-06 11:53   ` Miroslav Benes
@ 2020-05-06 13:06   ` Alexandre Chartre
  2020-05-06 16:26   ` Borislav Petkov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 13:06 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> Make sure task_work runs before any kind of userspace -- very much
> including signals -- is invoked.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/common.c |    8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -156,16 +156,16 @@ static void exit_to_usermode_loop(struct
>   		if (cached_flags & _TIF_PATCH_PENDING)
>   			klp_update_patch_state(current);
>   
> -		/* deal with pending signal delivery */
> -		if (cached_flags & _TIF_SIGPENDING)
> -			do_signal(regs);
> -
>   		if (cached_flags & _TIF_NOTIFY_RESUME) {
>   			clear_thread_flag(TIF_NOTIFY_RESUME);
>   			tracehook_notify_resume(regs);
>   			rseq_handle_notify_resume(NULL, regs);
>   		}
>   
> +		/* deal with pending signal delivery */
> +		if (cached_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
>   		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>   			fire_user_return_notifiers();
>   
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations
  2020-05-05 13:16 ` [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations Thomas Gleixner
@ 2020-05-06 13:11   ` Alexandre Chartre
  2020-05-06 13:33   ` Will Deacon
  2020-05-06 16:33   ` Borislav Petkov
  2 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 13:11 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)



On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> READ/WRITE_ONCE_NOCHECK() is required for atomics in code which cannot be
> instrumented like the x86 int3 text poke code. As READ/WRITE_ONCE() is
> undergoing a rewrite, provide __{READ,WRITE}_ONCE_SCALAR().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/compiler.h |    8 ++++++++
>   1 file changed, 8 insertions(+)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -313,6 +313,14 @@ unsigned long read_word_at_a_time(const
>   	__u.__val;					\
>   })
>   
> +#define __READ_ONCE_SCALAR(x)				\
> +	(*(const volatile typeof(x) *)&(x))
> +
> +#define __WRITE_ONCE_SCALAR(x, val)			\
> +do {							\
> +	*(volatile typeof(x) *)&(x) = val;		\
> +} while (0)
> +
>   /**
>    * data_race - mark an expression as containing intentional data races
>    *
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations
  2020-05-05 13:16 ` [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations Thomas Gleixner
  2020-05-06 13:11   ` Alexandre Chartre
@ 2020-05-06 13:33   ` Will Deacon
  2020-05-06 15:36     ` Peter Zijlstra
  2020-05-06 16:33   ` Borislav Petkov
  2 siblings, 1 reply; 178+ messages in thread
From: Will Deacon @ 2020-05-06 13:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:08PM +0200, Thomas Gleixner wrote:
> READ/WRITE_ONCE_NOCHECK() is required for atomics in code which cannot be
> instrumented like the x86 int3 text poke code. As READ/WRITE_ONCE() is
> undergoing a rewrite, provide __{READ,WRITE}_ONCE_SCALAR().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/compiler.h |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -313,6 +313,14 @@ unsigned long read_word_at_a_time(const
>  	__u.__val;					\
>  })
>  
> +#define __READ_ONCE_SCALAR(x)				\
> +	(*(const volatile typeof(x) *)&(x))
> +
> +#define __WRITE_ONCE_SCALAR(x, val)			\
> +do {							\
> +	*(volatile typeof(x) *)&(x) = val;		\
> +} while (0)

FWIW, these end up being called __READ_ONCE() and __WRITE_ONCE() after
the rewrite; the *_SCALAR() variants will call into kcsan_check_atomic_*().

If you go with that naming now, then any later conflict should fall out in
the wash.

Will

^ permalink raw reply	[flat|nested] 178+ messages in thread
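
To make the naming point concrete, here is a stand-alone sketch of how the
split looks once the rewrite lands: the plain double-underscore accessors are
the bare volatile accesses, and a checked variant layers the KCSAN hook on
top. The stub function and the exact spelling of the checked macro are
illustrative here, not the final upstream code.

#include <stddef.h>

/* Stand-in for the real KCSAN hook, so this sketch builds on its own. */
static inline void kcsan_check_atomic_read(const volatile void *ptr, size_t size) { }

/* Bare, never-instrumented accessors: what the rewrite calls __READ_ONCE()
 * and __WRITE_ONCE(), usable from noinstr / text-poke style code. */
#define __READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define __WRITE_ONCE(x, val)	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

/* A checked flavour adds the KCSAN atomic check before the bare access. */
#define READ_ONCE_CHECKED(x)				\
({							\
	kcsan_check_atomic_read(&(x), sizeof(x));	\
	__READ_ONCE(x);					\
})

int demo(int *p)
{
	__WRITE_ONCE(*p, 1);		/* no instrumentation at all */
	return READ_ONCE_CHECKED(*p);	/* bare access plus KCSAN check */
}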

* Re: [patch V4 part 1 08/36] x86/doublefault: Remove memmove() call
  2020-05-05 13:16 ` [patch V4 part 1 08/36] x86/doublefault: Remove memmove() call Thomas Gleixner
@ 2020-05-06 13:47   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 13:47 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Borislav Petkov, Peter Zijlstra (Intel)


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> Use of memmove() in #DF is problematic considering tracing and other
> instrumentation.
> 
> Remove the memmove() call and simply write out what needs doing; this
> even clarifies the code, win-win! The code copies from the espfix64
> stack to the normal task stack, there is no possible way for that to
> overlap.
> 
> Survives selftests/x86, specifically sigreturn_64.
> 
> Suggested-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Acked-by: Andy Lutomirski <luto@kernel.org>
> Link: https://lkml.kernel.org/r/20200220121727.GB507@zn.tnic
> ---
>   arch/x86/kernel/traps.c |    7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -278,6 +278,7 @@ dotraplinkage void do_double_fault(struc
>   		regs->ip == (unsigned long)native_irq_return_iret)
>   	{
>   		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
> +		unsigned long *p = (unsigned long *)regs->sp;
>   
>   		/*
>   		 * regs->sp points to the failing IRET frame on the
> @@ -285,7 +286,11 @@ dotraplinkage void do_double_fault(struc
>   		 * in gpregs->ss through gpregs->ip.
>   		 *
>   		 */
> -		memmove(&gpregs->ip, (void *)regs->sp, 5*8);
> +		gpregs->ip	= p[0];
> +		gpregs->cs	= p[1];
> +		gpregs->flags	= p[2];
> +		gpregs->sp	= p[3];
> +		gpregs->ss	= p[4];
>   		gpregs->orig_ax = 0;  /* Missing (lost) #GP error code */
>   
>   		/*
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 10/36] x86/entry: Remove the unused LOCKDEP_SYSEXIT cruft
  2020-05-05 13:16 ` [patch V4 part 1 10/36] x86/entry: Remove the unused LOCKDEP_SYSEXIT cruft Thomas Gleixner
@ 2020-05-06 13:52   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 13:52 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> No users left for two years due to commit 21d375b6b34f ("x86/entry/64:
> Remove the SYSCALL64 fast path")
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/thunk_64.S       |    5 -----
>   arch/x86/include/asm/irqflags.h |   24 ------------------------
>   2 files changed, 29 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/entry/thunk_64.S
> +++ b/arch/x86/entry/thunk_64.S
> @@ -42,10 +42,6 @@ SYM_FUNC_END(\name)
>   	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
>   #endif
>   
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -	THUNK lockdep_sys_exit_thunk,lockdep_sys_exit
> -#endif
> -
>   #ifdef CONFIG_PREEMPTION
>   	THUNK preempt_schedule_thunk, preempt_schedule
>   	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
> @@ -54,7 +50,6 @@ SYM_FUNC_END(\name)
>   #endif
>   
>   #if defined(CONFIG_TRACE_IRQFLAGS) \
> - || defined(CONFIG_DEBUG_LOCK_ALLOC) \
>    || defined(CONFIG_PREEMPTION)
>   SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
>   	popq %r11
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -180,30 +180,6 @@ static inline int arch_irqs_disabled(voi
>   #  define TRACE_IRQS_ON
>   #  define TRACE_IRQS_OFF
>   #endif
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -#  ifdef CONFIG_X86_64
> -#    define LOCKDEP_SYS_EXIT		call lockdep_sys_exit_thunk
> -#    define LOCKDEP_SYS_EXIT_IRQ \
> -	TRACE_IRQS_ON; \
> -	sti; \
> -	call lockdep_sys_exit_thunk; \
> -	cli; \
> -	TRACE_IRQS_OFF;
> -#  else
> -#    define LOCKDEP_SYS_EXIT \
> -	pushl %eax;				\
> -	pushl %ecx;				\
> -	pushl %edx;				\
> -	call lockdep_sys_exit;			\
> -	popl %edx;				\
> -	popl %ecx;				\
> -	popl %eax;
> -#    define LOCKDEP_SYS_EXIT_IRQ
> -#  endif
> -#else
> -#  define LOCKDEP_SYS_EXIT
> -#  define LOCKDEP_SYS_EXIT_IRQ
> -#endif
>   #endif /* __ASSEMBLY__ */
>   
>   #endif
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault()
  2020-05-05 13:16 ` [patch V4 part 1 11/36] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
  2020-05-06  7:00   ` Paolo Bonzini
@ 2020-05-06 14:05   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Andy Lutomirski
  2 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 14:05 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> KVM overloads #PF to indicate two types of not-actually-page-fault
> events.  Right now, the KVM guest code intercepts them by modifying
> the IDT and hooking the #PF vector.  This makes the already fragile
> fault code even harder to understand, and it also pollutes call
> traces with async_page_fault and do_async_page_fault for normal page
> faults.
> 
> Clean it up by moving the logic into do_page_fault() using a static
> branch.  This gets rid of the platform trap_init override mechanism
> completely.
> 
> [ tglx: Fixed up 32bit, removed error code from the async functions and
>    	massaged coding style ]
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>   arch/x86/entry/entry_32.S       |    8 --------
>   arch/x86/entry/entry_64.S       |    4 ----
>   arch/x86/include/asm/kvm_para.h |   19 +++++++++++++++++--
>   arch/x86/include/asm/x86_init.h |    2 --
>   arch/x86/kernel/kvm.c           |   39 +++++++++++++++++++++------------------
>   arch/x86/kernel/traps.c         |    2 --
>   arch/x86/kernel/x86_init.c      |    1 -
>   arch/x86/mm/fault.c             |   19 +++++++++++++++++++
>   8 files changed, 57 insertions(+), 37 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -1693,14 +1693,6 @@ SYM_CODE_START(general_protection)
>   	jmp	common_exception
>   SYM_CODE_END(general_protection)
>   
> -#ifdef CONFIG_KVM_GUEST
> -SYM_CODE_START(async_page_fault)
> -	ASM_CLAC
> -	pushl	$do_async_page_fault
> -	jmp	common_exception_read_cr2
> -SYM_CODE_END(async_page_fault)
> -#endif
> -
>   SYM_CODE_START(rewind_stack_do_exit)
>   	/* Prevent any naive code from trying to unwind to our caller. */
>   	xorl	%ebp, %ebp
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1204,10 +1204,6 @@ idtentry xendebug		do_debug		has_error_c
>   idtentry general_protection	do_general_protection	has_error_code=1
>   idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1
>   
> -#ifdef CONFIG_KVM_GUEST
> -idtentry async_page_fault	do_async_page_fault	has_error_code=1	read_cr2=1
> -#endif
> -
>   #ifdef CONFIG_X86_MCE
>   idtentry machine_check		do_mce			has_error_code=0	paranoid=1
>   #endif
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -91,8 +91,18 @@ unsigned int kvm_arch_para_hints(void);
>   void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
>   void kvm_async_pf_task_wake(u32 token);
>   u32 kvm_read_and_reset_pf_reason(void);
> -extern void kvm_disable_steal_time(void);
> -void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
> +void kvm_disable_steal_time(void);
> +bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
> +
> +DECLARE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +
> +static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
> +{
> +	if (static_branch_unlikely(&kvm_async_pf_enabled))
> +		return __kvm_handle_async_pf(regs, token);
> +	else
> +		return false;
> +}
>   
>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
>   void __init kvm_spinlock_init(void);
> @@ -130,6 +140,11 @@ static inline void kvm_disable_steal_tim
>   {
>   	return;
>   }
> +
> +static inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
> +{
> +	return false;
> +}
>   #endif
>   
>   #endif /* _ASM_X86_KVM_PARA_H */
> --- a/arch/x86/include/asm/x86_init.h
> +++ b/arch/x86/include/asm/x86_init.h
> @@ -50,14 +50,12 @@ struct x86_init_resources {
>    * @pre_vector_init:		init code to run before interrupt vectors
>    *				are set up.
>    * @intr_init:			interrupt init code
> - * @trap_init:			platform specific trap setup
>    * @intr_mode_select:		interrupt delivery mode selection
>    * @intr_mode_init:		interrupt delivery mode setup
>    */
>   struct x86_init_irqs {
>   	void (*pre_vector_init)(void);
>   	void (*intr_init)(void);
> -	void (*trap_init)(void);
>   	void (*intr_mode_select)(void);
>   	void (*intr_mode_init)(void);
>   };
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -35,6 +35,8 @@
>   #include <asm/tlb.h>
>   #include <asm/cpuidle_haltpoll.h>
>   
> +DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +
>   static int kvmapf = 1;
>   
>   static int __init parse_no_kvmapf(char *arg)
> @@ -242,25 +244,27 @@ u32 kvm_read_and_reset_pf_reason(void)
>   EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
>   NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
>   
> -dotraplinkage void
> -do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
> +bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>   {
> +	/*
> +	 * If we get a page fault right here, the pf_reason seems likely
> +	 * to be clobbered.  Bummer.
> +	 */
>   	switch (kvm_read_and_reset_pf_reason()) {
>   	default:
> -		do_page_fault(regs, error_code, address);
> -		break;
> +		return false;
>   	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>   		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
> -		break;
> +		kvm_async_pf_task_wait(token, !user_mode(regs));
> +		return true;
>   	case KVM_PV_REASON_PAGE_READY:
>   		rcu_irq_enter();
> -		kvm_async_pf_task_wake((u32)address);
> +		kvm_async_pf_task_wake(token);
>   		rcu_irq_exit();
> -		break;
> +		return true;
>   	}
>   }
> -NOKPROBE_SYMBOL(do_async_page_fault);
> +NOKPROBE_SYMBOL(__kvm_handle_async_pf);
>   
>   static void __init paravirt_ops_setup(void)
>   {
> @@ -306,7 +310,11 @@ static notrace void kvm_guest_apic_eoi_w
>   static void kvm_guest_cpu_init(void)
>   {
>   	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
> -		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
> +		u64 pa;
> +
> +		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
> +
> +		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
>   
>   #ifdef CONFIG_PREEMPTION
>   		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> @@ -592,12 +600,6 @@ static int kvm_cpu_down_prepare(unsigned
>   }
>   #endif
>   
> -static void __init kvm_apf_trap_init(void)
> -{
> -	update_intr_gate(X86_TRAP_PF, async_page_fault);
> -}
> -
> -
>   static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>   			const struct flush_tlb_info *info)
>   {
> @@ -632,8 +634,6 @@ static void __init kvm_guest_init(void)
>   	register_reboot_notifier(&kvm_pv_reboot_nb);
>   	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
>   		raw_spin_lock_init(&async_pf_sleepers[i].lock);
> -	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
> -		x86_init.irqs.trap_init = kvm_apf_trap_init;
>   
>   	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
>   		has_steal_clock = 1;
> @@ -649,6 +649,9 @@ static void __init kvm_guest_init(void)
>   	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
>   		apic_set_eoi_write(kvm_guest_apic_eoi_write);
>   
> +	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf)
> +		static_branch_enable(&kvm_async_pf_enabled);
> +
>   #ifdef CONFIG_SMP
>   	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
>   	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -988,7 +988,5 @@ void __init trap_init(void)
>   
>   	idt_setup_ist_traps();
>   
> -	x86_init.irqs.trap_init();
> -
>   	idt_setup_debugidt_traps();
>   }
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -79,7 +79,6 @@ struct x86_init_ops x86_init __initdata
>   	.irqs = {
>   		.pre_vector_init	= init_ISA_irqs,
>   		.intr_init		= native_init_IRQ,
> -		.trap_init		= x86_init_noop,
>   		.intr_mode_select	= apic_intr_mode_select,
>   		.intr_mode_init		= apic_intr_mode_init
>   	},
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -30,6 +30,7 @@
>   #include <asm/desc.h>			/* store_idt(), ...		*/
>   #include <asm/cpu_entry_area.h>		/* exception stack		*/
>   #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
> +#include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
>   
>   #define CREATE_TRACE_POINTS
>   #include <asm/trace/exceptions.h>
> @@ -1523,6 +1524,24 @@ do_page_fault(struct pt_regs *regs, unsi
>   		unsigned long address)
>   {
>   	prefetchw(&current->mm->mmap_sem);
> +	/*
> +	 * KVM has two types of events that are, logically, interrupts, but
> +	 * are unfortunately delivered using the #PF vector.  These events are
> +	 * "you just accessed valid memory, but the host doesn't have it right
> +	 * now, so I'll put you to sleep if you continue" and "that memory
> +	 * you tried to access earlier is available now."
> +	 *
> +	 * We are relying on the interrupted context being sane (valid RSP,
> +	 * relevant locks not held, etc.), which is fine as long as the
> +	 * interrupted context had IF=1.  We are also relying on the KVM
> +	 * async pf type field and CR2 being read consistently instead of
> +	 * getting values from real and async page faults mixed up.
> +	 *
> +	 * Fingers crossed.
> +	 */
> +	if (kvm_handle_async_pf(regs, (u32)address))
> +		return;
> +
>   	trace_page_fault_entries(regs, hw_error_code, address);
>   
>   	if (unlikely(kmmio_fault(regs, address)))
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread
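
Since the guest-side fast path is spread over several hunks above, here is the
static-key pattern collected in one place, identifiers as in the patch (a
condensation for readability, not a compilable standalone file):

/* Default-off key; flipped via static_branch_enable() from kvm_guest_init()
 * when the ASYNC_PF feature is present (see the hunks above). */
DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);

/* Always-inline wrapper: with the key off, do_page_fault() pays only a
 * single patched branch and falls through to normal fault handling. */
static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
	if (static_branch_unlikely(&kvm_async_pf_enabled))
		return __kvm_handle_async_pf(regs, token);
	return false;
}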

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06 12:37   ` Alexandre Chartre
@ 2020-05-06 15:03     ` Thomas Gleixner
  2020-05-06 15:33     ` Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-06 15:03 UTC (permalink / raw)
  To: Alexandre Chartre, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)

Alexandre Chartre <alexandre.chartre@oracle.com> writes:
> On 5/5/20 3:16 PM, Thomas Gleixner wrote:
>> @@ -10009,12 +10009,11 @@ static void kick_ilb(unsigned int flags)
>>   		return;
>>   
>>   	/*
>> -	 * Use smp_send_reschedule() instead of resched_cpu().
>> -	 * This way we generate a sched IPI on the target CPU which
>> +	 * This way we generate an IPI on the target CPU which
>>   	 * is idle. And the softirq performing nohz idle load balance
>>   	 * will be run before returning from the IPI.
>>   	 */
>> -	smp_send_reschedule(ilb_cpu);
>> +	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->wake_csd);
>
> This should be nohz_csd instead of wake_csd, no? I.e.:
>
>         smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);

Good catch!

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-05-05 13:16 ` [patch V4 part 1 12/36] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
  2020-05-05 17:54   ` Paul E. McKenney
  2020-05-06  7:00   ` Paolo Bonzini
@ 2020-05-06 15:13   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  3 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:13 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon



On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> While working on the entry consolidation I stumbled over the KVM async page
> fault handler and kvm_async_pf_task_wait() in particular. It took me a
> while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
> invocations are just cargo cult programming. Several patches "fixed" RCU
> splats by curing the symptoms without noticing that the code is flawed
> from a design perspective.
> 
> The main problem is that this async injection is not based on a proper
> handshake mechanism and only respects the minimal requirement, i.e. the
> guest is not in a state where it has interrupts disabled.
> 
> Aside of that the actual code is a convoluted one-fits-it-all swiss army
> knife. It is invoked from different places with different RCU constraints:
> 
>    1) Host side:
> 
>       vcpu_enter_guest()
>         kvm_x86_ops->handle_exit()
>           kvm_handle_page_fault()
>             kvm_async_pf_task_wait()
> 
>       The invocation happens from fully preemptible context.
> 
>    2) Guest side:
> 
>       The async page fault interrupted:
> 
>           a) user space
> 
> 	 b) preemptible kernel code which is not in a RCU read side
> 	    critical section
> 
>       	 c) non-preemptible kernel code or a RCU read side critical section
> 	    or kernel code with CONFIG_PREEMPTION=n which allows not to
> 	    differentiate between #2b and #2c.
> 
> RCU is watching for:
> 
>    #1  The vCPU exited and current is definitely not the idle task
> 
>    #2a The #PF entry code on the guest went through enter_from_user_mode()
>        which reactivates RCU
> 
>    #2b There is no preemptible, interrupts enabled code in the kernel
>        which can run with RCU looking away. (The idle task is always
>        non preemptible).
> 
> I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
> voodoo at all.
> 
> In #2c RCU is eventually not watching, but as that state cannot schedule
> anyway there is no point to worry about it so it has to invoke
> rcu_irq_enter() before running that code. This can be optimized, but this
> will be done as an extra step in course of the entry code consolidation
> work.
> 
> So the proper solution for this is to:
> 
>    - Split kvm_async_pf_task_wait() into schedule and halt based waiting
>      interfaces which share the enqueueing code.
> 
>    - Add comments (condensed form of this changelog) to spare others the
>      time waste and pain of reverse engineering all of this with the help of
>      incomprehensible changelogs and code history.
> 
>    - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
>      user mode and schedulable kernel side async page faults (#1, #2a, #2b)
> 
>    - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
>      case (#2c).
> 
>      For this case also remove the rcu_irq_exit()/enter() pair around the
>      halt as it is just a pointless exercise:
> 
>         - vCPUs can VMEXIT at any random point and can be scheduled out for
>           an arbitrary amount of time by the host and this is not any
>           different except that it voluntary triggers the exit via halt.
> 
>         - The interrupted context could have RCU watching already. So the
> 	 rcu_irq_exit() before the halt is not gaining anything aside of
> 	 confusing the reader. Claiming that this might prevent RCU stalls
> 	 is just an illusion.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Panic if async #PF is injected into an interrupt disabled region.
> ---
>   arch/x86/include/asm/kvm_para.h |    4
>   arch/x86/kernel/kvm.c           |  201 ++++++++++++++++++++++++++++------------
>   arch/x86/kvm/mmu/mmu.c          |    2
>   3 files changed, 144 insertions(+), 63 deletions(-)


Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.


> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsign
>   bool kvm_para_available(void);
>   unsigned int kvm_arch_para_features(void);
>   unsigned int kvm_arch_para_hints(void);
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
> +void kvm_async_pf_task_wait_schedule(u32 token);
>   void kvm_async_pf_task_wake(u32 token);
>   u32 kvm_read_and_reset_pf_reason(void);
>   void kvm_disable_steal_time(void);
> @@ -113,7 +113,7 @@ static inline void kvm_spinlock_init(voi
>   #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>   
>   #else /* CONFIG_KVM_GUEST */
> -#define kvm_async_pf_task_wait(T, I) do {} while(0)
> +#define kvm_async_pf_task_wait_schedule(T) do {} while(0)
>   #define kvm_async_pf_task_wake(T) do {} while(0)
>   
>   static inline bool kvm_para_available(void)
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,7 @@ struct kvm_task_sleep_node {
>   	struct swait_queue_head wq;
>   	u32 token;
>   	int cpu;
> -	bool halted;
> +	bool use_halt;
>   };
>   
>   static struct kvm_task_sleep_head {
> @@ -98,75 +98,145 @@ static struct kvm_task_sleep_node *_find
>   	return NULL;
>   }
>   
> -/*
> - * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> - * 		      (other than user space)?
> - */
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
> +static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> +				    struct kvm_task_sleep_node *n)
>   {
>   	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>   	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> -	struct kvm_task_sleep_node n, *e;
> -	DECLARE_SWAITQUEUE(wait);
> -
> -	rcu_irq_enter();
> +	struct kvm_task_sleep_node *e;
>   
>   	raw_spin_lock(&b->lock);
>   	e = _find_apf_task(b, token);
>   	if (e) {
>   		/* dummy entry exist -> wake up was delivered ahead of PF */
>   		hlist_del(&e->link);
> -		kfree(e);
>   		raw_spin_unlock(&b->lock);
> +		kfree(e);
> +		return false;
> +	}
>   
> -		rcu_irq_exit();
> +	n->token = token;
> +	n->cpu = smp_processor_id();
> +	n->use_halt = use_halt;
> +	init_swait_queue_head(&n->wq);
> +	hlist_add_head(&n->link, &b->list);
> +	raw_spin_unlock(&b->lock);
> +	return true;
> +}
> +
> +/*
> + * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
> + * @token:	Token to identify the sleep node entry
> + *
> + * Invoked from the async pagefault handling code or from the VM exit page
> + * fault handler. In both cases RCU is watching.
> + */
> +void kvm_async_pf_task_wait_schedule(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +	DECLARE_SWAITQUEUE(wait);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (!kvm_async_pf_queue_task(token, false, &n))
>   		return;
> +
> +	for (;;) {
> +		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
> +
> +		local_irq_enable();
> +		schedule();
> +		local_irq_disable();
>   	}
> +	finish_swait(&n.wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>   
> -	n.token = token;
> -	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) ||
> -		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> -		    ? preempt_count() > 1 || rcu_preempt_depth()
> -		    : interrupt_kernel);
> -	init_swait_queue_head(&n.wq);
> -	hlist_add_head(&n.link, &b->list);
> -	raw_spin_unlock(&b->lock);
> +/*
> + * Invoked from the async page fault handler.
> + */
> +static void kvm_async_pf_task_wait_halt(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +
> +	if (!kvm_async_pf_queue_task(token, true, &n))
> +		return;
>   
>   	for (;;) {
> -		if (!n.halted)
> -			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>   		if (hlist_unhashed(&n.link))
>   			break;
> +		/*
> +		 * No point in doing anything about RCU here. Any RCU read
> +		 * side critical section or RCU watching section can be
> +		 * interrupted by VMEXITs and the host is free to keep the
> +		 * vCPU scheduled out as long as it sees fit. This is not
> +		 * any different just because of the halt induced voluntary
> +		 * VMEXIT.
> +		 *
> +		 * Also the async page fault could have interrupted any RCU
> +		 * watching context, so invoking rcu_irq_exit()/enter()
> +		 * around this is not gaining anything.
> +		 */
> +		native_safe_halt();
> +		local_irq_disable();
> +	}
> +}
>   
> -		rcu_irq_exit();
> +/* Invoked from the async page fault handler */
> +static void kvm_async_pf_task_wait(u32 token, bool usermode)
> +{
> +	bool can_schedule;
>   
> -		if (!n.halted) {
> -			local_irq_enable();
> -			schedule();
> -			local_irq_disable();
> -		} else {
> -			/*
> -			 * We cannot reschedule. So halt.
> -			 */
> -			native_safe_halt();
> -			local_irq_disable();
> -		}
> +	/*
> +	 * No need to check whether interrupts were disabled because the
> +	 * host will (hopefully) only inject an async page fault into
> +	 * interrupt enabled regions.
> +	 *
> +	 * If CONFIG_PREEMPTION is enabled then check whether the code
> +	 * which triggered the page fault is preemptible. This covers user
> +	 * mode as well because preempt_count() is obviously 0 there.
> +	 *
> +	 * The check for rcu_preempt_depth() is also required because
> +	 * voluntary scheduling inside a rcu read locked section is not
> +	 * allowed.
> +	 *
> +	 * The idle task is already covered by this because idle always
> +	 * has a preempt count > 0.
> +	 *
> +	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> +	 * coming from user mode as there is no indication whether the
> +	 * context which triggered the page fault could schedule or not.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPTION))
> +		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> +	else
> +		can_schedule = usermode;
>   
> +	/*
> +	 * If the kernel context is allowed to schedule then RCU is
> +	 * watching because no preemptible code in the kernel is inside RCU
> +	 * idle state. So it can be treated like user mode. User mode is
> +	 * safe because the #PF entry invoked enter_from_user_mode().
> +	 *
> +	 * For the non schedulable case invoke rcu_irq_enter() for
> +	 * now. This will be moved out to the pagefault entry code later
> +	 * and only invoked when really needed.
> +	 */
> +	if (can_schedule) {
> +		kvm_async_pf_task_wait_schedule(token);
> +	} else {
>   		rcu_irq_enter();
> +		kvm_async_pf_task_wait_halt(token);
> +		rcu_irq_exit();
>   	}
> -	if (!n.halted)
> -		finish_swait(&n.wq, &wait);
> -
> -	rcu_irq_exit();
> -	return;
>   }
> -EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
>   
>   static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>   {
>   	hlist_del_init(&n->link);
> -	if (n->halted)
> +	if (n->use_halt)
>   		smp_send_reschedule(n->cpu);
>   	else if (swq_has_sleeper(&n->wq))
>   		swake_up_one(&n->wq);
> @@ -177,12 +247,13 @@ static void apf_task_wake_all(void)
>   	int i;
>   
>   	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
> -		struct hlist_node *p, *next;
>   		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
> +		struct kvm_task_sleep_node *n;
> +		struct hlist_node *p, *next;
> +
>   		raw_spin_lock(&b->lock);
>   		hlist_for_each_safe(p, next, &b->list) {
> -			struct kvm_task_sleep_node *n =
> -				hlist_entry(p, typeof(*n), link);
> +			n = hlist_entry(p, typeof(*n), link);
>   			if (n->cpu == smp_processor_id())
>   				apf_task_wake_one(n);
>   		}
> @@ -223,8 +294,9 @@ void kvm_async_pf_task_wake(u32 token)
>   		n->cpu = smp_processor_id();
>   		init_swait_queue_head(&n->wq);
>   		hlist_add_head(&n->link, &b->list);
> -	} else
> +	} else {
>   		apf_task_wake_one(n);
> +	}
>   	raw_spin_unlock(&b->lock);
>   	return;
>   }
> @@ -246,23 +318,33 @@ NOKPROBE_SYMBOL(kvm_read_and_reset_pf_re
>   
>   bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>   {
> -	/*
> -	 * If we get a page fault right here, the pf_reason seems likely
> -	 * to be clobbered.  Bummer.
> -	 */
> -	switch (kvm_read_and_reset_pf_reason()) {
> +	u32 reason = kvm_read_and_reset_pf_reason();
> +
> +	switch (reason) {
> +	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +	case KVM_PV_REASON_PAGE_READY:
> +		break;
>   	default:
>   		return false;
> -	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> +	}
> +
> +	/*
> +	 * If the host managed to inject an async #PF into an interrupt
> +	 * disabled region, then die hard as this is not going to end well
> +	 * and the host side is seriously broken.
> +	 */
> +	if (unlikely(!(regs->flags & X86_EFLAGS_IF)))
> +		panic("Host injected async #PF in interrupt disabled region\n");
> +
> +	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
>   		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, !user_mode(regs));
> -		return true;
> -	case KVM_PV_REASON_PAGE_READY:
> +		kvm_async_pf_task_wait(token, user_mode(regs));
> +	} else {
>   		rcu_irq_enter();
>   		kvm_async_pf_task_wake(token);
>   		rcu_irq_exit();
> -		return true;
>   	}
> +	return true;
>   }
>   NOKPROBE_SYMBOL(__kvm_handle_async_pf);
>   
> @@ -326,12 +408,12 @@ static void kvm_guest_cpu_init(void)
>   
>   		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
>   		__this_cpu_write(apf_reason.enabled, 1);
> -		printk(KERN_INFO"KVM setup async PF for cpu %d\n",
> -		       smp_processor_id());
> +		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
>   	}
>   
>   	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
>   		unsigned long pa;
> +
>   		/* Size alignment is implied but just to make it explicit. */
>   		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
>   		__this_cpu_write(kvm_apic_eoi, 0);
> @@ -352,8 +434,7 @@ static void kvm_pv_disable_apf(void)
>   	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
>   	__this_cpu_write(apf_reason.enabled, 0);
>   
> -	printk(KERN_INFO"Unregister pv shared memory for cpu %d\n",
> -	       smp_processor_id());
> +	pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
>   }
>   
>   static void kvm_pv_guest_cpu_reboot(void *unused)
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4198,7 +4198,7 @@ int kvm_handle_page_fault(struct kvm_vcp
>   	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>   		vcpu->arch.apf.host_apf_reason = 0;
>   		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address, 0);
> +		kvm_async_pf_task_wait_schedule(fault_address);
>   		local_irq_enable();
>   		break;
>   	case KVM_PV_REASON_PAGE_READY:
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread
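
The crux of the patch is the schedule-vs-halt decision in the middle of the
diff. Condensed into a single predicate it reads as follows; the helper name
is made up, the real logic sits inline in kvm_async_pf_task_wait():

/*
 * With CONFIG_PREEMPTION the interrupted context may schedule iff it is
 * preemptible and not inside an RCU read side critical section; without
 * CONFIG_PREEMPTION only a user mode interruption is known to be safe.
 */
static bool async_pf_can_schedule(bool usermode)
{
	if (IS_ENABLED(CONFIG_PREEMPTION))
		return preempt_count() + rcu_preempt_depth() == 0;
	return usermode;
}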

* Re: [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space
  2020-05-05 13:16 ` [patch V4 part 1 13/36] x86/kvm: Restrict ASYNC_PF to user space Thomas Gleixner
  2020-05-06  7:00   ` Paolo Bonzini
@ 2020-05-06 15:29   ` Alexandre Chartre
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:29 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> The async page fault injection into kernel space creates more problems than
> it solves. The host has absolutely no knowledge about the state of the
> guest if the fault happens in CPL0. The only restriction for the host is
> interrupt disabled state. If interrupts are enabled in the guest then the
> exception can hit arbitrary code. The HALT based wait in non-preemptible
> code is a hacky replacement for a proper hypercall.
> 
> For the ongoing work to restrict instrumentation and make the RCU idle
> interaction well defined the required extra work for supporting async
> pagefault in CPL0 is just not justified and creates complexity for a
> dubious benefit.
> 
> The CPL3 injection is well defined and does not cause any issues as it is
> more or less the same as a regular page fault from CPL3.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/kernel/kvm.c |  100 +++-----------------------------------------------
>   1 file changed, 7 insertions(+), 93 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,6 @@ struct kvm_task_sleep_node {
>   	struct swait_queue_head wq;
>   	u32 token;
>   	int cpu;
> -	bool use_halt;
>   };
>   
>   static struct kvm_task_sleep_head {
> @@ -98,8 +97,7 @@ static struct kvm_task_sleep_node *_find
>   	return NULL;
>   }
>   
> -static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> -				    struct kvm_task_sleep_node *n)
> +static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
>   {
>   	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>   	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -117,7 +115,6 @@ static bool kvm_async_pf_queue_task(u32
>   
>   	n->token = token;
>   	n->cpu = smp_processor_id();
> -	n->use_halt = use_halt;
>   	init_swait_queue_head(&n->wq);
>   	hlist_add_head(&n->link, &b->list);
>   	raw_spin_unlock(&b->lock);
> @@ -138,7 +135,7 @@ void kvm_async_pf_task_wait_schedule(u32
>   
>   	lockdep_assert_irqs_disabled();
>   
> -	if (!kvm_async_pf_queue_task(token, false, &n))
> +	if (!kvm_async_pf_queue_task(token, &n))
>   		return;
>   
>   	for (;;) {
> @@ -154,91 +151,10 @@ void kvm_async_pf_task_wait_schedule(u32
>   }
>   EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>   
> -/*
> - * Invoked from the async page fault handler.
> - */
> -static void kvm_async_pf_task_wait_halt(u32 token)
> -{
> -	struct kvm_task_sleep_node n;
> -
> -	if (!kvm_async_pf_queue_task(token, true, &n))
> -		return;
> -
> -	for (;;) {
> -		if (hlist_unhashed(&n.link))
> -			break;
> -		/*
> -		 * No point in doing anything about RCU here. Any RCU read
> -		 * side critical section or RCU watching section can be
> -		 * interrupted by VMEXITs and the host is free to keep the
> -		 * vCPU scheduled out as long as it sees fit. This is not
> -		 * any different just because of the halt induced voluntary
> -		 * VMEXIT.
> -		 *
> -		 * Also the async page fault could have interrupted any RCU
> -		 * watching context, so invoking rcu_irq_exit()/enter()
> -		 * around this is not gaining anything.
> -		 */
> -		native_safe_halt();
> -		local_irq_disable();
> -	}
> -}
> -
> -/* Invoked from the async page fault handler */
> -static void kvm_async_pf_task_wait(u32 token, bool usermode)
> -{
> -	bool can_schedule;
> -
> -	/*
> -	 * No need to check whether interrupts were disabled because the
> -	 * host will (hopefully) only inject an async page fault into
> -	 * interrupt enabled regions.
> -	 *
> -	 * If CONFIG_PREEMPTION is enabled then check whether the code
> -	 * which triggered the page fault is preemptible. This covers user
> -	 * mode as well because preempt_count() is obviously 0 there.
> -	 *
> -	 * The check for rcu_preempt_depth() is also required because
> -	 * voluntary scheduling inside a rcu read locked section is not
> -	 * allowed.
> -	 *
> -	 * The idle task is already covered by this because idle always
> -	 * has a preempt count > 0.
> -	 *
> -	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> -	 * coming from user mode as there is no indication whether the
> -	 * context which triggered the page fault could schedule or not.
> -	 */
> -	if (IS_ENABLED(CONFIG_PREEMPTION))
> -		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> -	else
> -		can_schedule = usermode;
> -
> -	/*
> -	 * If the kernel context is allowed to schedule then RCU is
> -	 * watching because no preemptible code in the kernel is inside RCU
> -	 * idle state. So it can be treated like user mode. User mode is
> -	 * safe because the #PF entry invoked enter_from_user_mode().
> -	 *
> -	 * For the non schedulable case invoke rcu_irq_enter() for
> -	 * now. This will be moved out to the pagefault entry code later
> -	 * and only invoked when really needed.
> -	 */
> -	if (can_schedule) {
> -		kvm_async_pf_task_wait_schedule(token);
> -	} else {
> -		rcu_irq_enter();
> -		kvm_async_pf_task_wait_halt(token);
> -		rcu_irq_exit();
> -	}
> -}
> -
>   static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>   {
>   	hlist_del_init(&n->link);
> -	if (n->use_halt)
> -		smp_send_reschedule(n->cpu);
> -	else if (swq_has_sleeper(&n->wq))
> +	if (swq_has_sleeper(&n->wq))
>   		swake_up_one(&n->wq);
>   }
>   
> @@ -337,8 +253,10 @@ bool __kvm_handle_async_pf(struct pt_reg
>   		panic("Host injected async #PF in interrupt disabled region\n");
>   
>   	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
> -		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, user_mode(regs));
> +		if (unlikely(!(user_mode(regs))))
> +			panic("Host injected async #PF in kernel mode\n");
> +		/* Page is swapped out by the host. */
> +		kvm_async_pf_task_wait_schedule(token);
>   	} else {
>   		rcu_irq_enter();
>   		kvm_async_pf_task_wake(token);
> @@ -397,10 +315,6 @@ static void kvm_guest_cpu_init(void)
>   		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
>   
>   		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
> -
> -#ifdef CONFIG_PREEMPTION
> -		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> -#endif
>   		pa |= KVM_ASYNC_PF_ENABLED;
>   
>   		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06 12:37   ` Alexandre Chartre
  2020-05-06 15:03     ` Thomas Gleixner
@ 2020-05-06 15:33     ` Peter Zijlstra
  2020-05-06 18:28       ` Paul E. McKenney
  1 sibling, 1 reply; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-06 15:33 UTC (permalink / raw)
  To: Alexandre Chartre
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Wed, May 06, 2020 at 02:37:19PM +0200, Alexandre Chartre wrote:
> On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> > @@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(vo
> >   	return false;
> >   }
> > +static void nohz_csd_func(void *info)
> > +{
> > +	struct rq *rq = info;
> > +
> > +	if (got_nohz_idle_kick()) {
> > +		rq->idle_balance = 1;
> > +		raise_softirq_irqoff(SCHED_SOFTIRQ);
> > +	}
> > +}
> > +
> >   #else /* CONFIG_NO_HZ_COMMON */
> >   static inline bool got_nohz_idle_kick(void)

> >   #ifdef CONFIG_NO_HZ_COMMON
> >   		rq->last_blocked_load_update_tick = jiffies;
> >   		atomic_set(&rq->nohz_flags, 0);
> > +
> > +		rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
> >   #endif
> >   #endif /* CONFIG_SMP */
> >   		hrtick_rq_init(rq);
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10009,12 +10009,11 @@ static void kick_ilb(unsigned int flags)
> >   		return;
> >   	/*
> > -	 * Use smp_send_reschedule() instead of resched_cpu().
> > -	 * This way we generate a sched IPI on the target CPU which
> > +	 * This way we generate an IPI on the target CPU which
> >   	 * is idle. And the softirq performing nohz idle load balance
> >   	 * will be run before returning from the IPI.
> >   	 */
> > -	smp_send_reschedule(ilb_cpu);
> > +	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->wake_csd);
> 
> This should be nohz_csd instead of wake_csd, no? I.e.:
> 
>        smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> 

You figured correctly. Thanks!

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-05 13:16 ` [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic() Thomas Gleixner
@ 2020-05-06 15:34   ` Alexandre Chartre
  2020-05-07 17:46   ` Andy Lutomirski
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:34 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> This is completely overengineered and definitely not an interface which
> should be made available to anything else than this particular MCE case.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/include/asm/traps.h   |    2 --
>   arch/x86/kernel/cpu/mce/core.c |    6 ++++--
>   arch/x86/kernel/traps.c        |   37 -------------------------------------
>   3 files changed, 4 insertions(+), 41 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -120,8 +120,6 @@ asmlinkage void smp_irq_move_cleanup_int
>   
>   extern void ist_enter(struct pt_regs *regs);
>   extern void ist_exit(struct pt_regs *regs);
> -extern void ist_begin_non_atomic(struct pt_regs *regs);
> -extern void ist_end_non_atomic(void);
>   
>   #ifdef CONFIG_VMAP_STACK
>   void __noreturn handle_stack_overflow(const char *message,
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1352,13 +1352,15 @@ void notrace do_machine_check(struct pt_
>   
>   	/* Fault was in user mode and we need to take some action */
>   	if ((m.cs & 3) == 3) {
> -		ist_begin_non_atomic(regs);
> +		/* If this triggers there is no way to recover. Die hard. */
> +		BUG_ON(!on_thread_stack() || !user_mode(regs));
>   		local_irq_enable();
> +		preempt_enable();
>   
>   		if (kill_it || do_memory_failure(&m))
>   			force_sig(SIGBUS);
> +		preempt_disable();
>   		local_irq_disable();
> -		ist_end_non_atomic();
>   	} else {
>   		if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
>   			mce_panic("Failed kernel mode recovery", &m, msg);
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -117,43 +117,6 @@ void ist_exit(struct pt_regs *regs)
>   		rcu_nmi_exit();
>   }
>   
> -/**
> - * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
> - * @regs:	regs passed to the IST exception handler
> - *
> - * IST exception handlers normally cannot schedule.  As a special
> - * exception, if the exception interrupted userspace code (i.e.
> - * user_mode(regs) would return true) and the exception was not
> - * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
> - * begins a non-atomic section within an ist_enter()/ist_exit() region.
> - * Callers are responsible for enabling interrupts themselves inside
> - * the non-atomic section, and callers must call ist_end_non_atomic()
> - * before ist_exit().
> - */
> -void ist_begin_non_atomic(struct pt_regs *regs)
> -{
> -	BUG_ON(!user_mode(regs));
> -
> -	/*
> -	 * Sanity check: we need to be on the normal thread stack.  This
> -	 * will catch asm bugs and any attempt to use ist_preempt_enable
> -	 * from double_fault.
> -	 */
> -	BUG_ON(!on_thread_stack());
> -
> -	preempt_enable_no_resched();
> -}
> -
> -/**
> - * ist_end_non_atomic() - begin a non-atomic section in an IST exception
> - *
> - * Ends a non-atomic section started with ist_begin_non_atomic().
> - */
> -void ist_end_non_atomic(void)
> -{
> -	preempt_disable();
> -}
> -
>   int is_valid_bugaddr(unsigned long addr)
>   {
>   	unsigned short ud;
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-06 11:53   ` Miroslav Benes
  2020-05-06 12:06     ` Thomas Gleixner
@ 2020-05-06 15:35     ` Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-06 15:35 UTC (permalink / raw)
  To: Miroslav Benes
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Wed, May 06, 2020 at 01:53:36PM +0200, Miroslav Benes wrote:
> On Tue, 5 May 2020, Thomas Gleixner wrote:
> 
> > Make sure task_work runs before any kind of userspace -- very much
> > including signals -- is invoked.
> 
> I might be missing something, but isn't this guaranteed by 
> do_signal()->get_signal()->task_work_run() path?

Yes, that too, that 'hack' can now be removed I suppose.
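
IOW, after the flip the relevant part of the exit-to-usermode loop looks
roughly like this (simplified sketch, not the literal hunk):

	if (cached_flags & _TIF_NOTIFY_RESUME) {
		clear_thread_flag(TIF_NOTIFY_RESUME);
		tracehook_notify_resume(regs);	/* runs task_work_run() */
	}

	if (cached_flags & _TIF_SIGPENDING)
		do_signal(regs);	/* signal delivery now sees completed task_work */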

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations
  2020-05-06 13:33   ` Will Deacon
@ 2020-05-06 15:36     ` Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-06 15:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf

On Wed, May 06, 2020 at 02:33:33PM +0100, Will Deacon wrote:
> On Tue, May 05, 2020 at 03:16:08PM +0200, Thomas Gleixner wrote:
> > READ/WRITE_ONCE_NOCHECK() is required for atomics in code which cannot be
> > instrumented like the x86 int3 text poke code. As READ/WRITE_ONCE() is
> > undergoing a rewrite, provide __{READ,WRITE}_ONCE_SCALAR().
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  include/linux/compiler.h |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -313,6 +313,14 @@ unsigned long read_word_at_a_time(const
> >  	__u.__val;					\
> >  })
> >  
> > +#define __READ_ONCE_SCALAR(x)				\
> > +	(*(const volatile typeof(x) *)&(x))
> > +
> > +#define __WRITE_ONCE_SCALAR(x, val)			\
> > +do {							\
> > +	*(volatile typeof(x) *)&(x) = val;		\
> > +} while (0)
> 
> FWIW, these end up being called __READ_ONCE() and __WRITE_ONCE() after
> the rewrite; the *_SCALAR() variants will call into kcsan_check_atomic_*().
> 
> If you go with that naming now, then any later conflict should fall out in
> the wash.

Ah excellent, clearly we had slightly different resolutions vs kcsan.
Thanks!
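
For the record, the use case looks roughly like this: noinstr code such as
the int3 text poke path cannot go through the instrumented
READ_ONCE()/WRITE_ONCE() because those may emit KASAN/KCSAN calls, so it
reaches for the raw scalar forms instead (sketch only, the variable and
function are made up for illustration):

	static int bp_refs;	/* placeholder state touched from noinstr context */

	static noinstr void poke_example(void)
	{
		/* plain volatile accesses, no instrumentation hooks emitted */
		int refs = __READ_ONCE_SCALAR(bp_refs);

		__WRITE_ONCE_SCALAR(bp_refs, refs + 1);
	}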

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 15/36] kprobes: Lock kprobe_mutex while showing kprobe_blacklist
  2020-05-05 13:16 ` [patch V4 part 1 15/36] kprobes: Lock kprobe_mutex while showing kprobe_blacklist Thomas Gleixner
@ 2020-05-06 15:38   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:38 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Lock kprobe_mutex while showing kprobe_blacklist to prevent updating the
> kprobe_blacklist.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lkml.kernel.org/r/158523417665.24735.10253198878535635600.stgit@devnote2
> 
> ---
>   kernel/kprobes.c |    8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2426,6 +2426,7 @@ static const struct file_operations debu
>   /* kprobes/blacklist -- shows which functions can not be probed */
>   static void *kprobe_blacklist_seq_start(struct seq_file *m, loff_t *pos)
>   {
> +	mutex_lock(&kprobe_mutex);
>   	return seq_list_start(&kprobe_blacklist, *pos);
>   }
>   
> @@ -2452,10 +2453,15 @@ static int kprobe_blacklist_seq_show(str
>   	return 0;
>   }
>   
> +static void kprobe_blacklist_seq_stop(struct seq_file *f, void *v)
> +{
> +	mutex_unlock(&kprobe_mutex);
> +}
> +
>   static const struct seq_operations kprobe_blacklist_seq_ops = {
>   	.start = kprobe_blacklist_seq_start,
>   	.next  = kprobe_blacklist_seq_next,
> -	.stop  = kprobe_seq_stop,	/* Reuse void function */
> +	.stop  = kprobe_blacklist_seq_stop,
>   	.show  = kprobe_blacklist_seq_show,
>   };
>   
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing
  2020-05-05 20:39   ` Brian Gerst
@ 2020-05-06 15:42     ` Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-06 15:42 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, LKML, the arch/x86 maintainers,
	Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, May 05, 2020 at 04:39:01PM -0400, Brian Gerst wrote:
> On Tue, May 5, 2020 at 10:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > The sanitizers are not really applicable to the fragile low level entry
> > code. code. Entry code needs to carefully setup a normal 'runtime'
> > environment.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  arch/x86/entry/Makefile |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > --- a/arch/x86/entry/Makefile
> > +++ b/arch/x86/entry/Makefile
> > @@ -3,6 +3,14 @@
> >  # Makefile for the x86 low level entry code
> >  #
> >
> > +KASAN_SANITIZE := n
> > +UBSAN_SANITIZE := n
> > +KCOV_INSTRUMENT := n
> > +
> > +CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> > +CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> > +CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> 
> Is this necessary for syscall_*.o?  They just contain the syscall
> tables (ie. data).

Probably not, but I just made sure to kill everything, less chance an
accident happens.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 16/36] kprobes: Support __kprobes blacklist in modules
  2020-05-05 13:16 ` [patch V4 part 1 16/36] kprobes: Support __kprobes blacklist in modules Thomas Gleixner
@ 2020-05-06 15:47   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:47 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Support __kprobes attribute for blacklist functions in modules.  The
> __kprobes attribute functions are stored in .kprobes.text section.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lkml.kernel.org/r/158523418821.24735.15379873028352411526.stgit@devnote2
> 
> ---
>   include/linux/module.h |    4 ++++
>   kernel/kprobes.c       |   42 ++++++++++++++++++++++++++++++++++++++++++
>   kernel/module.c        |    4 ++++
>   3 files changed, 50 insertions(+)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -489,6 +489,10 @@ struct module {
>   	unsigned int num_ftrace_callsites;
>   	unsigned long *ftrace_callsites;
>   #endif
> +#ifdef CONFIG_KPROBES
> +	void *kprobes_text_start;
> +	unsigned int kprobes_text_size;
> +#endif
>   
>   #ifdef CONFIG_LIVEPATCH
>   	bool klp; /* Is this a livepatch module? */
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2179,6 +2179,19 @@ int kprobe_add_area_blacklist(unsigned l
>   	return 0;
>   }
>   
> +/* Remove all symbols in given area from kprobe blacklist */
> +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> +{
> +	struct kprobe_blacklist_entry *ent, *n;
> +
> +	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> +		if (ent->start_addr < start || ent->start_addr >= end)
> +			continue;
> +		list_del(&ent->list);
> +		kfree(ent);
> +	}
> +}
> +
>   int __init __weak arch_populate_kprobe_blacklist(void)
>   {
>   	return 0;
> @@ -2215,6 +2228,28 @@ static int __init populate_kprobe_blackl
>   	return ret ? : arch_populate_kprobe_blacklist();
>   }
>   
> +static void add_module_kprobe_blacklist(struct module *mod)
> +{
> +	unsigned long start, end;
> +
> +	start = (unsigned long)mod->kprobes_text_start;
> +	if (start) {
> +		end = start + mod->kprobes_text_size;
> +		kprobe_add_area_blacklist(start, end);
> +	}
> +}
> +
> +static void remove_module_kprobe_blacklist(struct module *mod)
> +{
> +	unsigned long start, end;
> +
> +	start = (unsigned long)mod->kprobes_text_start;
> +	if (start) {
> +		end = start + mod->kprobes_text_size;
> +		kprobe_remove_area_blacklist(start, end);
> +	}
> +}
> +
>   /* Module notifier call back, checking kprobes on the module */
>   static int kprobes_module_callback(struct notifier_block *nb,
>   				   unsigned long val, void *data)
> @@ -2225,6 +2260,11 @@ static int kprobes_module_callback(struc
>   	unsigned int i;
>   	int checkcore = (val == MODULE_STATE_GOING);
>   
> +	if (val == MODULE_STATE_COMING) {
> +		mutex_lock(&kprobe_mutex);
> +		add_module_kprobe_blacklist(mod);
> +		mutex_unlock(&kprobe_mutex);
> +	}
>   	if (val != MODULE_STATE_GOING && val != MODULE_STATE_LIVE)
>   		return NOTIFY_DONE;
>   
> @@ -2255,6 +2295,8 @@ static int kprobes_module_callback(struc
>   				kill_kprobe(p);
>   			}
>   	}
> +	if (val == MODULE_STATE_GOING)
> +		remove_module_kprobe_blacklist(mod);
>   	mutex_unlock(&kprobe_mutex);
>   	return NOTIFY_DONE;
>   }
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3194,6 +3194,10 @@ static int find_module_sections(struct m
>   					    sizeof(*mod->ei_funcs),
>   					    &mod->num_ei_funcs);
>   #endif
> +#ifdef CONFIG_KPROBES
> +	mod->kprobes_text_start = section_objs(info, ".kprobes.text", 1,
> +						&mod->kprobes_text_size);
> +#endif
>   	mod->extable = section_objs(info, "__ex_table",
>   				    sizeof(*mod->extable), &mod->num_exentries);
>   
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 17/36] kprobes: Support NOKPROBE_SYMBOL() in modules
  2020-05-05 13:16 ` [patch V4 part 1 17/36] kprobes: Support NOKPROBE_SYMBOL() " Thomas Gleixner
@ 2020-05-06 15:54   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:54 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Support NOKPROBE_SYMBOL() in modules. NOKPROBE_SYMBOL() records only symbol
> address in "_kprobe_blacklist" section in the module.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lkml.kernel.org/r/158523419989.24735.6665260504057165207.stgit@devnote2
> 
> ---
>   include/linux/module.h |    2 ++
>   kernel/kprobes.c       |   17 +++++++++++++++++
>   kernel/module.c        |    3 +++
>   3 files changed, 22 insertions(+)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> diff --git a/include/linux/module.h b/include/linux/module.h
> index 369c354f9207..1192097c9a69 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -492,6 +492,8 @@ struct module {
>   #ifdef CONFIG_KPROBES
>   	void *kprobes_text_start;
>   	unsigned int kprobes_text_size;
> +	unsigned long *kprobe_blacklist;
> +	unsigned int num_kprobe_blacklist;
>   #endif
>   
>   #ifdef CONFIG_LIVEPATCH
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index b7549992b9bd..9eb5acf0a9f3 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2192,6 +2192,11 @@ static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
>   	}
>   }
>   
> +static void kprobe_remove_ksym_blacklist(unsigned long entry)
> +{
> +	kprobe_remove_area_blacklist(entry, entry + 1);
> +}
> +
>   int __init __weak arch_populate_kprobe_blacklist(void)
>   {
>   	return 0;
> @@ -2231,6 +2236,12 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
>   static void add_module_kprobe_blacklist(struct module *mod)
>   {
>   	unsigned long start, end;
> +	int i;
> +
> +	if (mod->kprobe_blacklist) {
> +		for (i = 0; i < mod->num_kprobe_blacklist; i++)
> +			kprobe_add_ksym_blacklist(mod->kprobe_blacklist[i]);
> +	}
>   
>   	start = (unsigned long)mod->kprobes_text_start;
>   	if (start) {
> @@ -2242,6 +2253,12 @@ static void add_module_kprobe_blacklist(struct module *mod)
>   static void remove_module_kprobe_blacklist(struct module *mod)
>   {
>   	unsigned long start, end;
> +	int i;
> +
> +	if (mod->kprobe_blacklist) {
> +		for (i = 0; i < mod->num_kprobe_blacklist; i++)
> +			kprobe_remove_ksym_blacklist(mod->kprobe_blacklist[i]);
> +	}
>   
>   	start = (unsigned long)mod->kprobes_text_start;
>   	if (start) {
> diff --git a/kernel/module.c b/kernel/module.c
> index f7712a923d63..7be011dacd8a 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3197,6 +3197,9 @@ static int find_module_sections(struct module *mod, struct load_info *info)
>   #ifdef CONFIG_KPROBES
>   	mod->kprobes_text_start = section_objs(info, ".kprobes.text", 1,
>   						&mod->kprobes_text_size);
> +	mod->kprobe_blacklist = section_objs(info, "_kprobe_blacklist",
> +						sizeof(unsigned long),
> +						&mod->num_kprobe_blacklist);
>   #endif
>   	mod->extable = section_objs(info, "__ex_table",
>   				    sizeof(*mod->extable), &mod->num_exentries);
> 
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 18/36] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers.
  2020-05-05 13:16 ` [patch V4 part 1 18/36] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers Thomas Gleixner
@ 2020-05-06 15:57   ` Alexandre Chartre
  2020-05-12 15:18   ` [tip: core/kprobes] " tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 15:57 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Add __kprobes and NOKPROBE_SYMBOL() for sample kprobe handlers.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lkml.kernel.org/r/158523421177.24735.16273975317343670204.stgit@devnote2
> 
> ---
>   samples/kprobes/kprobe_example.c    |    6 ++++--
>   samples/kprobes/kretprobe_example.c |    2 ++
>   2 files changed, 6 insertions(+), 2 deletions(-)

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.

> 
> --- a/samples/kprobes/kprobe_example.c
> +++ b/samples/kprobes/kprobe_example.c
> @@ -25,7 +25,7 @@ static struct kprobe kp = {
>   };
>   
>   /* kprobe pre_handler: called just before the probed instruction is executed */
> -static int handler_pre(struct kprobe *p, struct pt_regs *regs)
> +static int __kprobes handler_pre(struct kprobe *p, struct pt_regs *regs)
>   {
>   #ifdef CONFIG_X86
>   	pr_info("<%s> pre_handler: p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
> @@ -54,7 +54,7 @@ static int handler_pre(struct kprobe *p,
>   }
>   
>   /* kprobe post_handler: called after the probed instruction is executed */
> -static void handler_post(struct kprobe *p, struct pt_regs *regs,
> +static void __kprobes handler_post(struct kprobe *p, struct pt_regs *regs,
>   				unsigned long flags)
>   {
>   #ifdef CONFIG_X86
> @@ -90,6 +90,8 @@ static int handler_fault(struct kprobe *
>   	/* Return 0 because we don't handle the fault. */
>   	return 0;
>   }
> +/* NOKPROBE_SYMBOL() is also available */
> +NOKPROBE_SYMBOL(handler_fault);
>   
>   static int __init kprobe_init(void)
>   {
> --- a/samples/kprobes/kretprobe_example.c
> +++ b/samples/kprobes/kretprobe_example.c
> @@ -48,6 +48,7 @@ static int entry_handler(struct kretprob
>   	data->entry_stamp = ktime_get();
>   	return 0;
>   }
> +NOKPROBE_SYMBOL(entry_handler);
>   
>   /*
>    * Return-probe handler: Log the return value and duration. Duration may turn
> @@ -67,6 +68,7 @@ static int ret_handler(struct kretprobe_
>   			func_name, retval, (long long)delta);
>   	return 0;
>   }
> +NOKPROBE_SYMBOL(ret_handler);
>   
>   static struct kretprobe my_kretprobe = {
>   	.handler		= ret_handler,
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing
  2020-05-05 13:16 ` [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing Thomas Gleixner
  2020-05-05 20:39   ` Brian Gerst
@ 2020-05-06 16:03   ` Alexandre Chartre
  2020-05-13 22:58     ` Mathieu Desnoyers
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 16:03 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> The sanitizers are not really applicable to the fragile low level entry
> code. code. Entry code needs to carefully setup a normal 'runtime'

typo: code. code.

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.


> environment.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/entry/Makefile |    8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> --- a/arch/x86/entry/Makefile
> +++ b/arch/x86/entry/Makefile
> @@ -3,6 +3,14 @@
>   # Makefile for the x86 low level entry code
>   #
>   
> +KASAN_SANITIZE := n
> +UBSAN_SANITIZE := n
> +KCOV_INSTRUMENT := n
> +
> +CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> +CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> +CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector -fstack-protector-strong
> +
>   OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
>   
>   CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation
  2020-05-05 13:16 ` [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation Thomas Gleixner
@ 2020-05-06 16:08   ` Sean Christopherson
  2020-05-06 16:28     ` Peter Zijlstra
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 178+ messages in thread
From: Sean Christopherson @ 2020-05-06 16:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

On Tue, May 05, 2020 at 03:16:22PM +0200, Thomas Gleixner wrote:
> Provide also a set of markers: instr_begin()/end()
> 
> These are used to mark code inside a noinstr function which calls
> into regular instrumentable text section as safe.

...

> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -120,10 +120,27 @@ void ftrace_likely_update(struct ftrace_
>  /* Annotate a C jump table to allow objtool to follow the code flow */
>  #define __annotate_jump_table __section(.rodata..c_jump_table)
>  
> +/* Begin/end of an instrumentation safe region */
> +#define instr_begin() ({						\
> +	asm volatile("%c0:\n\t"						\
> +		     ".pushsection .discard.instr_begin\n\t"		\
> +		     ".long %c0b - .\n\t"				\
> +		     ".popsection\n\t" : : "i" (__COUNTER__));		\
> +})
> +
> +#define instr_end() ({							\
> +	asm volatile("%c0:\n\t"						\
> +		     ".pushsection .discard.instr_end\n\t"		\
> +		     ".long %c0b - .\n\t"				\
> +		     ".popsection\n\t" : : "i" (__COUNTER__));		\
> +})

Any chance we could spell these out, i.e. instrumentation_begin/end()?  I
can't help but read these as "instruction_begin/end".  At a glance, the
long names shouldn't cause any wrap/indentation issues.

E.g. some of the usage in KVM is especially confusing

--- a/arch/x86/kvm/vmx/ops.h
+++ b/arch/x86/kvm/vmx/ops.h
@@ -146,7 +146,9 @@ do {						\
 			  : : op1 : "cc" : error, fault);		\
 	return;								\
 error:									\
+	instr_begin();							\
 	insn##_error(error_args);					\
+	instr_end();							\
 	return;								\
 fault:									\
 	kvm_spurious_fault();						\
@@ -161,7 +163,9 @@ do {						\
 			  : : op1, op2 : "cc" : error, fault);		\
 	return;								\
 error:									\
+	instr_begin();							\
 	insn##_error(error_args);					\
+	instr_end();							\
 	return;								\
 fault:									\
 	kvm_spurious_fault();						\


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
  2020-05-06 11:53   ` Miroslav Benes
  2020-05-06 13:06   ` Alexandre Chartre
@ 2020-05-06 16:26   ` Borislav Petkov
  2020-05-07 17:35   ` Andy Lutomirski
  2020-05-13 20:56   ` Mathieu Desnoyers
  4 siblings, 0 replies; 178+ messages in thread
From: Borislav Petkov @ 2020-05-06 16:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:07PM +0200, Thomas Gleixner wrote:
> Make sure task_work runs before any kind of userspace -- very much
> including signals -- is invoked.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Does this need

From: Peter

too?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation
  2020-05-06 16:08   ` Sean Christopherson
@ 2020-05-06 16:28     ` Peter Zijlstra
  2020-05-06 16:57       ` Thomas Gleixner
  0 siblings, 1 reply; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-06 16:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Wed, May 06, 2020 at 09:08:31AM -0700, Sean Christopherson wrote:
> On Tue, May 05, 2020 at 03:16:22PM +0200, Thomas Gleixner wrote:
> > Provide also a set of markers: instr_begin()/end()
> > 
> > These are used to mark code inside a noinstr function which calls
> > into regular instrumentable text section as safe.
> 
> ...
> 
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -120,10 +120,27 @@ void ftrace_likely_update(struct ftrace_
> >  /* Annotate a C jump table to allow objtool to follow the code flow */
> >  #define __annotate_jump_table __section(.rodata..c_jump_table)
> >  
> > +/* Begin/end of an instrumentation safe region */
> > +#define instr_begin() ({						\
> > +	asm volatile("%c0:\n\t"						\
> > +		     ".pushsection .discard.instr_begin\n\t"		\
> > +		     ".long %c0b - .\n\t"				\
> > +		     ".popsection\n\t" : : "i" (__COUNTER__));		\
> > +})
> > +
> > +#define instr_end() ({							\
> > +	asm volatile("%c0:\n\t"						\
> > +		     ".pushsection .discard.instr_end\n\t"		\
> > +		     ".long %c0b - .\n\t"				\
> > +		     ".popsection\n\t" : : "i" (__COUNTER__));		\
> > +})
> 
> Any chance we could spell these out, i.e. instrumentation_begin/end()?  I
> can't help but read these as "instruction_begin/end".  At a glance, the
> long names shouldn't cause any wrap/indentation issues.

The kernel naming convention is insn for instruction, not instr. That
said, you're not the first to be confused by this.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations
  2020-05-05 13:16 ` [patch V4 part 1 06/36] compiler: Simple READ/WRITE_ONCE() implementations Thomas Gleixner
  2020-05-06 13:11   ` Alexandre Chartre
  2020-05-06 13:33   ` Will Deacon
@ 2020-05-06 16:33   ` Borislav Petkov
  2 siblings, 0 replies; 178+ messages in thread
From: Borislav Petkov @ 2020-05-06 16:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 05, 2020 at 03:16:08PM +0200, Thomas Gleixner wrote:
> READ/WRITE_ONCE_NOCHECK() is required for atomics in code which cannot be
> instrumented like the x86 int3 text poke code. As READ/WRITE_ONCE() is
> undergoing a rewrite, provide __{READ,WRITE}_ONCE_SCALAR().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Subject needs a verb and patch needs From: Peter, judging by the SOB
chain.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation
  2020-05-06 16:28     ` Peter Zijlstra
@ 2020-05-06 16:57       ` Thomas Gleixner
  0 siblings, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-06 16:57 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon

Peter Zijlstra <peterz@infradead.org> writes:
> On Wed, May 06, 2020 at 09:08:31AM -0700, Sean Christopherson wrote:
>> On Tue, May 05, 2020 at 03:16:22PM +0200, Thomas Gleixner wrote:
>> > Provide also a set of markers: instr_begin()/end()
>> > 
>> > These are used to mark code inside a noinstr function which calls
>> > into regular instrumentable text section as safe.
>> 
>> ...
>> 
>> > --- a/include/linux/compiler.h
>> > +++ b/include/linux/compiler.h
>> > @@ -120,10 +120,27 @@ void ftrace_likely_update(struct ftrace_
>> >  /* Annotate a C jump table to allow objtool to follow the code flow */
>> >  #define __annotate_jump_table __section(.rodata..c_jump_table)
>> >  
>> > +/* Begin/end of an instrumentation safe region */
>> > +#define instr_begin() ({						\
>> > +	asm volatile("%c0:\n\t"						\
>> > +		     ".pushsection .discard.instr_begin\n\t"		\
>> > +		     ".long %c0b - .\n\t"				\
>> > +		     ".popsection\n\t" : : "i" (__COUNTER__));		\
>> > +})
>> > +
>> > +#define instr_end() ({							\
>> > +	asm volatile("%c0:\n\t"						\
>> > +		     ".pushsection .discard.instr_end\n\t"		\
>> > +		     ".long %c0b - .\n\t"				\
>> > +		     ".popsection\n\t" : : "i" (__COUNTER__));		\
>> > +})
>> 
>> Any chance we could spell these out, i.e. instrumentation_begin/end()?  I
>> can't help but read these as "instruction_begin/end".  At a glance, the
>> long names shouldn't cause any wrap/indentation issues.
>
> The kernel naming convention is insn for instruction, not instr. That
> said, you're not the first to be confused by this.

I'm happy to spell it out. Was just lazy I guess.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
  2020-05-05 13:16 ` [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi() Thomas Gleixner
  2020-05-05 18:13   ` Paul E. McKenney
@ 2020-05-06 17:09   ` Alexandre Chartre
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Paul E. McKenney
  2 siblings, 0 replies; 178+ messages in thread
From: Alexandre Chartre @ 2020-05-06 17:09 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel)


On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> From: Paul E. McKenney <paulmck@kernel.org>
> 
> The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
> "irq" parameter that indicates whether these functions are invoked from
> an irq handler (irq==true) or an NMI handler (irq==false).  However,
> recent changes have applied notrace to a few critical functions such
> that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely
> on in_nmi().  Note that in_nmi() works no differently than before,
> but rather that tracing is now prohibited in code regions where in_nmi()
> would incorrectly report NMI state.
> 
> Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
> rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
> respectively.
> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/rcu/tree.c |   47 +++++++++++++++--------------------------------
>   1 file changed, 15 insertions(+), 32 deletions(-)

I already sent an RB for the first patches of this series, and went through the
remaining ones (20-36). I am not very familiar with some of these areas, so for
what it's worth:

Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>

alex.


> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -627,16 +627,18 @@ noinstr void rcu_user_enter(void)
>   }
>   #endif /* CONFIG_NO_HZ_FULL */
>   
> -/*
> +/**
> + * rcu_nmi_exit - inform RCU of exit from NMI context
> + *
>    * If we are returning from the outermost NMI handler that interrupted an
>    * RCU-idle period, update rdp->dynticks and rdp->dynticks_nmi_nesting
>    * to let the RCU grace-period handling know that the CPU is back to
>    * being RCU-idle.
>    *
> - * If you add or remove a call to rcu_nmi_exit_common(), be sure to test
> + * If you add or remove a call to rcu_nmi_exit(), be sure to test
>    * with CONFIG_RCU_EQS_DEBUG=y.
>    */
> -static __always_inline void rcu_nmi_exit_common(bool irq)
> +noinstr void rcu_nmi_exit(void)
>   {
>   	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>   
> @@ -667,7 +669,7 @@ static __always_inline void rcu_nmi_exit
>   	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
>   	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
>   
> -	if (irq)
> +	if (!in_nmi())
>   		rcu_prepare_for_idle();
>   	instr_end();
>   
> @@ -675,22 +677,11 @@ static __always_inline void rcu_nmi_exit
>   	rcu_dynticks_eqs_enter();
>   	// ... but is no longer watching here.
>   
> -	if (irq)
> +	if (!in_nmi())
>   		rcu_dynticks_task_enter();
>   }
>   
>   /**
> - * rcu_nmi_exit - inform RCU of exit from NMI context
> - *
> - * If you add or remove a call to rcu_nmi_exit(), be sure to test
> - * with CONFIG_RCU_EQS_DEBUG=y.
> - */
> -void noinstr rcu_nmi_exit(void)
> -{
> -	rcu_nmi_exit_common(false);
> -}
> -
> -/**
>    * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
>    *
>    * Exit from an interrupt handler, which might possibly result in entering
> @@ -712,7 +703,7 @@ void noinstr rcu_nmi_exit(void)
>   void noinstr rcu_irq_exit(void)
>   {
>   	lockdep_assert_irqs_disabled();
> -	rcu_nmi_exit_common(true);
> +	rcu_nmi_exit();
>   }
>   
>   /*
> @@ -801,7 +792,7 @@ void noinstr rcu_user_exit(void)
>   #endif /* CONFIG_NO_HZ_FULL */
>   
>   /**
> - * rcu_nmi_enter_common - inform RCU of entry to NMI context
> + * rcu_nmi_enter - inform RCU of entry to NMI context
>    * @irq: Is this call from rcu_irq_enter?
>    *
>    * If the CPU was idle from RCU's viewpoint, update rdp->dynticks and
> @@ -810,10 +801,10 @@ void noinstr rcu_user_exit(void)
>    * long as the nesting level does not overflow an int.  (You will probably
>    * run out of stack space first.)
>    *
> - * If you add or remove a call to rcu_nmi_enter_common(), be sure to test
> + * If you add or remove a call to rcu_nmi_enter(), be sure to test
>    * with CONFIG_RCU_EQS_DEBUG=y.
>    */
> -static __always_inline void rcu_nmi_enter_common(bool irq)
> +noinstr void rcu_nmi_enter(void)
>   {
>   	long incby = 2;
>   	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> @@ -831,18 +822,18 @@ static __always_inline void rcu_nmi_ente
>   	 */
>   	if (rcu_dynticks_curr_cpu_in_eqs()) {
>   
> -		if (irq)
> +		if (!in_nmi())
>   			rcu_dynticks_task_exit();
>   
>   		// RCU is not watching here ...
>   		rcu_dynticks_eqs_exit();
>   		// ... but is watching here.
>   
> -		if (irq)
> +		if (!in_nmi())
>   			rcu_cleanup_after_idle();
>   
>   		incby = 1;
> -	} else if (irq) {
> +	} else if (!in_nmi()) {
>   		instr_begin();
>   		if (tick_nohz_full_cpu(rdp->cpu) &&
>   		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
> @@ -877,14 +868,6 @@ static __always_inline void rcu_nmi_ente
>   }
>   
>   /**
> - * rcu_nmi_enter - inform RCU of entry to NMI context
> - */
> -noinstr void rcu_nmi_enter(void)
> -{
> -	rcu_nmi_enter_common(false);
> -}
> -
> -/**
>    * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
>    *
>    * Enter an interrupt handler, which might possibly result in exiting
> @@ -909,7 +892,7 @@ noinstr void rcu_nmi_enter(void)
>   noinstr void rcu_irq_enter(void)
>   {
>   	lockdep_assert_irqs_disabled();
> -	rcu_nmi_enter_common(true);
> +	rcu_nmi_enter();
>   }
>   
>   /*
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06 15:33     ` Peter Zijlstra
@ 2020-05-06 18:28       ` Paul E. McKenney
  2020-05-06 18:37         ` Peter Zijlstra
  0 siblings, 1 reply; 178+ messages in thread
From: Paul E. McKenney @ 2020-05-06 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexandre Chartre, Thomas Gleixner, LKML, x86, Andy Lutomirski,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Wed, May 06, 2020 at 05:33:00PM +0200, Peter Zijlstra wrote:
> On Wed, May 06, 2020 at 02:37:19PM +0200, Alexandre Chartre wrote:
> > On 5/5/20 3:16 PM, Thomas Gleixner wrote:
> > > @@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(vo
> > >   	return false;
> > >   }
> > > +static void nohz_csd_func(void *info)
> > > +{
> > > +	struct rq *rq = info;
> > > +
> > > +	if (got_nohz_idle_kick()) {
> > > +		rq->idle_balance = 1;
> > > +		raise_softirq_irqoff(SCHED_SOFTIRQ);
> > > +	}
> > > +}
> > > +
> > >   #else /* CONFIG_NO_HZ_COMMON */
> > >   static inline bool got_nohz_idle_kick(void)
> 
> > >   #ifdef CONFIG_NO_HZ_COMMON
> > >   		rq->last_blocked_load_update_tick = jiffies;
> > >   		atomic_set(&rq->nohz_flags, 0);
> > > +
> > > +		rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
> > >   #endif
> > >   #endif /* CONFIG_SMP */
> > >   		hrtick_rq_init(rq);
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10009,12 +10009,11 @@ static void kick_ilb(unsigned int flags)
> > >   		return;
> > >   	/*
> > > -	 * Use smp_send_reschedule() instead of resched_cpu().
> > > -	 * This way we generate a sched IPI on the target CPU which
> > > +	 * This way we generate an IPI on the target CPU which
> > >   	 * is idle. And the softirq performing nohz idle load balance
> > >   	 * will be run before returning from the IPI.
> > >   	 */
> > > -	smp_send_reschedule(ilb_cpu);
> > > +	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->wake_csd);
> > 
> > This should be nohz_csd instead of wake_csd, no? I.e.:
> > 
> >        smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> 
> You figured correctly. Thanks!

And this does get rid of the smp_call_function splats I was seeing
earlier, thank you!

I still see warnings of the form "leave instruction with modified stack
frame" from older complilers and of the form "undefined stack state"
from newer compilers.  I am running stock objtool versions, so I am
guessing that this is at least one reason for these warnings.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06 18:28       ` Paul E. McKenney
@ 2020-05-06 18:37         ` Peter Zijlstra
  2020-05-06 18:46           ` Paul E. McKenney
  0 siblings, 1 reply; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-06 18:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexandre Chartre, Thomas Gleixner, LKML, x86, Andy Lutomirski,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Wed, May 06, 2020 at 11:28:56AM -0700, Paul E. McKenney wrote:
> I still see warnings of the form "leave instruction with modified stack
> frame" from older complilers and of the form "undefined stack state"
> from newer compilers.  I am running stock objtool versions, so I am
> guessing that this is at least one reason for these warnings.

Part 5, patch 2, might be responsible. I still have to look at curing
that.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 03/36] sched: Clean up scheduler_ipi()
  2020-05-06 18:37         ` Peter Zijlstra
@ 2020-05-06 18:46           ` Paul E. McKenney
  0 siblings, 0 replies; 178+ messages in thread
From: Paul E. McKenney @ 2020-05-06 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexandre Chartre, Thomas Gleixner, LKML, x86, Andy Lutomirski,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Wed, May 06, 2020 at 08:37:03PM +0200, Peter Zijlstra wrote:
> On Wed, May 06, 2020 at 11:28:56AM -0700, Paul E. McKenney wrote:
> > I still see warnings of the form "leave instruction with modified stack
> > frame" from older complilers and of the form "undefined stack state"
> > from newer compilers.  I am running stock objtool versions, so I am
> > guessing that this is at least one reason for these warnings.
> 
> Part 5, patch 2, might be responsible. I still have to look at curing
> that.

Very good, I will await further patches.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
                     ` (2 preceding siblings ...)
  2020-05-06 16:26   ` Borislav Petkov
@ 2020-05-07 17:35   ` Andy Lutomirski
  2020-05-13 20:56   ` Mathieu Desnoyers
  4 siblings, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-07 17:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Make sure task_work runs before any kind of userspace -- very much
> including signals -- is invoked.

I certainly approve of the change, but anything that fundamentally
relies on this makes me uneasy.  I'll keep an eye out for potential
problems.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-05 13:16 ` [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic() Thomas Gleixner
  2020-05-06 15:34   ` Alexandre Chartre
@ 2020-05-07 17:46   ` Andy Lutomirski
  2020-05-13 22:57   ` Mathieu Desnoyers
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Thomas Gleixner
  3 siblings, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-07 17:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This is completely overengineered and definitely not an interface which
> should be made available to anything else than this particular MCE case.

Sorry for the overengineering. :)

--Andy

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 22/36] tracing: Provide lockdep less trace_hardirqs_on/off() variants
  2020-05-05 13:16 ` [patch V4 part 1 22/36] tracing: Provide lockdep less trace_hardirqs_on/off() variants Thomas Gleixner
@ 2020-05-07 17:55   ` Andy Lutomirski
  2020-05-07 18:52     ` Thomas Gleixner
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-07 17:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer
> core itself is safe, but the resulting tracepoints can be utilized by
> e.g. BPF which is unsafe.
>
> Provide variants which do not contain the lockdep invocation so the lockdep
> and tracer invocations can be split at the call site and placed properly.
>
> The new variants also do not use rcuidle as they are going to be called
> from entry code after/before context tracking.

I can't quite follow this.  Are you saying that the new variants are
intended to be called by the entry code in a context where tracing is
acceptable and that the lockdep part will still be called in a context
where tracing is not acceptable?

--Andy

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work
  2020-05-05 13:16 ` [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work Thomas Gleixner
@ 2020-05-07 18:02   ` Andy Lutomirski
  2020-05-08  8:48     ` Peter Zijlstra
  2020-05-14 14:16     ` Borislav Petkov
  2020-05-13 23:42   ` Mathieu Desnoyers
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-07 18:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Peter Zijlstra <peterz@infradead.org>
>
> Convert #MC over to using task_work_add(); it will run the same code
> slightly later, on the return to user path of the same exception.

I think this patch is correct, but I think it's only one small and not
that obviously wrong step away from being broken:

>         if ((m.cs & 3) == 3) {
>                 /* If this triggers there is no way to recover. Die hard. */
>                 BUG_ON(!on_thread_stack() || !user_mode(regs));
> -               local_irq_enable();
> -               preempt_enable();
>
> -               if (kill_it || do_memory_failure(&m))
> -                       force_sig(SIGBUS);
> -               preempt_disable();
> -               local_irq_disable();
> +               current->mce_addr = m.addr;
> +               current->mce_status = m.mcgstatus;
> +               current->mce_kill_me.func = kill_me_maybe;
> +               if (kill_it)
> +                       current->mce_kill_me.func = kill_me_now;
> +               task_work_add(current, &current->mce_kill_me, true);

This is fine if the source was CPL3, but it's not going to work if CPL
was 0.  We don't *currently* do this from CPL0, but people keep
wanting to.  So perhaps there should be a comment like:

/*
 * The #MC originated at CPL3, so we know that we will go execute the
 * task_work before returning to the offending user code.
 */

IOW, if we want to recover from CPL0 #MC, we will need a different mechanism.

I also confess a certain amount of sadness that my beautiful
haha-not-really-atomic-here mechanism isn't being used anymore. :(

--Andy

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter() Thomas Gleixner
@ 2020-05-07 18:04   ` Andy Lutomirski
  2020-05-07 18:17     ` Mathieu Desnoyers
  2020-05-14  0:12   ` Mathieu Desnoyers
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-07 18:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Peter Zijlstra <peterz@infradead.org>
>
> A few exceptions (like #DB and #BP) can happen at any location in the code,
> this then means that tracers should treat events from these exceptions as
> NMI-like. The interrupted context could be holding locks with interrupts
> disabled for instance.
>
> Similarly, #MC is an actual NMI-like exception.

Is it permissible to send a signal from inside nmi_enter()?  I imagine
so, but I just want to make sure.

--Andy

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches
  2020-05-05 13:16 [patch V4 part 1 00/36] x86/entry: Entry/exception code rework, preparatory patches Thomas Gleixner
                   ` (35 preceding siblings ...)
  2020-05-05 13:16 ` [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi() Thomas Gleixner
@ 2020-05-07 18:05 ` Andy Lutomirski
  36 siblings, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-07 18:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Folks!
>
> This is the hopefully final version of the rework of the entry and
> exception code to ensure that instrumentation cannot touch the fragile
> parts of the hardware induced entry and exception code trainwreck. It
> further ensures correctness vs. RCU and moves quite some code out of the
> assembly code into C.

Other than the minor comments I sent, part 1 all looks good to me.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter()
  2020-05-07 18:04   ` Andy Lutomirski
@ 2020-05-07 18:17     ` Mathieu Desnoyers
  2020-05-08  8:50       ` Peter Zijlstra
  0 siblings, 1 reply; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-07 18:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

----- On May 7, 2020, at 2:04 PM, Andy Lutomirski luto@kernel.org wrote:

> On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> A few exceptions (like #DB and #BP) can happen at any location in the code,
>> this then means that tracers should treat events from these exceptions as
>> NMI-like. The interrupted context could be holding locks with interrupts
>> disabled for instance.
>>
>> Similarly, #MC is an actual NMI-like exception.
> 
> Is it permissible to send a signal from inside nmi_enter()?  I imagine
> so, but I just want to make sure.

If you mean sending a proper signal, I would guess not.

I suspect you'll rather want to use "irq_work()" from NMI context to ensure
the rest of the work (e.g. sending a signal or a wakeup) is performed from
IRQ context very soon after the NMI, rather than from NMI context.

AFAIK this is how this is done today by perf, ftrace, ebpf, and lttng.
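
A minimal sketch of that deferral pattern, for illustration (handler and
work-item names here are hypothetical, not code from this series):

#include <linux/irq_work.h>
#include <linux/sched/signal.h>

/* Runs in IRQ context shortly after the NMI-like handler has returned. */
static void deferred_kill(struct irq_work *work)
{
	force_sig(SIGBUS);
}

static DEFINE_IRQ_WORK(deferred_kill_work, deferred_kill);

static void nmi_like_handler(void)
{
	/* NMI-safe: only queue the work, the signal is sent later. */
	irq_work_queue(&deferred_kill_work);
}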

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 22/36] tracing: Provide lockdep less trace_hardirqs_on/off() variants
  2020-05-07 17:55   ` Andy Lutomirski
@ 2020-05-07 18:52     ` Thomas Gleixner
  0 siblings, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-07 18:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

Andy Lutomirski <luto@kernel.org> writes:

> On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer
>> core itself is safe, but the resulting tracepoints can be utilized by
>> e.g. BPF which is unsafe.
>>
>> Provide variants which do not contain the lockdep invocation so the lockdep
>> and tracer invocations can be split at the call site and placed properly.
>>
>> The new variants also do not use rcuidle as they are going to be called
>> from entry code after/before context tracking.
>
> I can't quite follow this.  Are you saying that the new variants are
> intended to be called by the entry code in a context where tracing is
> acceptable and that the lockdep part will still be called in a context
> where tracing is not acceptable?

Yes. Before RCU is reestablished and after. I'll rephrase.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation
  2020-05-05 13:16 ` [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation Thomas Gleixner
  2020-05-05 16:04   ` Mark Rutland
@ 2020-05-07 23:41   ` Steven Rostedt
  2020-05-08  8:40     ` Peter Zijlstra
  2020-05-12 14:36   ` [tip: locking/kcsan] " tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 178+ messages in thread
From: Steven Rostedt @ 2020-05-07 23:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel),
	Mark Rutland

On Tue, 05 May 2020 15:16:09 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Currently instrumentation of atomic primitives is done at the
> architecture level, while composites or fallbacks are provided at the
> generic level.
> 
> The result is that there are no uninstrumented variants of the
> fallbacks. Since there is now need of such (see the next patch),

Just a comment on the change log. Can we avoid saying "see the next patch"?
A few years from now, if we stumble on changes in this commit and need to
see that next patch, if something happened to lore, it may be difficult to
find what that next patch was. But saying that patch's subject would be
just a simple search in the git history.

That said, looking at "the next patch" which is "x86/doublefault: Remove
memmove() call", does that patch really have a need for such?

-- Steve



> invert this ordering.
> 
> Doing this means moving the instrumentation into the generic code as
> well as having (for now) two variants of the fallbacks.
> 
> Notes:
> 
>  - the various *cond_read* primitives are not proper fallbacks
>    and got moved into linux/atomic.c. No arch_ variants are
>    generated because the base primitives smp_cond_load*()
>    are instrumented.
> 
>  - once all architectures are moved over to arch_atomic_ we can remove
>    one of the fallback variants and reclaim some 2300 lines.
> 
>  - atomic_{read,set}*() are no longer double-instrumented
> 
> Reported-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mark Rutland <mark.rutland@arm.com>
>

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 24/36] lockdep: Prepare for noinstr sections
  2020-05-05 13:16 ` [patch V4 part 1 24/36] lockdep: Prepare for noinstr sections Thomas Gleixner
@ 2020-05-08  0:23   ` Steven Rostedt
  2020-05-08  8:44     ` Peter Zijlstra
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 178+ messages in thread
From: Steven Rostedt @ 2020-05-08  0:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

On Tue, 05 May 2020 15:16:26 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3639,9 +3639,6 @@ static void __trace_hardirqs_on_caller(u
>  {
>  	struct task_struct *curr = current;
>  
> -	/* we'll do an OFF -> ON transition: */
> -	curr->hardirqs_enabled = 1;
> -
>  	/*
>  	 * We are going to turn hardirqs on, so set the
>  	 * usage bit for all held locks:
> @@ -3653,16 +3650,13 @@ static void __trace_hardirqs_on_caller(u
>  	 * bit for all held locks. (disabled hardirqs prevented
>  	 * this bit from being set before)
>  	 */
> -	if (curr->softirqs_enabled)
> +	if (curr->softirqs_enabled) {
>  		if (!mark_held_locks(curr, LOCK_ENABLED_SOFTIRQ))
>  			return;

The above can be turned into:

	if (curr->softirqs_enabled)
		mark_held_locks(curr, LOCK_ENABLED_SOFTIRQ);

No need for the condition or the return.


> -
> -	curr->hardirq_enable_ip = ip;
> -	curr->hardirq_enable_event = ++curr->irq_events;
> -	debug_atomic_inc(hardirqs_on_events);
> +	}
>  }
>  
> -void lockdep_hardirqs_on(unsigned long ip)
> +void lockdep_hardirqs_on_prepare(unsigned long ip)
>  {
>  	if (unlikely(!debug_locks || current->lockdep_recursion))
>  		return;
> @@ -3698,20 +3692,62 @@ void lockdep_hardirqs_on(unsigned long i
>  	if (DEBUG_LOCKS_WARN_ON(current->hardirq_context))
>  		return;
>  
> +	current->hardirq_chain_key = current->curr_chain_key;
> +
>  	current->lockdep_recursion++;
>  	__trace_hardirqs_on_caller(ip);
>  	lockdep_recursion_finish();
>  }
> -NOKPROBE_SYMBOL(lockdep_hardirqs_on);
> +EXPORT_SYMBOL_GPL(lockdep_hardirqs_on_prepare);
> +
> +void noinstr lockdep_hardirqs_on(unsigned long ip)
> +{

Would be nice to have some kerneldoc explaining the difference between
lockdep_hardirqs_on_prepare() and lockdep_hardirqs_on().

-- Steve

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 32/36] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
  2020-05-05 13:16 ` [patch V4 part 1 32/36] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception Thomas Gleixner
@ 2020-05-08  0:34   ` Steven Rostedt
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: Steven Rostedt @ 2020-05-08  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel),
	Rich Felker, Yoshinori Sato

On Tue, 05 May 2020 15:16:34 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(),
> remove it from the generic code and move it into the SuperH code.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>

Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

> ---
>  Documentation/trace/ftrace-design.rst |    8 --------
>  arch/sh/Kconfig                       |    1 -
>  arch/sh/kernel/traps.c                |   12 ++++++++++++
>  include/linux/ftrace_irq.h            |   11 -----------
>  kernel/trace/Kconfig                  |   10 ----------
>  5 files changed, 12 insertions(+), 30 deletions(-)
> 
>

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 33/36] x86,tracing: Robustify ftrace_nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 33/36] x86,tracing: Robustify ftrace_nmi_enter() Thomas Gleixner
@ 2020-05-08  6:19   ` Masami Hiramatsu
  0 siblings, 0 replies; 178+ messages in thread
From: Masami Hiramatsu @ 2020-05-08  6:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, 05 May 2020 15:16:35 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
>   ftrace_nmi_enter()
>      trace_hwlat_callback()
>        trace_clock_local()
>          sched_clock()
>            paravirt_sched_clock()
>            native_sched_clock()
> 
> All must not be traced or kprobed, it will be called from do_debug()
> before the kprobe handler.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you,


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> ---
>  arch/x86/include/asm/paravirt.h |    2 +-
>  arch/x86/kernel/tsc.c           |    4 ++--
>  include/linux/ftrace_irq.h      |    4 ++--
>  kernel/trace/trace_clock.c      |    3 ++-
>  kernel/trace/trace_hwlat.c      |    2 +-
>  5 files changed, 8 insertions(+), 7 deletions(-)
> 
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -17,7 +17,7 @@
>  #include <linux/cpumask.h>
>  #include <asm/frame.h>
>  
> -static inline unsigned long long paravirt_sched_clock(void)
> +static __always_inline unsigned long long paravirt_sched_clock(void)
>  {
>  	return PVOP_CALL0(unsigned long long, time.sched_clock);
>  }
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -207,7 +207,7 @@ static void __init cyc2ns_init_secondary
>  /*
>   * Scheduler clock - returns current time in nanosec units.
>   */
> -u64 native_sched_clock(void)
> +noinstr u64 native_sched_clock(void)
>  {
>  	if (static_branch_likely(&__use_tsc)) {
>  		u64 tsc_now = rdtsc();
> @@ -240,7 +240,7 @@ u64 native_sched_clock_from_tsc(u64 tsc)
>  /* We need to define a real function for sched_clock, to override the
>     weak default version */
>  #ifdef CONFIG_PARAVIRT
> -unsigned long long sched_clock(void)
> +noinstr unsigned long long sched_clock(void)
>  {
>  	return paravirt_sched_clock();
>  }
> --- a/include/linux/ftrace_irq.h
> +++ b/include/linux/ftrace_irq.h
> @@ -7,7 +7,7 @@ extern bool trace_hwlat_callback_enabled
>  extern void trace_hwlat_callback(bool enter);
>  #endif
>  
> -static inline void ftrace_nmi_enter(void)
> +static __always_inline void ftrace_nmi_enter(void)
>  {
>  #ifdef CONFIG_HWLAT_TRACER
>  	if (trace_hwlat_callback_enabled)
> @@ -15,7 +15,7 @@ static inline void ftrace_nmi_enter(void
>  #endif
>  }
>  
> -static inline void ftrace_nmi_exit(void)
> +static __always_inline void ftrace_nmi_exit(void)
>  {
>  #ifdef CONFIG_HWLAT_TRACER
>  	if (trace_hwlat_callback_enabled)
> --- a/kernel/trace/trace_clock.c
> +++ b/kernel/trace/trace_clock.c
> @@ -22,6 +22,7 @@
>  #include <linux/sched/clock.h>
>  #include <linux/ktime.h>
>  #include <linux/trace_clock.h>
> +#include <linux/kprobes.h>
>  
>  /*
>   * trace_clock_local(): the simplest and least coherent tracing clock.
> @@ -29,7 +30,7 @@
>   * Useful for tracing that does not cross to other CPUs nor
>   * does it go through idle events.
>   */
> -u64 notrace trace_clock_local(void)
> +u64 noinstr trace_clock_local(void)
>  {
>  	u64 clock;
>  
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -139,7 +139,7 @@ static void trace_hwlat_sample(struct hw
>  #define init_time(a, b)	(a = b)
>  #define time_u64(a)	a
>  
> -void trace_hwlat_callback(bool enter)
> +noinstr void trace_hwlat_callback(bool enter)
>  {
>  	if (smp_processor_id() != nmi_cpu)
>  		return;
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 21/36] kprobes: Prevent probes in .noinstr.text section
  2020-05-05 13:16 ` [patch V4 part 1 21/36] kprobes: Prevent probes in .noinstr.text section Thomas Gleixner
@ 2020-05-08  6:30   ` Masami Hiramatsu
  2020-05-19 19:52   ` [tip: core/kprobes] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: Masami Hiramatsu @ 2020-05-08  6:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon

On Tue, 05 May 2020 15:16:23 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Instrumentation is forbidden in the .noinstr.text section. Make kprobes
> respect this.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

This looks good to me.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you!


> ---
>  include/linux/module.h |    2 ++
>  kernel/kprobes.c       |   18 ++++++++++++++++++
>  kernel/module.c        |    3 +++
>  3 files changed, 23 insertions(+)
> 
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -458,6 +458,8 @@ struct module {
>  	void __percpu *percpu;
>  	unsigned int percpu_size;
>  #endif
> +	void *noinstr_text_start;
> +	unsigned int noinstr_text_size;
>  
>  #ifdef CONFIG_TRACEPOINTS
>  	unsigned int num_tracepoints;
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -2229,6 +2229,12 @@ static int __init populate_kprobe_blackl
>  	/* Symbols in __kprobes_text are blacklisted */
>  	ret = kprobe_add_area_blacklist((unsigned long)__kprobes_text_start,
>  					(unsigned long)__kprobes_text_end);
> +	if (ret)
> +		return ret;
> +
> +	/* Symbols in noinstr section are blacklisted */
> +	ret = kprobe_add_area_blacklist((unsigned long)__noinstr_text_start,
> +					(unsigned long)__noinstr_text_end);
>  
>  	return ret ? : arch_populate_kprobe_blacklist();
>  }
> @@ -2248,6 +2254,12 @@ static void add_module_kprobe_blacklist(
>  		end = start + mod->kprobes_text_size;
>  		kprobe_add_area_blacklist(start, end);
>  	}
> +
> +	start = (unsigned long)mod->noinstr_text_start;
> +	if (start) {
> +		end = start + mod->noinstr_text_size;
> +		kprobe_add_area_blacklist(start, end);
> +	}
>  }
>  
>  static void remove_module_kprobe_blacklist(struct module *mod)
> @@ -2265,6 +2277,12 @@ static void remove_module_kprobe_blackli
>  		end = start + mod->kprobes_text_size;
>  		kprobe_remove_area_blacklist(start, end);
>  	}
> +
> +	start = (unsigned long)mod->noinstr_text_start;
> +	if (start) {
> +		end = start + mod->noinstr_text_size;
> +		kprobe_remove_area_blacklist(start, end);
> +	}
>  }
>  
>  /* Module notifier call back, checking kprobes on the module */
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3150,6 +3150,9 @@ static int find_module_sections(struct m
>  	}
>  #endif
>  
> +	mod->noinstr_text_start = section_objs(info, ".noinstr.text", 1,
> +						&mod->noinstr_text_size);
> +
>  #ifdef CONFIG_TRACEPOINTS
>  	mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
>  					     sizeof(*mod->tracepoints_ptrs),
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation
  2020-05-07 23:41   ` Steven Rostedt
@ 2020-05-08  8:40     ` Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-08  8:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Boris Ostrovsky, Juergen Gross, Brian Gerst,
	Mathieu Desnoyers, Josh Poimboeuf, Will Deacon, Mark Rutland

On Thu, May 07, 2020 at 07:41:27PM -0400, Steven Rostedt wrote:
> On Tue, 05 May 2020 15:16:09 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > Currently instrumentation of atomic primitives is done at the
> > architecture level, while composites or fallbacks are provided at the
> > generic level.
> > 
> > The result is that there are no uninstrumented variants of the
> > fallbacks. Since there is now need of such (see the next patch),
> 
> Just a comment on the change log. Can we avoid saying "see the next patch"?
> A few years from now, if we stumble on changes in this commit and need to
> see that next patch, if something happened to lore, it may be difficult to
> find what that next patch was.

Even I can get git-log to tell me what the next patch is, and I'm an
absolute disaster with git. But yes, valid point.

> But saying that patch's subject would be
> just a simple search in the git history.
> 
> That said, looking at "the next patch" which is "x86/doublefault: Remove
> memmove() call", does that patch really have a need for such?

No, the next patch was part4-2, so it already isn't accurate.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 24/36] lockdep: Prepare for noinstr sections
  2020-05-08  0:23   ` Steven Rostedt
@ 2020-05-08  8:44     ` Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-08  8:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Boris Ostrovsky, Juergen Gross, Brian Gerst,
	Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Thu, May 07, 2020 at 08:23:39PM -0400, Steven Rostedt wrote:

> > +void lockdep_hardirqs_on_prepare(unsigned long ip)

> > +void noinstr lockdep_hardirqs_on(unsigned long ip)

> Would be nice to have some kerneldoc explaining the difference between
> lockdep_hardirqs_on_prepare() and lockdep_hardirqs_on().

As the naming suggests, it's not a matter of difference but of order:
first you call on_prepare(), then you call on(). The function signatures
quoted above give a clue why.
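
A minimal sketch of that order on a simplified irq-enable path
(illustration only, not a hunk from the patch):

	lockdep_hardirqs_on_prepare(ip);  /* bookkeeping, may still be instrumented */
	/* ... tracer notification, context tracking, etc. ... */
	lockdep_hardirqs_on(ip);          /* noinstr, flips the final hardirq state */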


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work
  2020-05-07 18:02   ` Andy Lutomirski
@ 2020-05-08  8:48     ` Peter Zijlstra
  2020-05-08 21:30       ` Andy Lutomirski
  2020-05-14 14:16     ` Borislav Petkov
  1 sibling, 1 reply; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-08  8:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Thu, May 07, 2020 at 11:02:09AM -0700, Andy Lutomirski wrote:
> On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Convert #MC over to using task_work_add(); it will run the same code
> > slightly later, on the return to user path of the same exception.
> 
> I think this patch is correct, but I think it's only one small and not
> that obviously wrong step away from being broken:
> 
> >         if ((m.cs & 3) == 3) {
> >                 /* If this triggers there is no way to recover. Die hard. */
> >                 BUG_ON(!on_thread_stack() || !user_mode(regs));
> > -               local_irq_enable();
> > -               preempt_enable();
> >
> > -               if (kill_it || do_memory_failure(&m))
> > -                       force_sig(SIGBUS);
> > -               preempt_disable();
> > -               local_irq_disable();
> > +               current->mce_addr = m.addr;
> > +               current->mce_status = m.mcgstatus;
> > +               current->mce_kill_me.func = kill_me_maybe;
> > +               if (kill_it)
> > +                       current->mce_kill_me.func = kill_me_now;
> > +               task_work_add(current, &current->mce_kill_me, true);
> 
> This is fine if the source was CPL3, but it's not going to work if CPL
> was 0.  We don't *currently* do this from CPL0, but people keep
> wanting to.  So perhaps there should be a comment like:
> 
> /*
>  * The #MC originated at CPL3, so we know that we will go execute the
>  * task_work before returning to the offending user code.
>  */
> 
> IOW, if we want to recover from CPL0 #MC, we will need a different mechanism.

See part4-18's IDTRENTRY_NOIST. That will get us a clear CPL3/CPL0
separation.

> I also confess a certain amount of sadness that my beautiful
> haha-not-really-atomic-here mechanism isn't being used anymore. :(

I think we have a subtly different interpretation of 'beautiful' here.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter()
  2020-05-07 18:17     ` Mathieu Desnoyers
@ 2020-05-08  8:50       ` Peter Zijlstra
  2020-05-08 17:12         ` Josh Poimboeuf
  0 siblings, 1 reply; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-08  8:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andy Lutomirski, Thomas Gleixner, linux-kernel, x86, paulmck,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek, rostedt,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon

On Thu, May 07, 2020 at 02:17:58PM -0400, Mathieu Desnoyers wrote:
> ----- On May 7, 2020, at 2:04 PM, Andy Lutomirski luto@kernel.org wrote:
> 
> > On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> From: Peter Zijlstra <peterz@infradead.org>
> >>
> >> A few exceptions (like #DB and #BP) can happen at any location in the code,
> >> this then means that tracers should treat events from these exceptions as
> >> NMI-like. The interrupted context could be holding locks with interrupts
> >> disabled for instance.
> >>
> >> Similarly, #MC is an actual NMI-like exception.
> > 
> > Is it permissible to send a signal from inside nmi_enter()?  I imagine
> > so, but I just want to make sure.
> 
> If you mean sending a proper signal, I would guess not.
> 
> I suspect you'll rather want to use "irq_work()" from NMI context to ensure
> the rest of the work (e.g. sending a signal or a wakeup) is performed from
> IRQ context very soon after the NMI, rather than from NMI context.
> 
> AFAIK this is how this is done today by perf, ftrace, ebpf, and lttng.

What Mathieu says. But I suspect you want to keep reading until
part4-18. That should get you what you really want.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter()
  2020-05-08  8:50       ` Peter Zijlstra
@ 2020-05-08 17:12         ` Josh Poimboeuf
  0 siblings, 0 replies; 178+ messages in thread
From: Josh Poimboeuf @ 2020-05-08 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Andy Lutomirski, Thomas Gleixner,
	linux-kernel, x86, paulmck, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Will Deacon

On Fri, May 08, 2020 at 10:50:02AM +0200, Peter Zijlstra wrote:
> On Thu, May 07, 2020 at 02:17:58PM -0400, Mathieu Desnoyers wrote:
> > ----- On May 7, 2020, at 2:04 PM, Andy Lutomirski luto@kernel.org wrote:
> > 
> > > On Tue, May 5, 2020 at 7:14 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >>
> > >> From: Peter Zijlstra <peterz@infradead.org>
> > >>
> > >> A few exceptions (like #DB and #BP) can happen at any location in the code,
> > >> this then means that tracers should treat events from these exceptions as
> > >> NMI-like. The interrupted context could be holding locks with interrupts
> > >> disabled for instance.
> > >>
> > >> Similarly, #MC is an actual NMI-like exception.
> > > 
> > > Is it permissible to send a signal from inside nmi_enter()?  I imagine
> > > so, but I just want to make sure.
> > 
> > If you mean sending a proper signal, I would guess not.
> > 
> > I suspect you'll rather want to use "irq_work()" from NMI context to ensure
> > the rest of the work (e.g. sending a signal or a wakeup) is performed from
> > IRQ context very soon after the NMI, rather than from NMI context.
> > 
> > AFAIK this is how this is done today by perf, ftrace, ebpf, and lttng.
> 
> What Mathieu says. But I suspect you want to keep reading until
> part4-18. That should get you what you really want.

LALALALA

At least give a spoiler alert for those of us still enjoying part 1!

-- 
Josh


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work
  2020-05-08  8:48     ` Peter Zijlstra
@ 2020-05-08 21:30       ` Andy Lutomirski
  0 siblings, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-08 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon

On Fri, May 8, 2020 at 1:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 07, 2020 at 11:02:09AM -0700, Andy Lutomirski wrote:
> > On Tue, May 5, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > From: Peter Zijlstra <peterz@infradead.org>
> > >
> > > Convert #MC over to using task_work_add(); it will run the same code
> > > slightly later, on the return to user path of the same exception.
> >
> > I think this patch is correct, but I think it's only one small and not
> > that obviously wrong step away from being broken:
> >
> > >         if ((m.cs & 3) == 3) {
> > >                 /* If this triggers there is no way to recover. Die hard. */
> > >                 BUG_ON(!on_thread_stack() || !user_mode(regs));
> > > -               local_irq_enable();
> > > -               preempt_enable();
> > >
> > > -               if (kill_it || do_memory_failure(&m))
> > > -                       force_sig(SIGBUS);
> > > -               preempt_disable();
> > > -               local_irq_disable();
> > > +               current->mce_addr = m.addr;
> > > +               current->mce_status = m.mcgstatus;
> > > +               current->mce_kill_me.func = kill_me_maybe;
> > > +               if (kill_it)
> > > +                       current->mce_kill_me.func = kill_me_now;
> > > +               task_work_add(current, &current->mce_kill_me, true);
> >
> > This is fine if the source was CPL3, but it's not going to work if CPL
> > was 0.  We don't *currently* do this from CPL0, but people keep
> > wanting to.  So perhaps there should be a comment like:
> >
> > /*
> >  * The #MC originated at CPL3, so we know that we will go execute the
> >  * task_work before returning to the offending user code.
> >  */
> >
> > IOW, if we want to recover from CPL0 #MC, we will need a different mechanism.
>
> See part4-18's IDTRENTRY_NOIST. That will get us a clear CPL3/CPL0
> separation.

I will hold my breath.

>
> > I also confess a certain amount of sadness that my beautiful
> > haha-not-really-atomic-here mechanism isn't being used anymore. :(
>
> I think we have a subtly different interpretation of 'beautiful' here.

Beauty is in the eye of the beholder.  And sometimes in the eye of the
person who wrote the code :)

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area
  2020-05-05 13:16 ` [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area Thomas Gleixner
  2020-05-06  8:14   ` Borislav Petkov
  2020-05-06 12:11   ` Alexandre Chartre
@ 2020-05-09  9:00   ` Lai Jiangshan
  2020-05-09  9:23   ` Lai Jiangshan
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Andy Lutomirski
  4 siblings, 0 replies; 178+ messages in thread
From: Lai Jiangshan @ 2020-05-09  9:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 10:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Andy Lutomirski <luto@kernel.org>
>
> A data breakpoint near the top of an IST stack will cause unrecoverable
> recursion.  A data breakpoint on the GDT, IDT, or TSS is terrifying.
> Prevent either of these from happening.
>
> Co-developed-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/hw_breakpoint.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -227,10 +227,35 @@ int arch_check_bp_in_kernelspace(struct
>         return (va >= TASK_SIZE_MAX) || ((va + len - 1) >= TASK_SIZE_MAX);
>  }
>
> +/*
> + * Checks whether the range from addr to end, inclusive, overlaps the CPU
> + * entry area range.
> + */
> +static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
> +{
> +       return end >= CPU_ENTRY_AREA_PER_CPU &&
> +              addr < (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOTAL_SIZE);
> +}

Hello

These two lines need either
s/CPU_ENTRY_AREA_PER_CPU/CPU_ENTRY_AREA_BASE/g
or
s/CPU_ENTRY_AREA_PER_CPU/CPU_ENTRY_AREA_RO_IDT/g

otherwise the RO_IDT page is not protected.

See:
#define CPU_ENTRY_AREA_PER_CPU (CPU_ENTRY_AREA_RO_IDT + PAGE_SIZE)

#define CPU_ENTRY_AREA_TOTAL_SIZE (CPU_ENTRY_AREA_ARRAY_SIZE + PAGE_SIZE)
                                     ^ sizeof PER_CPU           ^ RO_IDT
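
For illustration, with the first substitution applied the check would cover
the whole area including the RO_IDT page (sketch only, not the applied patch):

static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
{
	return end >= CPU_ENTRY_AREA_BASE &&
	       addr < (CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_TOTAL_SIZE);
}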


Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>


> +
>  static int arch_build_bp_info(struct perf_event *bp,
>                               const struct perf_event_attr *attr,
>                               struct arch_hw_breakpoint *hw)
>  {
> +       unsigned long bp_end;
> +
> +       bp_end = attr->bp_addr + attr->bp_len - 1;
> +       if (bp_end < attr->bp_addr)
> +               return -EINVAL;
> +
> +       /*
> +        * Prevent any breakpoint of any type that overlaps the
> +        * cpu_entry_area.  This protects the IST stacks and also
> +        * reduces the chance that we ever find out what happens if
> +        * there's a data breakpoint on the GDT, IDT, or TSS.
> +        */
> +       if (within_cpu_entry_area(attr->bp_addr, bp_end))
> +               return -EINVAL;
> +
>         hw->address = attr->bp_addr;
>         hw->mask = 0;
>
>

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area
  2020-05-05 13:16 ` [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area Thomas Gleixner
                     ` (2 preceding siblings ...)
  2020-05-09  9:00   ` Lai Jiangshan
@ 2020-05-09  9:23   ` Lai Jiangshan
  2020-05-09 19:08     ` Andy Lutomirski
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Andy Lutomirski
  4 siblings, 1 reply; 178+ messages in thread
From: Lai Jiangshan @ 2020-05-09  9:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel)

On Tue, May 5, 2020 at 10:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Andy Lutomirski <luto@kernel.org>
>
> > A data breakpoint near the top of an IST stack will cause unrecoverable
> recursion.  A data breakpoint on the GDT, IDT, or TSS is terrifying.
> Prevent either of these from happening.
>

What happens when a data breakpoint hits the direct GDT (load_direct_gdt())
or the debug IDT (load_debug_idt()), which are not considered in this patch?

Thanks
Lai

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 02/36] x86/hw_breakpoint: Prevent data breakpoints on cpu_entry_area
  2020-05-09  9:23   ` Lai Jiangshan
@ 2020-05-09 19:08     ` Andy Lutomirski
  0 siblings, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-09 19:08 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Thomas Gleixner, LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Sat, May 9, 2020 at 2:23 AM Lai Jiangshan
<jiangshanlai+lkml@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 10:15 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > From: Andy Lutomirski <luto@kernel.org>
> >
> > A data breakpoint near the top of an IST stack will cause unrecoverable
> > recursion.  A data breakpoint on the GDT, IDT, or TSS is terrifying.
> > Prevent either of these from happening.
> >
>
> What happens when a data breakpoint hits the direct GDT (load_direct_gdt())
> or the debug IDT (load_debug_idt()), which are not considered in this patch?
>

I have no idea, and learning the answer may involve talking to the
respective CPU vendors' microcode engineers.  We should probably block
those, too.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: locking/kcsan] locking/atomics: Flip fallbacks and instrumentation
  2020-05-05 13:16 ` [patch V4 part 1 07/36] locking/atomics: Flip fallbacks and instrumentation Thomas Gleixner
  2020-05-05 16:04   ` Mark Rutland
  2020-05-07 23:41   ` Steven Rostedt
@ 2020-05-12 14:36   ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-12 14:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mark Rutland, x86, LKML

The following commit has been merged into the locking/kcsan branch of tip:

Commit-ID:     6bcc8f459fe790f35dfd8e3bb0f43e530d44ee9a
Gitweb:        https://git.kernel.org/tip/6bcc8f459fe790f35dfd8e3bb0f43e530d44ee9a
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 24 Jan 2020 22:13:03 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 16:35:06 +02:00

locking/atomics: Flip fallbacks and instrumentation

Currently instrumentation of atomic primitives is done at the architecture
level, while composites or fallbacks are provided at the generic level.

The result is that there are no uninstrumented variants of the
fallbacks. Since there is now need of such variants to isolate text poke
from any form of instrumentation, invert this ordering.

Doing this means moving the instrumentation into the generic code as
well as having (for now) two variants of the fallbacks.

Notes:

 - the various *cond_read* primitives are not proper fallbacks
   and got moved into linux/atomic.c. No arch_ variants are
   generated because the base primitives smp_cond_load*()
   are instrumented.

 - once all architectures are moved over to arch_atomic_ one of the
   fallback variants can be removed and some 2300 lines reclaimed.

 - atomic_{read,set}*() are no longer double-instrumented
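
A rough illustration of the resulting layering (simplified sketch; the real
wrappers are generated by scripts/atomic/ and the exact instrumentation
helper may differ):

static __always_inline int atomic_add_return(int i, atomic_t *v)
{
	kasan_check_write(v, sizeof(*v));	/* instrumentation at the generic level */
	return arch_atomic_add_return(i, v);	/* uninstrumented arch/fallback primitive */
}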

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20200505134058.769149955@linutronix.de
---
 arch/arm64/include/asm/atomic.h              |    6 +-
 arch/x86/include/asm/atomic.h                |   17 +-
 arch/x86/include/asm/atomic64_32.h           |    9 +-
 arch/x86/include/asm/atomic64_64.h           |   15 +-
 include/linux/atomic-arch-fallback.h         | 2291 +++++++++++++++++-
 include/linux/atomic-fallback.h              |    8 +-
 include/linux/atomic.h                       |   11 +-
 scripts/atomic/fallbacks/acquire             |    4 +-
 scripts/atomic/fallbacks/add_negative        |    6 +-
 scripts/atomic/fallbacks/add_unless          |    6 +-
 scripts/atomic/fallbacks/andnot              |    4 +-
 scripts/atomic/fallbacks/dec                 |    4 +-
 scripts/atomic/fallbacks/dec_and_test        |    6 +-
 scripts/atomic/fallbacks/dec_if_positive     |    6 +-
 scripts/atomic/fallbacks/dec_unless_positive |    6 +-
 scripts/atomic/fallbacks/fence               |    4 +-
 scripts/atomic/fallbacks/fetch_add_unless    |    8 +-
 scripts/atomic/fallbacks/inc                 |    4 +-
 scripts/atomic/fallbacks/inc_and_test        |    6 +-
 scripts/atomic/fallbacks/inc_not_zero        |    6 +-
 scripts/atomic/fallbacks/inc_unless_negative |    6 +-
 scripts/atomic/fallbacks/read_acquire        |    2 +-
 scripts/atomic/fallbacks/release             |    4 +-
 scripts/atomic/fallbacks/set_release         |    2 +-
 scripts/atomic/fallbacks/sub_and_test        |    6 +-
 scripts/atomic/fallbacks/try_cmpxchg         |    4 +-
 scripts/atomic/gen-atomic-fallback.sh        |   29 +-
 scripts/atomic/gen-atomics.sh                |    5 +-
 28 files changed, 2403 insertions(+), 82 deletions(-)
 create mode 100644 include/linux/atomic-arch-fallback.h

diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index 9543b5e..a08890d 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -101,8 +101,8 @@ static inline long arch_atomic64_dec_if_positive(atomic64_t *v)
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-#define arch_atomic_read(v)			READ_ONCE((v)->counter)
-#define arch_atomic_set(v, i)			WRITE_ONCE(((v)->counter), (i))
+#define arch_atomic_read(v)			__READ_ONCE((v)->counter)
+#define arch_atomic_set(v, i)			__WRITE_ONCE(((v)->counter), (i))
 
 #define arch_atomic_add_return_relaxed		arch_atomic_add_return_relaxed
 #define arch_atomic_add_return_acquire		arch_atomic_add_return_acquire
@@ -225,6 +225,6 @@ static inline long arch_atomic64_dec_if_positive(atomic64_t *v)
 
 #define arch_atomic64_dec_if_positive		arch_atomic64_dec_if_positive
 
-#include <asm-generic/atomic-instrumented.h>
+#define ARCH_ATOMIC
 
 #endif /* __ASM_ATOMIC_H */
diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
index 115127c..a9ae588 100644
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -28,7 +28,7 @@ static __always_inline int arch_atomic_read(const atomic_t *v)
 	 * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
 	 * it's non-inlined function that increases binary size and stack usage.
 	 */
-	return READ_ONCE((v)->counter);
+	return __READ_ONCE((v)->counter);
 }
 
 /**
@@ -40,7 +40,7 @@ static __always_inline int arch_atomic_read(const atomic_t *v)
  */
 static __always_inline void arch_atomic_set(atomic_t *v, int i)
 {
-	WRITE_ONCE(v->counter, i);
+	__WRITE_ONCE(v->counter, i);
 }
 
 /**
@@ -166,6 +166,7 @@ static __always_inline int arch_atomic_add_return(int i, atomic_t *v)
 {
 	return i + xadd(&v->counter, i);
 }
+#define arch_atomic_add_return arch_atomic_add_return
 
 /**
  * arch_atomic_sub_return - subtract integer and return
@@ -178,32 +179,37 @@ static __always_inline int arch_atomic_sub_return(int i, atomic_t *v)
 {
 	return arch_atomic_add_return(-i, v);
 }
+#define arch_atomic_sub_return arch_atomic_sub_return
 
 static __always_inline int arch_atomic_fetch_add(int i, atomic_t *v)
 {
 	return xadd(&v->counter, i);
 }
+#define arch_atomic_fetch_add arch_atomic_fetch_add
 
 static __always_inline int arch_atomic_fetch_sub(int i, atomic_t *v)
 {
 	return xadd(&v->counter, -i);
 }
+#define arch_atomic_fetch_sub arch_atomic_fetch_sub
 
 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
 {
 	return arch_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
 
-#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
 static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
 {
 	return try_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
 
 static inline int arch_atomic_xchg(atomic_t *v, int new)
 {
 	return arch_xchg(&v->counter, new);
 }
+#define arch_atomic_xchg arch_atomic_xchg
 
 static inline void arch_atomic_and(int i, atomic_t *v)
 {
@@ -221,6 +227,7 @@ static inline int arch_atomic_fetch_and(int i, atomic_t *v)
 
 	return val;
 }
+#define arch_atomic_fetch_and arch_atomic_fetch_and
 
 static inline void arch_atomic_or(int i, atomic_t *v)
 {
@@ -238,6 +245,7 @@ static inline int arch_atomic_fetch_or(int i, atomic_t *v)
 
 	return val;
 }
+#define arch_atomic_fetch_or arch_atomic_fetch_or
 
 static inline void arch_atomic_xor(int i, atomic_t *v)
 {
@@ -255,6 +263,7 @@ static inline int arch_atomic_fetch_xor(int i, atomic_t *v)
 
 	return val;
 }
+#define arch_atomic_fetch_xor arch_atomic_fetch_xor
 
 #ifdef CONFIG_X86_32
 # include <asm/atomic64_32.h>
@@ -262,6 +271,6 @@ static inline int arch_atomic_fetch_xor(int i, atomic_t *v)
 # include <asm/atomic64_64.h>
 #endif
 
-#include <asm-generic/atomic-instrumented.h>
+#define ARCH_ATOMIC
 
 #endif /* _ASM_X86_ATOMIC_H */
diff --git a/arch/x86/include/asm/atomic64_32.h b/arch/x86/include/asm/atomic64_32.h
index 52cfaec..5efd01b 100644
--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -75,6 +75,7 @@ static inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
 {
 	return arch_cmpxchg64(&v->counter, o, n);
 }
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
 
 /**
  * arch_atomic64_xchg - xchg atomic64 variable
@@ -94,6 +95,7 @@ static inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
 			     : "memory");
 	return o;
 }
+#define arch_atomic64_xchg arch_atomic64_xchg
 
 /**
  * arch_atomic64_set - set atomic64 variable
@@ -138,6 +140,7 @@ static inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 			     ASM_NO_INPUT_CLOBBER("memory"));
 	return i;
 }
+#define arch_atomic64_add_return arch_atomic64_add_return
 
 /*
  * Other variants with different arithmetic operators:
@@ -149,6 +152,7 @@ static inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 			     ASM_NO_INPUT_CLOBBER("memory"));
 	return i;
 }
+#define arch_atomic64_sub_return arch_atomic64_sub_return
 
 static inline s64 arch_atomic64_inc_return(atomic64_t *v)
 {
@@ -242,6 +246,7 @@ static inline int arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
 			     "S" (v) : "memory");
 	return (int)a;
 }
+#define arch_atomic64_add_unless arch_atomic64_add_unless
 
 static inline int arch_atomic64_inc_not_zero(atomic64_t *v)
 {
@@ -281,6 +286,7 @@ static inline s64 arch_atomic64_fetch_and(s64 i, atomic64_t *v)
 
 	return old;
 }
+#define arch_atomic64_fetch_and arch_atomic64_fetch_and
 
 static inline void arch_atomic64_or(s64 i, atomic64_t *v)
 {
@@ -299,6 +305,7 @@ static inline s64 arch_atomic64_fetch_or(s64 i, atomic64_t *v)
 
 	return old;
 }
+#define arch_atomic64_fetch_or arch_atomic64_fetch_or
 
 static inline void arch_atomic64_xor(s64 i, atomic64_t *v)
 {
@@ -317,6 +324,7 @@ static inline s64 arch_atomic64_fetch_xor(s64 i, atomic64_t *v)
 
 	return old;
 }
+#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
 
 static inline s64 arch_atomic64_fetch_add(s64 i, atomic64_t *v)
 {
@@ -327,6 +335,7 @@ static inline s64 arch_atomic64_fetch_add(s64 i, atomic64_t *v)
 
 	return old;
 }
+#define arch_atomic64_fetch_add arch_atomic64_fetch_add
 
 #define arch_atomic64_fetch_sub(i, v)	arch_atomic64_fetch_add(-(i), (v))
 
diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
index 95c6cea..809bd01 100644
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -19,7 +19,7 @@
  */
 static inline s64 arch_atomic64_read(const atomic64_t *v)
 {
-	return READ_ONCE((v)->counter);
+	return __READ_ONCE((v)->counter);
 }
 
 /**
@@ -31,7 +31,7 @@ static inline s64 arch_atomic64_read(const atomic64_t *v)
  */
 static inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
-	WRITE_ONCE(v->counter, i);
+	__WRITE_ONCE(v->counter, i);
 }
 
 /**
@@ -159,37 +159,43 @@ static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 {
 	return i + xadd(&v->counter, i);
 }
+#define arch_atomic64_add_return arch_atomic64_add_return
 
 static inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 {
 	return arch_atomic64_add_return(-i, v);
 }
+#define arch_atomic64_sub_return arch_atomic64_sub_return
 
 static inline s64 arch_atomic64_fetch_add(s64 i, atomic64_t *v)
 {
 	return xadd(&v->counter, i);
 }
+#define arch_atomic64_fetch_add arch_atomic64_fetch_add
 
 static inline s64 arch_atomic64_fetch_sub(s64 i, atomic64_t *v)
 {
 	return xadd(&v->counter, -i);
 }
+#define arch_atomic64_fetch_sub arch_atomic64_fetch_sub
 
 static inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
 {
 	return arch_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
 
-#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
 static __always_inline bool arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
 {
 	return try_cmpxchg(&v->counter, old, new);
 }
+#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
 
 static inline s64 arch_atomic64_xchg(atomic64_t *v, s64 new)
 {
 	return arch_xchg(&v->counter, new);
 }
+#define arch_atomic64_xchg arch_atomic64_xchg
 
 static inline void arch_atomic64_and(s64 i, atomic64_t *v)
 {
@@ -207,6 +213,7 @@ static inline s64 arch_atomic64_fetch_and(s64 i, atomic64_t *v)
 	} while (!arch_atomic64_try_cmpxchg(v, &val, val & i));
 	return val;
 }
+#define arch_atomic64_fetch_and arch_atomic64_fetch_and
 
 static inline void arch_atomic64_or(s64 i, atomic64_t *v)
 {
@@ -224,6 +231,7 @@ static inline s64 arch_atomic64_fetch_or(s64 i, atomic64_t *v)
 	} while (!arch_atomic64_try_cmpxchg(v, &val, val | i));
 	return val;
 }
+#define arch_atomic64_fetch_or arch_atomic64_fetch_or
 
 static inline void arch_atomic64_xor(s64 i, atomic64_t *v)
 {
@@ -241,5 +249,6 @@ static inline s64 arch_atomic64_fetch_xor(s64 i, atomic64_t *v)
 	} while (!arch_atomic64_try_cmpxchg(v, &val, val ^ i));
 	return val;
 }
+#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
 
 #endif /* _ASM_X86_ATOMIC64_64_H */
diff --git a/include/linux/atomic-arch-fallback.h b/include/linux/atomic-arch-fallback.h
new file mode 100644
index 0000000..bcb6aa2
--- /dev/null
+++ b/include/linux/atomic-arch-fallback.h
@@ -0,0 +1,2291 @@
+// SPDX-License-Identifier: GPL-2.0
+
+// Generated by scripts/atomic/gen-atomic-fallback.sh
+// DO NOT MODIFY THIS FILE DIRECTLY
+
+#ifndef _LINUX_ATOMIC_FALLBACK_H
+#define _LINUX_ATOMIC_FALLBACK_H
+
+#include <linux/compiler.h>
+
+#ifndef arch_xchg_relaxed
+#define arch_xchg_relaxed		arch_xchg
+#define arch_xchg_acquire		arch_xchg
+#define arch_xchg_release		arch_xchg
+#else /* arch_xchg_relaxed */
+
+#ifndef arch_xchg_acquire
+#define arch_xchg_acquire(...) \
+	__atomic_op_acquire(arch_xchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_xchg_release
+#define arch_xchg_release(...) \
+	__atomic_op_release(arch_xchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_xchg
+#define arch_xchg(...) \
+	__atomic_op_fence(arch_xchg, __VA_ARGS__)
+#endif
+
+#endif /* arch_xchg_relaxed */
+
+#ifndef arch_cmpxchg_relaxed
+#define arch_cmpxchg_relaxed		arch_cmpxchg
+#define arch_cmpxchg_acquire		arch_cmpxchg
+#define arch_cmpxchg_release		arch_cmpxchg
+#else /* arch_cmpxchg_relaxed */
+
+#ifndef arch_cmpxchg_acquire
+#define arch_cmpxchg_acquire(...) \
+	__atomic_op_acquire(arch_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg_release
+#define arch_cmpxchg_release(...) \
+	__atomic_op_release(arch_cmpxchg, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg
+#define arch_cmpxchg(...) \
+	__atomic_op_fence(arch_cmpxchg, __VA_ARGS__)
+#endif
+
+#endif /* arch_cmpxchg_relaxed */
+
+#ifndef arch_cmpxchg64_relaxed
+#define arch_cmpxchg64_relaxed		arch_cmpxchg64
+#define arch_cmpxchg64_acquire		arch_cmpxchg64
+#define arch_cmpxchg64_release		arch_cmpxchg64
+#else /* arch_cmpxchg64_relaxed */
+
+#ifndef arch_cmpxchg64_acquire
+#define arch_cmpxchg64_acquire(...) \
+	__atomic_op_acquire(arch_cmpxchg64, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg64_release
+#define arch_cmpxchg64_release(...) \
+	__atomic_op_release(arch_cmpxchg64, __VA_ARGS__)
+#endif
+
+#ifndef arch_cmpxchg64
+#define arch_cmpxchg64(...) \
+	__atomic_op_fence(arch_cmpxchg64, __VA_ARGS__)
+#endif
+
+#endif /* arch_cmpxchg64_relaxed */
+
+#ifndef arch_atomic_read_acquire
+static __always_inline int
+arch_atomic_read_acquire(const atomic_t *v)
+{
+	return smp_load_acquire(&(v)->counter);
+}
+#define arch_atomic_read_acquire arch_atomic_read_acquire
+#endif
+
+#ifndef arch_atomic_set_release
+static __always_inline void
+arch_atomic_set_release(atomic_t *v, int i)
+{
+	smp_store_release(&(v)->counter, i);
+}
+#define arch_atomic_set_release arch_atomic_set_release
+#endif
+
+#ifndef arch_atomic_add_return_relaxed
+#define arch_atomic_add_return_acquire arch_atomic_add_return
+#define arch_atomic_add_return_release arch_atomic_add_return
+#define arch_atomic_add_return_relaxed arch_atomic_add_return
+#else /* arch_atomic_add_return_relaxed */
+
+#ifndef arch_atomic_add_return_acquire
+static __always_inline int
+arch_atomic_add_return_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_add_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_add_return_acquire arch_atomic_add_return_acquire
+#endif
+
+#ifndef arch_atomic_add_return_release
+static __always_inline int
+arch_atomic_add_return_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_add_return_relaxed(i, v);
+}
+#define arch_atomic_add_return_release arch_atomic_add_return_release
+#endif
+
+#ifndef arch_atomic_add_return
+static __always_inline int
+arch_atomic_add_return(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_add_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_add_return arch_atomic_add_return
+#endif
+
+#endif /* arch_atomic_add_return_relaxed */
+
+#ifndef arch_atomic_fetch_add_relaxed
+#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add
+#define arch_atomic_fetch_add_release arch_atomic_fetch_add
+#define arch_atomic_fetch_add_relaxed arch_atomic_fetch_add
+#else /* arch_atomic_fetch_add_relaxed */
+
+#ifndef arch_atomic_fetch_add_acquire
+static __always_inline int
+arch_atomic_fetch_add_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_add_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add_acquire
+#endif
+
+#ifndef arch_atomic_fetch_add_release
+static __always_inline int
+arch_atomic_fetch_add_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_add_relaxed(i, v);
+}
+#define arch_atomic_fetch_add_release arch_atomic_fetch_add_release
+#endif
+
+#ifndef arch_atomic_fetch_add
+static __always_inline int
+arch_atomic_fetch_add(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_add_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_add arch_atomic_fetch_add
+#endif
+
+#endif /* arch_atomic_fetch_add_relaxed */
+
+#ifndef arch_atomic_sub_return_relaxed
+#define arch_atomic_sub_return_acquire arch_atomic_sub_return
+#define arch_atomic_sub_return_release arch_atomic_sub_return
+#define arch_atomic_sub_return_relaxed arch_atomic_sub_return
+#else /* arch_atomic_sub_return_relaxed */
+
+#ifndef arch_atomic_sub_return_acquire
+static __always_inline int
+arch_atomic_sub_return_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_sub_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_sub_return_acquire arch_atomic_sub_return_acquire
+#endif
+
+#ifndef arch_atomic_sub_return_release
+static __always_inline int
+arch_atomic_sub_return_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_sub_return_relaxed(i, v);
+}
+#define arch_atomic_sub_return_release arch_atomic_sub_return_release
+#endif
+
+#ifndef arch_atomic_sub_return
+static __always_inline int
+arch_atomic_sub_return(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_sub_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_sub_return arch_atomic_sub_return
+#endif
+
+#endif /* arch_atomic_sub_return_relaxed */
+
+#ifndef arch_atomic_fetch_sub_relaxed
+#define arch_atomic_fetch_sub_acquire arch_atomic_fetch_sub
+#define arch_atomic_fetch_sub_release arch_atomic_fetch_sub
+#define arch_atomic_fetch_sub_relaxed arch_atomic_fetch_sub
+#else /* arch_atomic_fetch_sub_relaxed */
+
+#ifndef arch_atomic_fetch_sub_acquire
+static __always_inline int
+arch_atomic_fetch_sub_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_sub_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_sub_acquire arch_atomic_fetch_sub_acquire
+#endif
+
+#ifndef arch_atomic_fetch_sub_release
+static __always_inline int
+arch_atomic_fetch_sub_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_sub_relaxed(i, v);
+}
+#define arch_atomic_fetch_sub_release arch_atomic_fetch_sub_release
+#endif
+
+#ifndef arch_atomic_fetch_sub
+static __always_inline int
+arch_atomic_fetch_sub(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_sub_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_sub arch_atomic_fetch_sub
+#endif
+
+#endif /* arch_atomic_fetch_sub_relaxed */
+
+#ifndef arch_atomic_inc
+static __always_inline void
+arch_atomic_inc(atomic_t *v)
+{
+	arch_atomic_add(1, v);
+}
+#define arch_atomic_inc arch_atomic_inc
+#endif
+
+#ifndef arch_atomic_inc_return_relaxed
+#ifdef arch_atomic_inc_return
+#define arch_atomic_inc_return_acquire arch_atomic_inc_return
+#define arch_atomic_inc_return_release arch_atomic_inc_return
+#define arch_atomic_inc_return_relaxed arch_atomic_inc_return
+#endif /* arch_atomic_inc_return */
+
+#ifndef arch_atomic_inc_return
+static __always_inline int
+arch_atomic_inc_return(atomic_t *v)
+{
+	return arch_atomic_add_return(1, v);
+}
+#define arch_atomic_inc_return arch_atomic_inc_return
+#endif
+
+#ifndef arch_atomic_inc_return_acquire
+static __always_inline int
+arch_atomic_inc_return_acquire(atomic_t *v)
+{
+	return arch_atomic_add_return_acquire(1, v);
+}
+#define arch_atomic_inc_return_acquire arch_atomic_inc_return_acquire
+#endif
+
+#ifndef arch_atomic_inc_return_release
+static __always_inline int
+arch_atomic_inc_return_release(atomic_t *v)
+{
+	return arch_atomic_add_return_release(1, v);
+}
+#define arch_atomic_inc_return_release arch_atomic_inc_return_release
+#endif
+
+#ifndef arch_atomic_inc_return_relaxed
+static __always_inline int
+arch_atomic_inc_return_relaxed(atomic_t *v)
+{
+	return arch_atomic_add_return_relaxed(1, v);
+}
+#define arch_atomic_inc_return_relaxed arch_atomic_inc_return_relaxed
+#endif
+
+#else /* arch_atomic_inc_return_relaxed */
+
+#ifndef arch_atomic_inc_return_acquire
+static __always_inline int
+arch_atomic_inc_return_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_inc_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_inc_return_acquire arch_atomic_inc_return_acquire
+#endif
+
+#ifndef arch_atomic_inc_return_release
+static __always_inline int
+arch_atomic_inc_return_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_inc_return_relaxed(v);
+}
+#define arch_atomic_inc_return_release arch_atomic_inc_return_release
+#endif
+
+#ifndef arch_atomic_inc_return
+static __always_inline int
+arch_atomic_inc_return(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_inc_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_inc_return arch_atomic_inc_return
+#endif
+
+#endif /* arch_atomic_inc_return_relaxed */
+
+#ifndef arch_atomic_fetch_inc_relaxed
+#ifdef arch_atomic_fetch_inc
+#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc
+#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc
+#define arch_atomic_fetch_inc_relaxed arch_atomic_fetch_inc
+#endif /* arch_atomic_fetch_inc */
+
+#ifndef arch_atomic_fetch_inc
+static __always_inline int
+arch_atomic_fetch_inc(atomic_t *v)
+{
+	return arch_atomic_fetch_add(1, v);
+}
+#define arch_atomic_fetch_inc arch_atomic_fetch_inc
+#endif
+
+#ifndef arch_atomic_fetch_inc_acquire
+static __always_inline int
+arch_atomic_fetch_inc_acquire(atomic_t *v)
+{
+	return arch_atomic_fetch_add_acquire(1, v);
+}
+#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic_fetch_inc_release
+static __always_inline int
+arch_atomic_fetch_inc_release(atomic_t *v)
+{
+	return arch_atomic_fetch_add_release(1, v);
+}
+#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc_release
+#endif
+
+#ifndef arch_atomic_fetch_inc_relaxed
+static __always_inline int
+arch_atomic_fetch_inc_relaxed(atomic_t *v)
+{
+	return arch_atomic_fetch_add_relaxed(1, v);
+}
+#define arch_atomic_fetch_inc_relaxed arch_atomic_fetch_inc_relaxed
+#endif
+
+#else /* arch_atomic_fetch_inc_relaxed */
+
+#ifndef arch_atomic_fetch_inc_acquire
+static __always_inline int
+arch_atomic_fetch_inc_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_fetch_inc_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic_fetch_inc_release
+static __always_inline int
+arch_atomic_fetch_inc_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_inc_relaxed(v);
+}
+#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc_release
+#endif
+
+#ifndef arch_atomic_fetch_inc
+static __always_inline int
+arch_atomic_fetch_inc(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_inc_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_inc arch_atomic_fetch_inc
+#endif
+
+#endif /* arch_atomic_fetch_inc_relaxed */
+
+#ifndef arch_atomic_dec
+static __always_inline void
+arch_atomic_dec(atomic_t *v)
+{
+	arch_atomic_sub(1, v);
+}
+#define arch_atomic_dec arch_atomic_dec
+#endif
+
+#ifndef arch_atomic_dec_return_relaxed
+#ifdef arch_atomic_dec_return
+#define arch_atomic_dec_return_acquire arch_atomic_dec_return
+#define arch_atomic_dec_return_release arch_atomic_dec_return
+#define arch_atomic_dec_return_relaxed arch_atomic_dec_return
+#endif /* arch_atomic_dec_return */
+
+#ifndef arch_atomic_dec_return
+static __always_inline int
+arch_atomic_dec_return(atomic_t *v)
+{
+	return arch_atomic_sub_return(1, v);
+}
+#define arch_atomic_dec_return arch_atomic_dec_return
+#endif
+
+#ifndef arch_atomic_dec_return_acquire
+static __always_inline int
+arch_atomic_dec_return_acquire(atomic_t *v)
+{
+	return arch_atomic_sub_return_acquire(1, v);
+}
+#define arch_atomic_dec_return_acquire arch_atomic_dec_return_acquire
+#endif
+
+#ifndef arch_atomic_dec_return_release
+static __always_inline int
+arch_atomic_dec_return_release(atomic_t *v)
+{
+	return arch_atomic_sub_return_release(1, v);
+}
+#define arch_atomic_dec_return_release arch_atomic_dec_return_release
+#endif
+
+#ifndef arch_atomic_dec_return_relaxed
+static __always_inline int
+arch_atomic_dec_return_relaxed(atomic_t *v)
+{
+	return arch_atomic_sub_return_relaxed(1, v);
+}
+#define arch_atomic_dec_return_relaxed arch_atomic_dec_return_relaxed
+#endif
+
+#else /* arch_atomic_dec_return_relaxed */
+
+#ifndef arch_atomic_dec_return_acquire
+static __always_inline int
+arch_atomic_dec_return_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_dec_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_dec_return_acquire arch_atomic_dec_return_acquire
+#endif
+
+#ifndef arch_atomic_dec_return_release
+static __always_inline int
+arch_atomic_dec_return_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_dec_return_relaxed(v);
+}
+#define arch_atomic_dec_return_release arch_atomic_dec_return_release
+#endif
+
+#ifndef arch_atomic_dec_return
+static __always_inline int
+arch_atomic_dec_return(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_dec_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_dec_return arch_atomic_dec_return
+#endif
+
+#endif /* arch_atomic_dec_return_relaxed */
+
+#ifndef arch_atomic_fetch_dec_relaxed
+#ifdef arch_atomic_fetch_dec
+#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec
+#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec
+#define arch_atomic_fetch_dec_relaxed arch_atomic_fetch_dec
+#endif /* arch_atomic_fetch_dec */
+
+#ifndef arch_atomic_fetch_dec
+static __always_inline int
+arch_atomic_fetch_dec(atomic_t *v)
+{
+	return arch_atomic_fetch_sub(1, v);
+}
+#define arch_atomic_fetch_dec arch_atomic_fetch_dec
+#endif
+
+#ifndef arch_atomic_fetch_dec_acquire
+static __always_inline int
+arch_atomic_fetch_dec_acquire(atomic_t *v)
+{
+	return arch_atomic_fetch_sub_acquire(1, v);
+}
+#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic_fetch_dec_release
+static __always_inline int
+arch_atomic_fetch_dec_release(atomic_t *v)
+{
+	return arch_atomic_fetch_sub_release(1, v);
+}
+#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec_release
+#endif
+
+#ifndef arch_atomic_fetch_dec_relaxed
+static __always_inline int
+arch_atomic_fetch_dec_relaxed(atomic_t *v)
+{
+	return arch_atomic_fetch_sub_relaxed(1, v);
+}
+#define arch_atomic_fetch_dec_relaxed arch_atomic_fetch_dec_relaxed
+#endif
+
+#else /* arch_atomic_fetch_dec_relaxed */
+
+#ifndef arch_atomic_fetch_dec_acquire
+static __always_inline int
+arch_atomic_fetch_dec_acquire(atomic_t *v)
+{
+	int ret = arch_atomic_fetch_dec_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic_fetch_dec_release
+static __always_inline int
+arch_atomic_fetch_dec_release(atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_dec_relaxed(v);
+}
+#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec_release
+#endif
+
+#ifndef arch_atomic_fetch_dec
+static __always_inline int
+arch_atomic_fetch_dec(atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_dec_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_dec arch_atomic_fetch_dec
+#endif
+
+#endif /* arch_atomic_fetch_dec_relaxed */
+
+#ifndef arch_atomic_fetch_and_relaxed
+#define arch_atomic_fetch_and_acquire arch_atomic_fetch_and
+#define arch_atomic_fetch_and_release arch_atomic_fetch_and
+#define arch_atomic_fetch_and_relaxed arch_atomic_fetch_and
+#else /* arch_atomic_fetch_and_relaxed */
+
+#ifndef arch_atomic_fetch_and_acquire
+static __always_inline int
+arch_atomic_fetch_and_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_and_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_and_acquire arch_atomic_fetch_and_acquire
+#endif
+
+#ifndef arch_atomic_fetch_and_release
+static __always_inline int
+arch_atomic_fetch_and_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_and_relaxed(i, v);
+}
+#define arch_atomic_fetch_and_release arch_atomic_fetch_and_release
+#endif
+
+#ifndef arch_atomic_fetch_and
+static __always_inline int
+arch_atomic_fetch_and(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_and_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_and arch_atomic_fetch_and
+#endif
+
+#endif /* arch_atomic_fetch_and_relaxed */
+
+#ifndef arch_atomic_andnot
+static __always_inline void
+arch_atomic_andnot(int i, atomic_t *v)
+{
+	arch_atomic_and(~i, v);
+}
+#define arch_atomic_andnot arch_atomic_andnot
+#endif
+
+#ifndef arch_atomic_fetch_andnot_relaxed
+#ifdef arch_atomic_fetch_andnot
+#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot
+#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot
+#define arch_atomic_fetch_andnot_relaxed arch_atomic_fetch_andnot
+#endif /* arch_atomic_fetch_andnot */
+
+#ifndef arch_atomic_fetch_andnot
+static __always_inline int
+arch_atomic_fetch_andnot(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and(~i, v);
+}
+#define arch_atomic_fetch_andnot arch_atomic_fetch_andnot
+#endif
+
+#ifndef arch_atomic_fetch_andnot_acquire
+static __always_inline int
+arch_atomic_fetch_andnot_acquire(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and_acquire(~i, v);
+}
+#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic_fetch_andnot_release
+static __always_inline int
+arch_atomic_fetch_andnot_release(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and_release(~i, v);
+}
+#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic_fetch_andnot_relaxed
+static __always_inline int
+arch_atomic_fetch_andnot_relaxed(int i, atomic_t *v)
+{
+	return arch_atomic_fetch_and_relaxed(~i, v);
+}
+#define arch_atomic_fetch_andnot_relaxed arch_atomic_fetch_andnot_relaxed
+#endif
+
+#else /* arch_atomic_fetch_andnot_relaxed */
+
+#ifndef arch_atomic_fetch_andnot_acquire
+static __always_inline int
+arch_atomic_fetch_andnot_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_andnot_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic_fetch_andnot_release
+static __always_inline int
+arch_atomic_fetch_andnot_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_andnot_relaxed(i, v);
+}
+#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic_fetch_andnot
+static __always_inline int
+arch_atomic_fetch_andnot(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_andnot_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_andnot arch_atomic_fetch_andnot
+#endif
+
+#endif /* arch_atomic_fetch_andnot_relaxed */
+
+#ifndef arch_atomic_fetch_or_relaxed
+#define arch_atomic_fetch_or_acquire arch_atomic_fetch_or
+#define arch_atomic_fetch_or_release arch_atomic_fetch_or
+#define arch_atomic_fetch_or_relaxed arch_atomic_fetch_or
+#else /* arch_atomic_fetch_or_relaxed */
+
+#ifndef arch_atomic_fetch_or_acquire
+static __always_inline int
+arch_atomic_fetch_or_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_or_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_or_acquire arch_atomic_fetch_or_acquire
+#endif
+
+#ifndef arch_atomic_fetch_or_release
+static __always_inline int
+arch_atomic_fetch_or_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_or_relaxed(i, v);
+}
+#define arch_atomic_fetch_or_release arch_atomic_fetch_or_release
+#endif
+
+#ifndef arch_atomic_fetch_or
+static __always_inline int
+arch_atomic_fetch_or(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_or_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_or arch_atomic_fetch_or
+#endif
+
+#endif /* arch_atomic_fetch_or_relaxed */
+
+#ifndef arch_atomic_fetch_xor_relaxed
+#define arch_atomic_fetch_xor_acquire arch_atomic_fetch_xor
+#define arch_atomic_fetch_xor_release arch_atomic_fetch_xor
+#define arch_atomic_fetch_xor_relaxed arch_atomic_fetch_xor
+#else /* arch_atomic_fetch_xor_relaxed */
+
+#ifndef arch_atomic_fetch_xor_acquire
+static __always_inline int
+arch_atomic_fetch_xor_acquire(int i, atomic_t *v)
+{
+	int ret = arch_atomic_fetch_xor_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_fetch_xor_acquire arch_atomic_fetch_xor_acquire
+#endif
+
+#ifndef arch_atomic_fetch_xor_release
+static __always_inline int
+arch_atomic_fetch_xor_release(int i, atomic_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic_fetch_xor_relaxed(i, v);
+}
+#define arch_atomic_fetch_xor_release arch_atomic_fetch_xor_release
+#endif
+
+#ifndef arch_atomic_fetch_xor
+static __always_inline int
+arch_atomic_fetch_xor(int i, atomic_t *v)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_fetch_xor_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_fetch_xor arch_atomic_fetch_xor
+#endif
+
+#endif /* arch_atomic_fetch_xor_relaxed */
+
+#ifndef arch_atomic_xchg_relaxed
+#define arch_atomic_xchg_acquire arch_atomic_xchg
+#define arch_atomic_xchg_release arch_atomic_xchg
+#define arch_atomic_xchg_relaxed arch_atomic_xchg
+#else /* arch_atomic_xchg_relaxed */
+
+#ifndef arch_atomic_xchg_acquire
+static __always_inline int
+arch_atomic_xchg_acquire(atomic_t *v, int i)
+{
+	int ret = arch_atomic_xchg_relaxed(v, i);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_xchg_acquire arch_atomic_xchg_acquire
+#endif
+
+#ifndef arch_atomic_xchg_release
+static __always_inline int
+arch_atomic_xchg_release(atomic_t *v, int i)
+{
+	__atomic_release_fence();
+	return arch_atomic_xchg_relaxed(v, i);
+}
+#define arch_atomic_xchg_release arch_atomic_xchg_release
+#endif
+
+#ifndef arch_atomic_xchg
+static __always_inline int
+arch_atomic_xchg(atomic_t *v, int i)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_xchg_relaxed(v, i);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_xchg arch_atomic_xchg
+#endif
+
+#endif /* arch_atomic_xchg_relaxed */
+
+#ifndef arch_atomic_cmpxchg_relaxed
+#define arch_atomic_cmpxchg_acquire arch_atomic_cmpxchg
+#define arch_atomic_cmpxchg_release arch_atomic_cmpxchg
+#define arch_atomic_cmpxchg_relaxed arch_atomic_cmpxchg
+#else /* arch_atomic_cmpxchg_relaxed */
+
+#ifndef arch_atomic_cmpxchg_acquire
+static __always_inline int
+arch_atomic_cmpxchg_acquire(atomic_t *v, int old, int new)
+{
+	int ret = arch_atomic_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_cmpxchg_acquire arch_atomic_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic_cmpxchg_release
+static __always_inline int
+arch_atomic_cmpxchg_release(atomic_t *v, int old, int new)
+{
+	__atomic_release_fence();
+	return arch_atomic_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic_cmpxchg_release arch_atomic_cmpxchg_release
+#endif
+
+#ifndef arch_atomic_cmpxchg
+static __always_inline int
+arch_atomic_cmpxchg(atomic_t *v, int old, int new)
+{
+	int ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
+#endif
+
+#endif /* arch_atomic_cmpxchg_relaxed */
+
+#ifndef arch_atomic_try_cmpxchg_relaxed
+#ifdef arch_atomic_try_cmpxchg
+#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg
+#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg
+#define arch_atomic_try_cmpxchg_relaxed arch_atomic_try_cmpxchg
+#endif /* arch_atomic_try_cmpxchg */
+
+#ifndef arch_atomic_try_cmpxchg
+static __always_inline bool
+arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg_acquire(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_release
+static __always_inline bool
+arch_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg_release(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_relaxed
+static __always_inline bool
+arch_atomic_try_cmpxchg_relaxed(atomic_t *v, int *old, int new)
+{
+	int r, o = *old;
+	r = arch_atomic_cmpxchg_relaxed(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic_try_cmpxchg_relaxed arch_atomic_try_cmpxchg_relaxed
+#endif
+
+#else /* arch_atomic_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
+{
+	bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic_try_cmpxchg_release
+static __always_inline bool
+arch_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
+{
+	__atomic_release_fence();
+	return arch_atomic_try_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic_try_cmpxchg
+static __always_inline bool
+arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+{
+	bool ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
+#endif
+
+#endif /* arch_atomic_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic_sub_and_test
+/**
+ * arch_atomic_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type atomic_t
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic_sub_and_test(int i, atomic_t *v)
+{
+	return arch_atomic_sub_return(i, v) == 0;
+}
+#define arch_atomic_sub_and_test arch_atomic_sub_and_test
+#endif
+
+#ifndef arch_atomic_dec_and_test
+/**
+ * arch_atomic_dec_and_test - decrement and test
+ * @v: pointer of type atomic_t
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static __always_inline bool
+arch_atomic_dec_and_test(atomic_t *v)
+{
+	return arch_atomic_dec_return(v) == 0;
+}
+#define arch_atomic_dec_and_test arch_atomic_dec_and_test
+#endif
+
+#ifndef arch_atomic_inc_and_test
+/**
+ * arch_atomic_inc_and_test - increment and test
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic_inc_and_test(atomic_t *v)
+{
+	return arch_atomic_inc_return(v) == 0;
+}
+#define arch_atomic_inc_and_test arch_atomic_inc_and_test
+#endif
+
+#ifndef arch_atomic_add_negative
+/**
+ * arch_atomic_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic_add_negative(int i, atomic_t *v)
+{
+	return arch_atomic_add_return(i, v) < 0;
+}
+#define arch_atomic_add_negative arch_atomic_add_negative
+#endif
+
+#ifndef arch_atomic_fetch_add_unless
+/**
+ * arch_atomic_fetch_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, so long as @v was not already @u.
+ * Returns original value of @v
+ */
+static __always_inline int
+arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
+{
+	int c = arch_atomic_read(v);
+
+	do {
+		if (unlikely(c == u))
+			break;
+	} while (!arch_atomic_try_cmpxchg(v, &c, c + a));
+
+	return c;
+}
+#define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
+#endif
+
+#ifndef arch_atomic_add_unless
+/**
+ * arch_atomic_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, if @v was not already @u.
+ * Returns true if the addition was done.
+ */
+static __always_inline bool
+arch_atomic_add_unless(atomic_t *v, int a, int u)
+{
+	return arch_atomic_fetch_add_unless(v, a, u) != u;
+}
+#define arch_atomic_add_unless arch_atomic_add_unless
+#endif
+
+#ifndef arch_atomic_inc_not_zero
+/**
+ * arch_atomic_inc_not_zero - increment unless the number is zero
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1, if @v is non-zero.
+ * Returns true if the increment was done.
+ */
+static __always_inline bool
+arch_atomic_inc_not_zero(atomic_t *v)
+{
+	return arch_atomic_add_unless(v, 1, 0);
+}
+#define arch_atomic_inc_not_zero arch_atomic_inc_not_zero
+#endif
+
+#ifndef arch_atomic_inc_unless_negative
+static __always_inline bool
+arch_atomic_inc_unless_negative(atomic_t *v)
+{
+	int c = arch_atomic_read(v);
+
+	do {
+		if (unlikely(c < 0))
+			return false;
+	} while (!arch_atomic_try_cmpxchg(v, &c, c + 1));
+
+	return true;
+}
+#define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
+#endif
+
+#ifndef arch_atomic_dec_unless_positive
+static __always_inline bool
+arch_atomic_dec_unless_positive(atomic_t *v)
+{
+	int c = arch_atomic_read(v);
+
+	do {
+		if (unlikely(c > 0))
+			return false;
+	} while (!arch_atomic_try_cmpxchg(v, &c, c - 1));
+
+	return true;
+}
+#define arch_atomic_dec_unless_positive arch_atomic_dec_unless_positive
+#endif
+
+#ifndef arch_atomic_dec_if_positive
+static __always_inline int
+arch_atomic_dec_if_positive(atomic_t *v)
+{
+	int dec, c = arch_atomic_read(v);
+
+	do {
+		dec = c - 1;
+		if (unlikely(dec < 0))
+			break;
+	} while (!arch_atomic_try_cmpxchg(v, &c, dec));
+
+	return dec;
+}
+#define arch_atomic_dec_if_positive arch_atomic_dec_if_positive
+#endif
+
+#ifdef CONFIG_GENERIC_ATOMIC64
+#include <asm-generic/atomic64.h>
+#endif
+
+#ifndef arch_atomic64_read_acquire
+static __always_inline s64
+arch_atomic64_read_acquire(const atomic64_t *v)
+{
+	return smp_load_acquire(&(v)->counter);
+}
+#define arch_atomic64_read_acquire arch_atomic64_read_acquire
+#endif
+
+#ifndef arch_atomic64_set_release
+static __always_inline void
+arch_atomic64_set_release(atomic64_t *v, s64 i)
+{
+	smp_store_release(&(v)->counter, i);
+}
+#define arch_atomic64_set_release arch_atomic64_set_release
+#endif
+
+#ifndef arch_atomic64_add_return_relaxed
+#define arch_atomic64_add_return_acquire arch_atomic64_add_return
+#define arch_atomic64_add_return_release arch_atomic64_add_return
+#define arch_atomic64_add_return_relaxed arch_atomic64_add_return
+#else /* arch_atomic64_add_return_relaxed */
+
+#ifndef arch_atomic64_add_return_acquire
+static __always_inline s64
+arch_atomic64_add_return_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_add_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_add_return_acquire arch_atomic64_add_return_acquire
+#endif
+
+#ifndef arch_atomic64_add_return_release
+static __always_inline s64
+arch_atomic64_add_return_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_add_return_relaxed(i, v);
+}
+#define arch_atomic64_add_return_release arch_atomic64_add_return_release
+#endif
+
+#ifndef arch_atomic64_add_return
+static __always_inline s64
+arch_atomic64_add_return(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_add_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_add_return arch_atomic64_add_return
+#endif
+
+#endif /* arch_atomic64_add_return_relaxed */
+
+#ifndef arch_atomic64_fetch_add_relaxed
+#define arch_atomic64_fetch_add_acquire arch_atomic64_fetch_add
+#define arch_atomic64_fetch_add_release arch_atomic64_fetch_add
+#define arch_atomic64_fetch_add_relaxed arch_atomic64_fetch_add
+#else /* arch_atomic64_fetch_add_relaxed */
+
+#ifndef arch_atomic64_fetch_add_acquire
+static __always_inline s64
+arch_atomic64_fetch_add_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_add_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_add_acquire arch_atomic64_fetch_add_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_add_release
+static __always_inline s64
+arch_atomic64_fetch_add_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_add_relaxed(i, v);
+}
+#define arch_atomic64_fetch_add_release arch_atomic64_fetch_add_release
+#endif
+
+#ifndef arch_atomic64_fetch_add
+static __always_inline s64
+arch_atomic64_fetch_add(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_add_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_add arch_atomic64_fetch_add
+#endif
+
+#endif /* arch_atomic64_fetch_add_relaxed */
+
+#ifndef arch_atomic64_sub_return_relaxed
+#define arch_atomic64_sub_return_acquire arch_atomic64_sub_return
+#define arch_atomic64_sub_return_release arch_atomic64_sub_return
+#define arch_atomic64_sub_return_relaxed arch_atomic64_sub_return
+#else /* arch_atomic64_sub_return_relaxed */
+
+#ifndef arch_atomic64_sub_return_acquire
+static __always_inline s64
+arch_atomic64_sub_return_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_sub_return_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_sub_return_acquire arch_atomic64_sub_return_acquire
+#endif
+
+#ifndef arch_atomic64_sub_return_release
+static __always_inline s64
+arch_atomic64_sub_return_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_sub_return_relaxed(i, v);
+}
+#define arch_atomic64_sub_return_release arch_atomic64_sub_return_release
+#endif
+
+#ifndef arch_atomic64_sub_return
+static __always_inline s64
+arch_atomic64_sub_return(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_sub_return_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_sub_return arch_atomic64_sub_return
+#endif
+
+#endif /* arch_atomic64_sub_return_relaxed */
+
+#ifndef arch_atomic64_fetch_sub_relaxed
+#define arch_atomic64_fetch_sub_acquire arch_atomic64_fetch_sub
+#define arch_atomic64_fetch_sub_release arch_atomic64_fetch_sub
+#define arch_atomic64_fetch_sub_relaxed arch_atomic64_fetch_sub
+#else /* arch_atomic64_fetch_sub_relaxed */
+
+#ifndef arch_atomic64_fetch_sub_acquire
+static __always_inline s64
+arch_atomic64_fetch_sub_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_sub_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_sub_acquire arch_atomic64_fetch_sub_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_sub_release
+static __always_inline s64
+arch_atomic64_fetch_sub_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_sub_relaxed(i, v);
+}
+#define arch_atomic64_fetch_sub_release arch_atomic64_fetch_sub_release
+#endif
+
+#ifndef arch_atomic64_fetch_sub
+static __always_inline s64
+arch_atomic64_fetch_sub(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_sub_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_sub arch_atomic64_fetch_sub
+#endif
+
+#endif /* arch_atomic64_fetch_sub_relaxed */
+
+#ifndef arch_atomic64_inc
+static __always_inline void
+arch_atomic64_inc(atomic64_t *v)
+{
+	arch_atomic64_add(1, v);
+}
+#define arch_atomic64_inc arch_atomic64_inc
+#endif
+
+#ifndef arch_atomic64_inc_return_relaxed
+#ifdef arch_atomic64_inc_return
+#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return
+#define arch_atomic64_inc_return_release arch_atomic64_inc_return
+#define arch_atomic64_inc_return_relaxed arch_atomic64_inc_return
+#endif /* arch_atomic64_inc_return */
+
+#ifndef arch_atomic64_inc_return
+static __always_inline s64
+arch_atomic64_inc_return(atomic64_t *v)
+{
+	return arch_atomic64_add_return(1, v);
+}
+#define arch_atomic64_inc_return arch_atomic64_inc_return
+#endif
+
+#ifndef arch_atomic64_inc_return_acquire
+static __always_inline s64
+arch_atomic64_inc_return_acquire(atomic64_t *v)
+{
+	return arch_atomic64_add_return_acquire(1, v);
+}
+#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return_acquire
+#endif
+
+#ifndef arch_atomic64_inc_return_release
+static __always_inline s64
+arch_atomic64_inc_return_release(atomic64_t *v)
+{
+	return arch_atomic64_add_return_release(1, v);
+}
+#define arch_atomic64_inc_return_release arch_atomic64_inc_return_release
+#endif
+
+#ifndef arch_atomic64_inc_return_relaxed
+static __always_inline s64
+arch_atomic64_inc_return_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_add_return_relaxed(1, v);
+}
+#define arch_atomic64_inc_return_relaxed arch_atomic64_inc_return_relaxed
+#endif
+
+#else /* arch_atomic64_inc_return_relaxed */
+
+#ifndef arch_atomic64_inc_return_acquire
+static __always_inline s64
+arch_atomic64_inc_return_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_inc_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return_acquire
+#endif
+
+#ifndef arch_atomic64_inc_return_release
+static __always_inline s64
+arch_atomic64_inc_return_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_inc_return_relaxed(v);
+}
+#define arch_atomic64_inc_return_release arch_atomic64_inc_return_release
+#endif
+
+#ifndef arch_atomic64_inc_return
+static __always_inline s64
+arch_atomic64_inc_return(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_inc_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_inc_return arch_atomic64_inc_return
+#endif
+
+#endif /* arch_atomic64_inc_return_relaxed */
+
+#ifndef arch_atomic64_fetch_inc_relaxed
+#ifdef arch_atomic64_fetch_inc
+#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc
+#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc
+#define arch_atomic64_fetch_inc_relaxed arch_atomic64_fetch_inc
+#endif /* arch_atomic64_fetch_inc */
+
+#ifndef arch_atomic64_fetch_inc
+static __always_inline s64
+arch_atomic64_fetch_inc(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add(1, v);
+}
+#define arch_atomic64_fetch_inc arch_atomic64_fetch_inc
+#endif
+
+#ifndef arch_atomic64_fetch_inc_acquire
+static __always_inline s64
+arch_atomic64_fetch_inc_acquire(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add_acquire(1, v);
+}
+#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_inc_release
+static __always_inline s64
+arch_atomic64_fetch_inc_release(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add_release(1, v);
+}
+#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc_release
+#endif
+
+#ifndef arch_atomic64_fetch_inc_relaxed
+static __always_inline s64
+arch_atomic64_fetch_inc_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_fetch_add_relaxed(1, v);
+}
+#define arch_atomic64_fetch_inc_relaxed arch_atomic64_fetch_inc_relaxed
+#endif
+
+#else /* arch_atomic64_fetch_inc_relaxed */
+
+#ifndef arch_atomic64_fetch_inc_acquire
+static __always_inline s64
+arch_atomic64_fetch_inc_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_inc_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_inc_release
+static __always_inline s64
+arch_atomic64_fetch_inc_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_inc_relaxed(v);
+}
+#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc_release
+#endif
+
+#ifndef arch_atomic64_fetch_inc
+static __always_inline s64
+arch_atomic64_fetch_inc(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_inc_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_inc arch_atomic64_fetch_inc
+#endif
+
+#endif /* arch_atomic64_fetch_inc_relaxed */
+
+#ifndef arch_atomic64_dec
+static __always_inline void
+arch_atomic64_dec(atomic64_t *v)
+{
+	arch_atomic64_sub(1, v);
+}
+#define arch_atomic64_dec arch_atomic64_dec
+#endif
+
+#ifndef arch_atomic64_dec_return_relaxed
+#ifdef arch_atomic64_dec_return
+#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return
+#define arch_atomic64_dec_return_release arch_atomic64_dec_return
+#define arch_atomic64_dec_return_relaxed arch_atomic64_dec_return
+#endif /* arch_atomic64_dec_return */
+
+#ifndef arch_atomic64_dec_return
+static __always_inline s64
+arch_atomic64_dec_return(atomic64_t *v)
+{
+	return arch_atomic64_sub_return(1, v);
+}
+#define arch_atomic64_dec_return arch_atomic64_dec_return
+#endif
+
+#ifndef arch_atomic64_dec_return_acquire
+static __always_inline s64
+arch_atomic64_dec_return_acquire(atomic64_t *v)
+{
+	return arch_atomic64_sub_return_acquire(1, v);
+}
+#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return_acquire
+#endif
+
+#ifndef arch_atomic64_dec_return_release
+static __always_inline s64
+arch_atomic64_dec_return_release(atomic64_t *v)
+{
+	return arch_atomic64_sub_return_release(1, v);
+}
+#define arch_atomic64_dec_return_release arch_atomic64_dec_return_release
+#endif
+
+#ifndef arch_atomic64_dec_return_relaxed
+static __always_inline s64
+arch_atomic64_dec_return_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_sub_return_relaxed(1, v);
+}
+#define arch_atomic64_dec_return_relaxed arch_atomic64_dec_return_relaxed
+#endif
+
+#else /* arch_atomic64_dec_return_relaxed */
+
+#ifndef arch_atomic64_dec_return_acquire
+static __always_inline s64
+arch_atomic64_dec_return_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_dec_return_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return_acquire
+#endif
+
+#ifndef arch_atomic64_dec_return_release
+static __always_inline s64
+arch_atomic64_dec_return_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_dec_return_relaxed(v);
+}
+#define arch_atomic64_dec_return_release arch_atomic64_dec_return_release
+#endif
+
+#ifndef arch_atomic64_dec_return
+static __always_inline s64
+arch_atomic64_dec_return(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_dec_return_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_dec_return arch_atomic64_dec_return
+#endif
+
+#endif /* arch_atomic64_dec_return_relaxed */
+
+#ifndef arch_atomic64_fetch_dec_relaxed
+#ifdef arch_atomic64_fetch_dec
+#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec
+#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec
+#define arch_atomic64_fetch_dec_relaxed arch_atomic64_fetch_dec
+#endif /* arch_atomic64_fetch_dec */
+
+#ifndef arch_atomic64_fetch_dec
+static __always_inline s64
+arch_atomic64_fetch_dec(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub(1, v);
+}
+#define arch_atomic64_fetch_dec arch_atomic64_fetch_dec
+#endif
+
+#ifndef arch_atomic64_fetch_dec_acquire
+static __always_inline s64
+arch_atomic64_fetch_dec_acquire(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub_acquire(1, v);
+}
+#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_dec_release
+static __always_inline s64
+arch_atomic64_fetch_dec_release(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub_release(1, v);
+}
+#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec_release
+#endif
+
+#ifndef arch_atomic64_fetch_dec_relaxed
+static __always_inline s64
+arch_atomic64_fetch_dec_relaxed(atomic64_t *v)
+{
+	return arch_atomic64_fetch_sub_relaxed(1, v);
+}
+#define arch_atomic64_fetch_dec_relaxed arch_atomic64_fetch_dec_relaxed
+#endif
+
+#else /* arch_atomic64_fetch_dec_relaxed */
+
+#ifndef arch_atomic64_fetch_dec_acquire
+static __always_inline s64
+arch_atomic64_fetch_dec_acquire(atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_dec_relaxed(v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_dec_release
+static __always_inline s64
+arch_atomic64_fetch_dec_release(atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_dec_relaxed(v);
+}
+#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec_release
+#endif
+
+#ifndef arch_atomic64_fetch_dec
+static __always_inline s64
+arch_atomic64_fetch_dec(atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_dec_relaxed(v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_dec arch_atomic64_fetch_dec
+#endif
+
+#endif /* arch_atomic64_fetch_dec_relaxed */
+
+#ifndef arch_atomic64_fetch_and_relaxed
+#define arch_atomic64_fetch_and_acquire arch_atomic64_fetch_and
+#define arch_atomic64_fetch_and_release arch_atomic64_fetch_and
+#define arch_atomic64_fetch_and_relaxed arch_atomic64_fetch_and
+#else /* arch_atomic64_fetch_and_relaxed */
+
+#ifndef arch_atomic64_fetch_and_acquire
+static __always_inline s64
+arch_atomic64_fetch_and_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_and_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_and_acquire arch_atomic64_fetch_and_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_and_release
+static __always_inline s64
+arch_atomic64_fetch_and_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_and_relaxed(i, v);
+}
+#define arch_atomic64_fetch_and_release arch_atomic64_fetch_and_release
+#endif
+
+#ifndef arch_atomic64_fetch_and
+static __always_inline s64
+arch_atomic64_fetch_and(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_and_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_and arch_atomic64_fetch_and
+#endif
+
+#endif /* arch_atomic64_fetch_and_relaxed */
+
+#ifndef arch_atomic64_andnot
+static __always_inline void
+arch_atomic64_andnot(s64 i, atomic64_t *v)
+{
+	arch_atomic64_and(~i, v);
+}
+#define arch_atomic64_andnot arch_atomic64_andnot
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_relaxed
+#ifdef arch_atomic64_fetch_andnot
+#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot
+#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot
+#define arch_atomic64_fetch_andnot_relaxed arch_atomic64_fetch_andnot
+#endif /* arch_atomic64_fetch_andnot */
+
+#ifndef arch_atomic64_fetch_andnot
+static __always_inline s64
+arch_atomic64_fetch_andnot(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and(~i, v);
+}
+#define arch_atomic64_fetch_andnot arch_atomic64_fetch_andnot
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_acquire
+static __always_inline s64
+arch_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and_acquire(~i, v);
+}
+#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_release
+static __always_inline s64
+arch_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and_release(~i, v);
+}
+#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_relaxed
+static __always_inline s64
+arch_atomic64_fetch_andnot_relaxed(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_fetch_and_relaxed(~i, v);
+}
+#define arch_atomic64_fetch_andnot_relaxed arch_atomic64_fetch_andnot_relaxed
+#endif
+
+#else /* arch_atomic64_fetch_andnot_relaxed */
+
+#ifndef arch_atomic64_fetch_andnot_acquire
+static __always_inline s64
+arch_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_andnot_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_andnot_release
+static __always_inline s64
+arch_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_andnot_relaxed(i, v);
+}
+#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot_release
+#endif
+
+#ifndef arch_atomic64_fetch_andnot
+static __always_inline s64
+arch_atomic64_fetch_andnot(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_andnot_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_andnot arch_atomic64_fetch_andnot
+#endif
+
+#endif /* arch_atomic64_fetch_andnot_relaxed */
+
+#ifndef arch_atomic64_fetch_or_relaxed
+#define arch_atomic64_fetch_or_acquire arch_atomic64_fetch_or
+#define arch_atomic64_fetch_or_release arch_atomic64_fetch_or
+#define arch_atomic64_fetch_or_relaxed arch_atomic64_fetch_or
+#else /* arch_atomic64_fetch_or_relaxed */
+
+#ifndef arch_atomic64_fetch_or_acquire
+static __always_inline s64
+arch_atomic64_fetch_or_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_or_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_or_acquire arch_atomic64_fetch_or_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_or_release
+static __always_inline s64
+arch_atomic64_fetch_or_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_or_relaxed(i, v);
+}
+#define arch_atomic64_fetch_or_release arch_atomic64_fetch_or_release
+#endif
+
+#ifndef arch_atomic64_fetch_or
+static __always_inline s64
+arch_atomic64_fetch_or(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_or_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_or arch_atomic64_fetch_or
+#endif
+
+#endif /* arch_atomic64_fetch_or_relaxed */
+
+#ifndef arch_atomic64_fetch_xor_relaxed
+#define arch_atomic64_fetch_xor_acquire arch_atomic64_fetch_xor
+#define arch_atomic64_fetch_xor_release arch_atomic64_fetch_xor
+#define arch_atomic64_fetch_xor_relaxed arch_atomic64_fetch_xor
+#else /* arch_atomic64_fetch_xor_relaxed */
+
+#ifndef arch_atomic64_fetch_xor_acquire
+static __always_inline s64
+arch_atomic64_fetch_xor_acquire(s64 i, atomic64_t *v)
+{
+	s64 ret = arch_atomic64_fetch_xor_relaxed(i, v);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_xor_acquire arch_atomic64_fetch_xor_acquire
+#endif
+
+#ifndef arch_atomic64_fetch_xor_release
+static __always_inline s64
+arch_atomic64_fetch_xor_release(s64 i, atomic64_t *v)
+{
+	__atomic_release_fence();
+	return arch_atomic64_fetch_xor_relaxed(i, v);
+}
+#define arch_atomic64_fetch_xor_release arch_atomic64_fetch_xor_release
+#endif
+
+#ifndef arch_atomic64_fetch_xor
+static __always_inline s64
+arch_atomic64_fetch_xor(s64 i, atomic64_t *v)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_fetch_xor_relaxed(i, v);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
+#endif
+
+#endif /* arch_atomic64_fetch_xor_relaxed */
+
+#ifndef arch_atomic64_xchg_relaxed
+#define arch_atomic64_xchg_acquire arch_atomic64_xchg
+#define arch_atomic64_xchg_release arch_atomic64_xchg
+#define arch_atomic64_xchg_relaxed arch_atomic64_xchg
+#else /* arch_atomic64_xchg_relaxed */
+
+#ifndef arch_atomic64_xchg_acquire
+static __always_inline s64
+arch_atomic64_xchg_acquire(atomic64_t *v, s64 i)
+{
+	s64 ret = arch_atomic64_xchg_relaxed(v, i);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_xchg_acquire arch_atomic64_xchg_acquire
+#endif
+
+#ifndef arch_atomic64_xchg_release
+static __always_inline s64
+arch_atomic64_xchg_release(atomic64_t *v, s64 i)
+{
+	__atomic_release_fence();
+	return arch_atomic64_xchg_relaxed(v, i);
+}
+#define arch_atomic64_xchg_release arch_atomic64_xchg_release
+#endif
+
+#ifndef arch_atomic64_xchg
+static __always_inline s64
+arch_atomic64_xchg(atomic64_t *v, s64 i)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_xchg_relaxed(v, i);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_xchg arch_atomic64_xchg
+#endif
+
+#endif /* arch_atomic64_xchg_relaxed */
+
+#ifndef arch_atomic64_cmpxchg_relaxed
+#define arch_atomic64_cmpxchg_acquire arch_atomic64_cmpxchg
+#define arch_atomic64_cmpxchg_release arch_atomic64_cmpxchg
+#define arch_atomic64_cmpxchg_relaxed arch_atomic64_cmpxchg
+#else /* arch_atomic64_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_cmpxchg_acquire
+static __always_inline s64
+arch_atomic64_cmpxchg_acquire(atomic64_t *v, s64 old, s64 new)
+{
+	s64 ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_cmpxchg_acquire arch_atomic64_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic64_cmpxchg_release
+static __always_inline s64
+arch_atomic64_cmpxchg_release(atomic64_t *v, s64 old, s64 new)
+{
+	__atomic_release_fence();
+	return arch_atomic64_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic64_cmpxchg_release arch_atomic64_cmpxchg_release
+#endif
+
+#ifndef arch_atomic64_cmpxchg
+static __always_inline s64
+arch_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
+{
+	s64 ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
+#endif
+
+#endif /* arch_atomic64_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_try_cmpxchg_relaxed
+#ifdef arch_atomic64_try_cmpxchg
+#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg
+#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg
+#define arch_atomic64_try_cmpxchg_relaxed arch_atomic64_try_cmpxchg
+#endif /* arch_atomic64_try_cmpxchg */
+
+#ifndef arch_atomic64_try_cmpxchg
+static __always_inline bool
+arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg_acquire(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_release
+static __always_inline bool
+arch_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg_release(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_relaxed
+static __always_inline bool
+arch_atomic64_try_cmpxchg_relaxed(atomic64_t *v, s64 *old, s64 new)
+{
+	s64 r, o = *old;
+	r = arch_atomic64_cmpxchg_relaxed(v, o, new);
+	if (unlikely(r != o))
+		*old = r;
+	return likely(r == o);
+}
+#define arch_atomic64_try_cmpxchg_relaxed arch_atomic64_try_cmpxchg_relaxed
+#endif
+
+#else /* arch_atomic64_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_try_cmpxchg_acquire
+static __always_inline bool
+arch_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
+{
+	bool ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+	__atomic_acquire_fence();
+	return ret;
+}
+#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg_acquire
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg_release
+static __always_inline bool
+arch_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
+{
+	__atomic_release_fence();
+	return arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+}
+#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg_release
+#endif
+
+#ifndef arch_atomic64_try_cmpxchg
+static __always_inline bool
+arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
+{
+	bool ret;
+	__atomic_pre_full_fence();
+	ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+	__atomic_post_full_fence();
+	return ret;
+}
+#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
+#endif
+
+#endif /* arch_atomic64_try_cmpxchg_relaxed */
+
+#ifndef arch_atomic64_sub_and_test
+/**
+ * arch_atomic64_sub_and_test - subtract value from variable and test result
+ * @i: integer value to subtract
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically subtracts @i from @v and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic64_sub_and_test(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_sub_return(i, v) == 0;
+}
+#define arch_atomic64_sub_and_test arch_atomic64_sub_and_test
+#endif
+
+#ifndef arch_atomic64_dec_and_test
+/**
+ * arch_atomic64_dec_and_test - decrement and test
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically decrements @v by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static __always_inline bool
+arch_atomic64_dec_and_test(atomic64_t *v)
+{
+	return arch_atomic64_dec_return(v) == 0;
+}
+#define arch_atomic64_dec_and_test arch_atomic64_dec_and_test
+#endif
+
+#ifndef arch_atomic64_inc_and_test
+/**
+ * arch_atomic64_inc_and_test - increment and test
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically increments @v by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static __always_inline bool
+arch_atomic64_inc_and_test(atomic64_t *v)
+{
+	return arch_atomic64_inc_return(v) == 0;
+}
+#define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
+#endif
+
+#ifndef arch_atomic64_add_negative
+/**
+ * arch_atomic64_add_negative - add and test if negative
+ * @i: integer value to add
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically adds @i to @v and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static __always_inline bool
+arch_atomic64_add_negative(s64 i, atomic64_t *v)
+{
+	return arch_atomic64_add_return(i, v) < 0;
+}
+#define arch_atomic64_add_negative arch_atomic64_add_negative
+#endif
+
+#ifndef arch_atomic64_fetch_add_unless
+/**
+ * arch_atomic64_fetch_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic64_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, so long as @v was not already @u.
+ * Returns original value of @v
+ */
+static __always_inline s64
+arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
+{
+	s64 c = arch_atomic64_read(v);
+
+	do {
+		if (unlikely(c == u))
+			break;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, c + a));
+
+	return c;
+}
+#define arch_atomic64_fetch_add_unless arch_atomic64_fetch_add_unless
+#endif
+
+#ifndef arch_atomic64_add_unless
+/**
+ * arch_atomic64_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic64_t
+ * @a: the amount to add to v...
+ * @u: ...unless v is equal to u.
+ *
+ * Atomically adds @a to @v, if @v was not already @u.
+ * Returns true if the addition was done.
+ */
+static __always_inline bool
+arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
+{
+	return arch_atomic64_fetch_add_unless(v, a, u) != u;
+}
+#define arch_atomic64_add_unless arch_atomic64_add_unless
+#endif
+
+#ifndef arch_atomic64_inc_not_zero
+/**
+ * arch_atomic64_inc_not_zero - increment unless the number is zero
+ * @v: pointer of type atomic64_t
+ *
+ * Atomically increments @v by 1, if @v is non-zero.
+ * Returns true if the increment was done.
+ */
+static __always_inline bool
+arch_atomic64_inc_not_zero(atomic64_t *v)
+{
+	return arch_atomic64_add_unless(v, 1, 0);
+}
+#define arch_atomic64_inc_not_zero arch_atomic64_inc_not_zero
+#endif
+
+#ifndef arch_atomic64_inc_unless_negative
+static __always_inline bool
+arch_atomic64_inc_unless_negative(atomic64_t *v)
+{
+	s64 c = arch_atomic64_read(v);
+
+	do {
+		if (unlikely(c < 0))
+			return false;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, c + 1));
+
+	return true;
+}
+#define arch_atomic64_inc_unless_negative arch_atomic64_inc_unless_negative
+#endif
+
+#ifndef arch_atomic64_dec_unless_positive
+static __always_inline bool
+arch_atomic64_dec_unless_positive(atomic64_t *v)
+{
+	s64 c = arch_atomic64_read(v);
+
+	do {
+		if (unlikely(c > 0))
+			return false;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, c - 1));
+
+	return true;
+}
+#define arch_atomic64_dec_unless_positive arch_atomic64_dec_unless_positive
+#endif
+
+#ifndef arch_atomic64_dec_if_positive
+static __always_inline s64
+arch_atomic64_dec_if_positive(atomic64_t *v)
+{
+	s64 dec, c = arch_atomic64_read(v);
+
+	do {
+		dec = c - 1;
+		if (unlikely(dec < 0))
+			break;
+	} while (!arch_atomic64_try_cmpxchg(v, &c, dec));
+
+	return dec;
+}
+#define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
+#endif
+
+#endif /* _LINUX_ATOMIC_FALLBACK_H */
+// 90cd26cfd69d2250303d654955a0cc12620fb91b
diff --git a/include/linux/atomic-fallback.h b/include/linux/atomic-fallback.h
index 656b548..2c4927b 100644
--- a/include/linux/atomic-fallback.h
+++ b/include/linux/atomic-fallback.h
@@ -1180,9 +1180,6 @@ atomic_dec_if_positive(atomic_t *v)
 #define atomic_dec_if_positive atomic_dec_if_positive
 #endif
 
-#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif
@@ -2290,8 +2287,5 @@ atomic64_dec_if_positive(atomic64_t *v)
 #define atomic64_dec_if_positive atomic64_dec_if_positive
 #endif
 
-#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// baaf45f4c24ed88ceae58baca39d7fd80bb8101b
+// 1fac0941c79bf0ae100723cc2ac9b94061f0b67a
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 4c0d009..571a110 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -25,6 +25,12 @@
  * See Documentation/memory-barriers.txt for ACQUIRE/RELEASE definitions.
  */
 
+#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
+#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
+
+#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
+#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
+
 /*
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
@@ -71,7 +77,12 @@
 	__ret;								\
 })
 
+#ifdef ARCH_ATOMIC
+#include <linux/atomic-arch-fallback.h>
+#include <asm-generic/atomic-instrumented.h>
+#else
 #include <linux/atomic-fallback.h>
+#endif
 
 #include <asm-generic/atomic-long.h>
 
diff --git a/scripts/atomic/fallbacks/acquire b/scripts/atomic/fallbacks/acquire
index ea489ac..59c0052 100755
--- a/scripts/atomic/fallbacks/acquire
+++ b/scripts/atomic/fallbacks/acquire
@@ -1,8 +1,8 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}${name}${sfx}_acquire(${params})
+${arch}${atomic}_${pfx}${name}${sfx}_acquire(${params})
 {
-	${ret} ret = ${atomic}_${pfx}${name}${sfx}_relaxed(${args});
+	${ret} ret = ${arch}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
 	__atomic_acquire_fence();
 	return ret;
 }
diff --git a/scripts/atomic/fallbacks/add_negative b/scripts/atomic/fallbacks/add_negative
index 03cc2e0..a66635b 100755
--- a/scripts/atomic/fallbacks/add_negative
+++ b/scripts/atomic/fallbacks/add_negative
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_add_negative - add and test if negative
+ * ${arch}${atomic}_add_negative - add and test if negative
  * @i: integer value to add
  * @v: pointer of type ${atomic}_t
  *
@@ -9,8 +9,8 @@ cat <<EOF
  * result is greater than or equal to zero.
  */
 static __always_inline bool
-${atomic}_add_negative(${int} i, ${atomic}_t *v)
+${arch}${atomic}_add_negative(${int} i, ${atomic}_t *v)
 {
-	return ${atomic}_add_return(i, v) < 0;
+	return ${arch}${atomic}_add_return(i, v) < 0;
 }
 EOF
diff --git a/scripts/atomic/fallbacks/add_unless b/scripts/atomic/fallbacks/add_unless
index daf87a0..2ff598a 100755
--- a/scripts/atomic/fallbacks/add_unless
+++ b/scripts/atomic/fallbacks/add_unless
@@ -1,6 +1,6 @@
 cat << EOF
 /**
- * ${atomic}_add_unless - add unless the number is already a given value
+ * ${arch}${atomic}_add_unless - add unless the number is already a given value
  * @v: pointer of type ${atomic}_t
  * @a: the amount to add to v...
  * @u: ...unless v is equal to u.
@@ -9,8 +9,8 @@ cat << EOF
  * Returns true if the addition was done.
  */
 static __always_inline bool
-${atomic}_add_unless(${atomic}_t *v, ${int} a, ${int} u)
+${arch}${atomic}_add_unless(${atomic}_t *v, ${int} a, ${int} u)
 {
-	return ${atomic}_fetch_add_unless(v, a, u) != u;
+	return ${arch}${atomic}_fetch_add_unless(v, a, u) != u;
 }
 EOF
diff --git a/scripts/atomic/fallbacks/andnot b/scripts/atomic/fallbacks/andnot
index 14efce0..3f18663 100755
--- a/scripts/atomic/fallbacks/andnot
+++ b/scripts/atomic/fallbacks/andnot
@@ -1,7 +1,7 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}andnot${sfx}${order}(${int} i, ${atomic}_t *v)
+${arch}${atomic}_${pfx}andnot${sfx}${order}(${int} i, ${atomic}_t *v)
 {
-	${retstmt}${atomic}_${pfx}and${sfx}${order}(~i, v);
+	${retstmt}${arch}${atomic}_${pfx}and${sfx}${order}(~i, v);
 }
 EOF
diff --git a/scripts/atomic/fallbacks/dec b/scripts/atomic/fallbacks/dec
index 118282f..e2e01f0 100755
--- a/scripts/atomic/fallbacks/dec
+++ b/scripts/atomic/fallbacks/dec
@@ -1,7 +1,7 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}dec${sfx}${order}(${atomic}_t *v)
+${arch}${atomic}_${pfx}dec${sfx}${order}(${atomic}_t *v)
 {
-	${retstmt}${atomic}_${pfx}sub${sfx}${order}(1, v);
+	${retstmt}${arch}${atomic}_${pfx}sub${sfx}${order}(1, v);
 }
 EOF
diff --git a/scripts/atomic/fallbacks/dec_and_test b/scripts/atomic/fallbacks/dec_and_test
index f8967a8..e8a5e49 100755
--- a/scripts/atomic/fallbacks/dec_and_test
+++ b/scripts/atomic/fallbacks/dec_and_test
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_dec_and_test - decrement and test
+ * ${arch}${atomic}_dec_and_test - decrement and test
  * @v: pointer of type ${atomic}_t
  *
  * Atomically decrements @v by 1 and
@@ -8,8 +8,8 @@ cat <<EOF
  * cases.
  */
 static __always_inline bool
-${atomic}_dec_and_test(${atomic}_t *v)
+${arch}${atomic}_dec_and_test(${atomic}_t *v)
 {
-	return ${atomic}_dec_return(v) == 0;
+	return ${arch}${atomic}_dec_return(v) == 0;
 }
 EOF
diff --git a/scripts/atomic/fallbacks/dec_if_positive b/scripts/atomic/fallbacks/dec_if_positive
index cfb380b..527adec 100755
--- a/scripts/atomic/fallbacks/dec_if_positive
+++ b/scripts/atomic/fallbacks/dec_if_positive
@@ -1,14 +1,14 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_dec_if_positive(${atomic}_t *v)
+${arch}${atomic}_dec_if_positive(${atomic}_t *v)
 {
-	${int} dec, c = ${atomic}_read(v);
+	${int} dec, c = ${arch}${atomic}_read(v);
 
 	do {
 		dec = c - 1;
 		if (unlikely(dec < 0))
 			break;
-	} while (!${atomic}_try_cmpxchg(v, &c, dec));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, dec));
 
 	return dec;
 }
diff --git a/scripts/atomic/fallbacks/dec_unless_positive b/scripts/atomic/fallbacks/dec_unless_positive
index 69cb7aa..dcab684 100755
--- a/scripts/atomic/fallbacks/dec_unless_positive
+++ b/scripts/atomic/fallbacks/dec_unless_positive
@@ -1,13 +1,13 @@
 cat <<EOF
 static __always_inline bool
-${atomic}_dec_unless_positive(${atomic}_t *v)
+${arch}${atomic}_dec_unless_positive(${atomic}_t *v)
 {
-	${int} c = ${atomic}_read(v);
+	${int} c = ${arch}${atomic}_read(v);
 
 	do {
 		if (unlikely(c > 0))
 			return false;
-	} while (!${atomic}_try_cmpxchg(v, &c, c - 1));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, c - 1));
 
 	return true;
 }
diff --git a/scripts/atomic/fallbacks/fence b/scripts/atomic/fallbacks/fence
index 92a3a46..3764fc8 100755
--- a/scripts/atomic/fallbacks/fence
+++ b/scripts/atomic/fallbacks/fence
@@ -1,10 +1,10 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}${name}${sfx}(${params})
+${arch}${atomic}_${pfx}${name}${sfx}(${params})
 {
 	${ret} ret;
 	__atomic_pre_full_fence();
-	ret = ${atomic}_${pfx}${name}${sfx}_relaxed(${args});
+	ret = ${arch}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
 	__atomic_post_full_fence();
 	return ret;
 }
diff --git a/scripts/atomic/fallbacks/fetch_add_unless b/scripts/atomic/fallbacks/fetch_add_unless
index fffbc0d..0e0b9ae 100755
--- a/scripts/atomic/fallbacks/fetch_add_unless
+++ b/scripts/atomic/fallbacks/fetch_add_unless
@@ -1,6 +1,6 @@
 cat << EOF
 /**
- * ${atomic}_fetch_add_unless - add unless the number is already a given value
+ * ${arch}${atomic}_fetch_add_unless - add unless the number is already a given value
  * @v: pointer of type ${atomic}_t
  * @a: the amount to add to v...
  * @u: ...unless v is equal to u.
@@ -9,14 +9,14 @@ cat << EOF
  * Returns original value of @v
  */
 static __always_inline ${int}
-${atomic}_fetch_add_unless(${atomic}_t *v, ${int} a, ${int} u)
+${arch}${atomic}_fetch_add_unless(${atomic}_t *v, ${int} a, ${int} u)
 {
-	${int} c = ${atomic}_read(v);
+	${int} c = ${arch}${atomic}_read(v);
 
 	do {
 		if (unlikely(c == u))
 			break;
-	} while (!${atomic}_try_cmpxchg(v, &c, c + a));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, c + a));
 
 	return c;
 }
diff --git a/scripts/atomic/fallbacks/inc b/scripts/atomic/fallbacks/inc
index 10751cd..15ec629 100755
--- a/scripts/atomic/fallbacks/inc
+++ b/scripts/atomic/fallbacks/inc
@@ -1,7 +1,7 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}inc${sfx}${order}(${atomic}_t *v)
+${arch}${atomic}_${pfx}inc${sfx}${order}(${atomic}_t *v)
 {
-	${retstmt}${atomic}_${pfx}add${sfx}${order}(1, v);
+	${retstmt}${arch}${atomic}_${pfx}add${sfx}${order}(1, v);
 }
 EOF
diff --git a/scripts/atomic/fallbacks/inc_and_test b/scripts/atomic/fallbacks/inc_and_test
index 4acea9c..cecc832 100755
--- a/scripts/atomic/fallbacks/inc_and_test
+++ b/scripts/atomic/fallbacks/inc_and_test
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_inc_and_test - increment and test
+ * ${arch}${atomic}_inc_and_test - increment and test
  * @v: pointer of type ${atomic}_t
  *
  * Atomically increments @v by 1
@@ -8,8 +8,8 @@ cat <<EOF
  * other cases.
  */
 static __always_inline bool
-${atomic}_inc_and_test(${atomic}_t *v)
+${arch}${atomic}_inc_and_test(${atomic}_t *v)
 {
-	return ${atomic}_inc_return(v) == 0;
+	return ${arch}${atomic}_inc_return(v) == 0;
 }
 EOF
diff --git a/scripts/atomic/fallbacks/inc_not_zero b/scripts/atomic/fallbacks/inc_not_zero
index d9f7b97..50f2d4d 100755
--- a/scripts/atomic/fallbacks/inc_not_zero
+++ b/scripts/atomic/fallbacks/inc_not_zero
@@ -1,14 +1,14 @@
 cat <<EOF
 /**
- * ${atomic}_inc_not_zero - increment unless the number is zero
+ * ${arch}${atomic}_inc_not_zero - increment unless the number is zero
  * @v: pointer of type ${atomic}_t
  *
  * Atomically increments @v by 1, if @v is non-zero.
  * Returns true if the increment was done.
  */
 static __always_inline bool
-${atomic}_inc_not_zero(${atomic}_t *v)
+${arch}${atomic}_inc_not_zero(${atomic}_t *v)
 {
-	return ${atomic}_add_unless(v, 1, 0);
+	return ${arch}${atomic}_add_unless(v, 1, 0);
 }
 EOF
diff --git a/scripts/atomic/fallbacks/inc_unless_negative b/scripts/atomic/fallbacks/inc_unless_negative
index 177a7cb..87629e0 100755
--- a/scripts/atomic/fallbacks/inc_unless_negative
+++ b/scripts/atomic/fallbacks/inc_unless_negative
@@ -1,13 +1,13 @@
 cat <<EOF
 static __always_inline bool
-${atomic}_inc_unless_negative(${atomic}_t *v)
+${arch}${atomic}_inc_unless_negative(${atomic}_t *v)
 {
-	${int} c = ${atomic}_read(v);
+	${int} c = ${arch}${atomic}_read(v);
 
 	do {
 		if (unlikely(c < 0))
 			return false;
-	} while (!${atomic}_try_cmpxchg(v, &c, c + 1));
+	} while (!${arch}${atomic}_try_cmpxchg(v, &c, c + 1));
 
 	return true;
 }
diff --git a/scripts/atomic/fallbacks/read_acquire b/scripts/atomic/fallbacks/read_acquire
index 12fa83c..341a88d 100755
--- a/scripts/atomic/fallbacks/read_acquire
+++ b/scripts/atomic/fallbacks/read_acquire
@@ -1,6 +1,6 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_read_acquire(const ${atomic}_t *v)
+${arch}${atomic}_read_acquire(const ${atomic}_t *v)
 {
 	return smp_load_acquire(&(v)->counter);
 }
diff --git a/scripts/atomic/fallbacks/release b/scripts/atomic/fallbacks/release
index 730d2a6..f8906d5 100755
--- a/scripts/atomic/fallbacks/release
+++ b/scripts/atomic/fallbacks/release
@@ -1,8 +1,8 @@
 cat <<EOF
 static __always_inline ${ret}
-${atomic}_${pfx}${name}${sfx}_release(${params})
+${arch}${atomic}_${pfx}${name}${sfx}_release(${params})
 {
 	__atomic_release_fence();
-	${retstmt}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
+	${retstmt}${arch}${atomic}_${pfx}${name}${sfx}_relaxed(${args});
 }
 EOF
diff --git a/scripts/atomic/fallbacks/set_release b/scripts/atomic/fallbacks/set_release
index e5d72c7..7606827 100755
--- a/scripts/atomic/fallbacks/set_release
+++ b/scripts/atomic/fallbacks/set_release
@@ -1,6 +1,6 @@
 cat <<EOF
 static __always_inline void
-${atomic}_set_release(${atomic}_t *v, ${int} i)
+${arch}${atomic}_set_release(${atomic}_t *v, ${int} i)
 {
 	smp_store_release(&(v)->counter, i);
 }
diff --git a/scripts/atomic/fallbacks/sub_and_test b/scripts/atomic/fallbacks/sub_and_test
index 6cfe4ed..c580f4c 100755
--- a/scripts/atomic/fallbacks/sub_and_test
+++ b/scripts/atomic/fallbacks/sub_and_test
@@ -1,6 +1,6 @@
 cat <<EOF
 /**
- * ${atomic}_sub_and_test - subtract value from variable and test result
+ * ${arch}${atomic}_sub_and_test - subtract value from variable and test result
  * @i: integer value to subtract
  * @v: pointer of type ${atomic}_t
  *
@@ -9,8 +9,8 @@ cat <<EOF
  * other cases.
  */
 static __always_inline bool
-${atomic}_sub_and_test(${int} i, ${atomic}_t *v)
+${arch}${atomic}_sub_and_test(${int} i, ${atomic}_t *v)
 {
-	return ${atomic}_sub_return(i, v) == 0;
+	return ${arch}${atomic}_sub_return(i, v) == 0;
 }
 EOF
diff --git a/scripts/atomic/fallbacks/try_cmpxchg b/scripts/atomic/fallbacks/try_cmpxchg
index c7a2621..06db0f7 100755
--- a/scripts/atomic/fallbacks/try_cmpxchg
+++ b/scripts/atomic/fallbacks/try_cmpxchg
@@ -1,9 +1,9 @@
 cat <<EOF
 static __always_inline bool
-${atomic}_try_cmpxchg${order}(${atomic}_t *v, ${int} *old, ${int} new)
+${arch}${atomic}_try_cmpxchg${order}(${atomic}_t *v, ${int} *old, ${int} new)
 {
 	${int} r, o = *old;
-	r = ${atomic}_cmpxchg${order}(v, o, new);
+	r = ${arch}${atomic}_cmpxchg${order}(v, o, new);
 	if (unlikely(r != o))
 		*old = r;
 	return likely(r == o);
diff --git a/scripts/atomic/gen-atomic-fallback.sh b/scripts/atomic/gen-atomic-fallback.sh
index b6c6f5d..0fd1cf0 100755
--- a/scripts/atomic/gen-atomic-fallback.sh
+++ b/scripts/atomic/gen-atomic-fallback.sh
@@ -2,10 +2,11 @@
 # SPDX-License-Identifier: GPL-2.0
 
 ATOMICDIR=$(dirname $0)
+ARCH=$2
 
 . ${ATOMICDIR}/atomic-tbl.sh
 
-#gen_template_fallback(template, meta, pfx, name, sfx, order, atomic, int, args...)
+#gen_template_fallback(template, meta, pfx, name, sfx, order, arch, atomic, int, args...)
 gen_template_fallback()
 {
 	local template="$1"; shift
@@ -14,10 +15,11 @@ gen_template_fallback()
 	local name="$1"; shift
 	local sfx="$1"; shift
 	local order="$1"; shift
+	local arch="$1"; shift
 	local atomic="$1"; shift
 	local int="$1"; shift
 
-	local atomicname="${atomic}_${pfx}${name}${sfx}${order}"
+	local atomicname="${arch}${atomic}_${pfx}${name}${sfx}${order}"
 
 	local ret="$(gen_ret_type "${meta}" "${int}")"
 	local retstmt="$(gen_ret_stmt "${meta}")"
@@ -32,7 +34,7 @@ gen_template_fallback()
 	fi
 }
 
-#gen_proto_fallback(meta, pfx, name, sfx, order, atomic, int, args...)
+#gen_proto_fallback(meta, pfx, name, sfx, order, arch, atomic, int, args...)
 gen_proto_fallback()
 {
 	local meta="$1"; shift
@@ -56,16 +58,17 @@ cat << EOF
 EOF
 }
 
-#gen_proto_order_variants(meta, pfx, name, sfx, atomic, int, args...)
+#gen_proto_order_variants(meta, pfx, name, sfx, arch, atomic, int, args...)
 gen_proto_order_variants()
 {
 	local meta="$1"; shift
 	local pfx="$1"; shift
 	local name="$1"; shift
 	local sfx="$1"; shift
-	local atomic="$1"
+	local arch="$1"
+	local atomic="$2"
 
-	local basename="${atomic}_${pfx}${name}${sfx}"
+	local basename="${arch}${atomic}_${pfx}${name}${sfx}"
 
 	local template="$(find_fallback_template "${pfx}" "${name}" "${sfx}" "${order}")"
 
@@ -94,7 +97,7 @@ gen_proto_order_variants()
 	gen_basic_fallbacks "${basename}"
 
 	if [ ! -z "${template}" ]; then
-		printf "#endif /* ${atomic}_${pfx}${name}${sfx} */\n\n"
+		printf "#endif /* ${arch}${atomic}_${pfx}${name}${sfx} */\n\n"
 		gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "" "$@"
 		gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_acquire" "$@"
 		gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_release" "$@"
@@ -153,18 +156,15 @@ cat << EOF
 
 EOF
 
-for xchg in "xchg" "cmpxchg" "cmpxchg64"; do
+for xchg in "${ARCH}xchg" "${ARCH}cmpxchg" "${ARCH}cmpxchg64"; do
 	gen_xchg_fallbacks "${xchg}"
 done
 
 grep '^[a-z]' "$1" | while read name meta args; do
-	gen_proto "${meta}" "${name}" "atomic" "int" ${args}
+	gen_proto "${meta}" "${name}" "${ARCH}" "atomic" "int" ${args}
 done
 
 cat <<EOF
-#define atomic_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif
@@ -172,12 +172,9 @@ cat <<EOF
 EOF
 
 grep '^[a-z]' "$1" | while read name meta args; do
-	gen_proto "${meta}" "${name}" "atomic64" "s64" ${args}
+	gen_proto "${meta}" "${name}" "${ARCH}" "atomic64" "s64" ${args}
 done
 
 cat <<EOF
-#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
-#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
-
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
 EOF
diff --git a/scripts/atomic/gen-atomics.sh b/scripts/atomic/gen-atomics.sh
index 000dc64..d29e159 100644
--- a/scripts/atomic/gen-atomics.sh
+++ b/scripts/atomic/gen-atomics.sh
@@ -10,10 +10,11 @@ LINUXDIR=${ATOMICDIR}/../..
 cat <<EOF |
 gen-atomic-instrumented.sh      asm-generic/atomic-instrumented.h
 gen-atomic-long.sh              asm-generic/atomic-long.h
+gen-atomic-fallback.sh          linux/atomic-arch-fallback.h		arch_
 gen-atomic-fallback.sh          linux/atomic-fallback.h
 EOF
-while read script header; do
-	/bin/sh ${ATOMICDIR}/${script} ${ATOMICTBL} > ${LINUXDIR}/include/${header}
+while read script header args; do
+	/bin/sh ${ATOMICDIR}/${script} ${ATOMICTBL} ${args} > ${LINUXDIR}/include/${header}
 	HASH="$(sha1sum ${LINUXDIR}/include/${header})"
 	HASH="${HASH%% *}"
 	printf "// %s\n" "${HASH}" >> ${LINUXDIR}/include/${header}
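
To make the new layering concrete: an architecture opts in by defining ARCH_ATOMIC and providing arch_atomic_*() primitives from its asm/atomic.h. linux/atomic.h (see the #ifdef ARCH_ATOMIC hunk above) then pulls in the generated linux/atomic-arch-fallback.h, which fills in the missing orderings and derived operations, and asm-generic/atomic-instrumented.h, which builds the instrumented atomic_*() wrappers on top. A rough, hypothetical sketch of such an opt-in header; the arch name is invented and the compiler builtins are only stand-ins for real atomic instructions:

/*
 * Hypothetical arch/foo/include/asm/atomic.h fragment (illustration only,
 * not part of this patch).
 */
#ifndef _ASM_FOO_ATOMIC_H
#define _ASM_FOO_ATOMIC_H

#include <linux/compiler.h>
#include <linux/types.h>

/* Advertise the arch_atomic_*() interface to linux/atomic.h */
#define ARCH_ATOMIC

#define arch_atomic_read(v)	READ_ONCE((v)->counter)
#define arch_atomic_set(v, i)	WRITE_ONCE((v)->counter, (i))

static __always_inline int arch_atomic_fetch_add_relaxed(int i, atomic_t *v)
{
	/* stand-in for the architecture's native RMW instruction */
	return __atomic_fetch_add(&v->counter, i, __ATOMIC_RELAXED);
}
#define arch_atomic_fetch_add_relaxed arch_atomic_fetch_add_relaxed

static __always_inline int arch_atomic_cmpxchg_relaxed(atomic_t *v, int old, int new)
{
	__atomic_compare_exchange_n(&v->counter, &old, new, false,
				    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
	return old;
}
#define arch_atomic_cmpxchg_relaxed arch_atomic_cmpxchg_relaxed

/*
 * atomic-arch-fallback.h generates the _acquire/_release/fully ordered
 * variants from the _relaxed ones above and builds derived operations
 * such as arch_atomic_try_cmpxchg() and arch_atomic_inc_not_zero() on
 * top of the cmpxchg primitive.
 */

#endif /* _ASM_FOO_ATOMIC_H */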

* [tip: sched/core] sched: Make scheduler_ipi inline
  2020-05-05 13:16 ` [patch V4 part 1 04/36] sched: Make scheduler_ipi inline Thomas Gleixner
  2020-05-06 12:42   ` Alexandre Chartre
@ 2020-05-12 15:13   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-12 15:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2a0a24ebb499c9d499eea948d3fc108f936e36d4
Gitweb:        https://git.kernel.org/tip/2a0a24ebb499c9d499eea948d3fc108f936e36d4
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 27 Mar 2020 12:42:00 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 17:10:49 +02:00

sched: Make scheduler_ipi inline

Now that the scheduler IPI is trivial and simple again there is no point in
keeping the little function out of line. This nicely simplifies the effort of
constraining the instrumentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134058.453581595@linutronix.de
---
 include/linux/sched.h | 10 +++++++++-
 kernel/sched/core.c   | 10 ----------
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5c..d4ea440 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,7 +1715,15 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
 })
 
 #ifdef CONFIG_SMP
-void scheduler_ipi(void);
+static __always_inline void scheduler_ipi(void)
+{
+	/*
+	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
+	 * TIF_NEED_RESCHED remotely (for the first time) will also send
+	 * this IPI.
+	 */
+	preempt_fold_need_resched();
+}
 extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
 #else
 static inline void scheduler_ipi(void) { }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cd2070d..74fb89b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,16 +2312,6 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
-void scheduler_ipi(void)
-{
-	/*
-	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
-	 * TIF_NEED_RESCHED remotely (for the first time) will also send
-	 * this IPI.
-	 */
-	preempt_fold_need_resched();
-}
-
 static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);

* [tip: sched/core] sched: Clean up scheduler_ipi()
  2020-05-05 13:16 ` [patch V4 part 1 03/36] sched: Clean up scheduler_ipi() Thomas Gleixner
                     ` (2 preceding siblings ...)
  2020-05-06 12:37   ` Alexandre Chartre
@ 2020-05-12 15:13   ` tip-bot2 for Peter Zijlstra (Intel)
  3 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra (Intel) @ 2020-05-12 15:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, Alexandre Chartre, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     90b5363acd4739769c3f38c1aff16171bd133e8c
Gitweb:        https://git.kernel.org/tip/90b5363acd4739769c3f38c1aff16171bd133e8c
Author:        Peter Zijlstra (Intel) <peterz@infradead.org>
AuthorDate:    Fri, 27 Mar 2020 11:44:56 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 17:10:48 +02:00

sched: Clean up scheduler_ipi()

The scheduler IPI has grown weird and wonderful over the years; time
for spring cleaning.

Move all the non-trivial stuff out of it and into a regular smp function
call IPI. This reduces scheduler_ipi() to most of its former NOP glory
and keeps the interrupt vector lean and mean.

Aside from that, avoiding the full irq_enter() in the x86 IPI
implementation is incorrect as scheduler_ipi() can be instrumented. To
work around that, scheduler_ipi() had an irq_enter/exit() hack when
heavy work was pending. This is gone now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134058.361859938@linutronix.de
---
 kernel/sched/core.c  | 64 ++++++++++++++++++++-----------------------
 kernel/sched/fair.c  |  5 +--
 kernel/sched/sched.h |  6 ++--
 3 files changed, 36 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b58efb1..cd2070d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -219,6 +219,13 @@ void update_rq_clock(struct rq *rq)
 	update_rq_clock_task(rq, delta);
 }
 
+static inline void
+rq_csd_init(struct rq *rq, call_single_data_t *csd, smp_call_func_t func)
+{
+	csd->flags = 0;
+	csd->func = func;
+	csd->info = rq;
+}
 
 #ifdef CONFIG_SCHED_HRTICK
 /*
@@ -314,16 +321,14 @@ void hrtick_start(struct rq *rq, u64 delay)
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
 		      HRTIMER_MODE_REL_PINNED_HARD);
 }
+
 #endif /* CONFIG_SMP */
 
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-	rq->hrtick_csd.flags = 0;
-	rq->hrtick_csd.func = __hrtick_start;
-	rq->hrtick_csd.info = rq;
+	rq_csd_init(rq, &rq->hrtick_csd, __hrtick_start);
 #endif
-
 	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
 	rq->hrtick_timer.function = hrtick;
 }
@@ -650,6 +655,16 @@ static inline bool got_nohz_idle_kick(void)
 	return false;
 }
 
+static void nohz_csd_func(void *info)
+{
+	struct rq *rq = info;
+
+	if (got_nohz_idle_kick()) {
+		rq->idle_balance = 1;
+		raise_softirq_irqoff(SCHED_SOFTIRQ);
+	}
+}
+
 #else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
@@ -2292,6 +2307,11 @@ void sched_ttwu_pending(void)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
+static void wake_csd_func(void *info)
+{
+	sched_ttwu_pending();
+}
+
 void scheduler_ipi(void)
 {
 	/*
@@ -2300,34 +2320,6 @@ void scheduler_ipi(void)
 	 * this IPI.
 	 */
 	preempt_fold_need_resched();
-
-	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
-		return;
-
-	/*
-	 * Not all reschedule IPI handlers call irq_enter/irq_exit, since
-	 * traditionally all their work was done from the interrupt return
-	 * path. Now that we actually do some work, we need to make sure
-	 * we do call them.
-	 *
-	 * Some archs already do call them, luckily irq_enter/exit nest
-	 * properly.
-	 *
-	 * Arguably we should visit all archs and update all handlers,
-	 * however a fair share of IPIs are still resched only so this would
-	 * somewhat pessimize the simple resched case.
-	 */
-	irq_enter();
-	sched_ttwu_pending();
-
-	/*
-	 * Check if someone kicked us for doing the nohz idle load balance.
-	 */
-	if (unlikely(got_nohz_idle_kick())) {
-		this_rq()->idle_balance = 1;
-		raise_softirq_irqoff(SCHED_SOFTIRQ);
-	}
-	irq_exit();
 }
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
@@ -2336,9 +2328,9 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
-			smp_send_reschedule(cpu);
+			smp_call_function_single_async(cpu, &rq->wake_csd);
 		else
 			trace_sched_wake_idle_without_ipi(cpu);
 	}
@@ -6693,12 +6685,16 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
+		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
+
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
 		rq->last_blocked_load_update_tick = jiffies;
 		atomic_set(&rq->nohz_flags, 0);
+
+		rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
 #endif
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46b7bd4..6b7f147 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10000,12 +10000,11 @@ static void kick_ilb(unsigned int flags)
 		return;
 
 	/*
-	 * Use smp_send_reschedule() instead of resched_cpu().
-	 * This way we generate a sched IPI on the target CPU which
+	 * This way we generate an IPI on the target CPU which
 	 * is idle. And the softirq performing nohz idle load balance
 	 * will be run before returning from the IPI.
 	 */
-	smp_send_reschedule(ilb_cpu);
+	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 978c6fa..21416b3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -889,9 +889,10 @@ struct rq {
 #ifdef CONFIG_SMP
 	unsigned long		last_blocked_load_update_tick;
 	unsigned int		has_blocked_load;
+	call_single_data_t	nohz_csd;
 #endif /* CONFIG_SMP */
 	unsigned int		nohz_tick_stopped;
-	atomic_t nohz_flags;
+	atomic_t		nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
 
 	unsigned long		nr_load_updates;
@@ -978,7 +979,7 @@ struct rq {
 
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
-#endif
+#endif /* CONFIG_SMP */
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
@@ -1020,6 +1021,7 @@ struct rq {
 #endif
 
 #ifdef CONFIG_SMP
+	call_single_data_t	wake_csd;
 	struct llist_head	wake_list;
 #endif
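
The mechanism the patch switches to is the generic asynchronous smp function call: a call_single_data_t is initialized once per target and later handed to smp_call_function_single_async(), which queues the function on the remote CPU and returns without waiting. A minimal, hypothetical sketch of that pattern; the foo_* names are invented for illustration and are not part of the patch:

#include <linux/smp.h>
#include <linux/percpu.h>
#include <linux/printk.h>

/* One csd per target CPU; it must not be reused while a call is in flight. */
static DEFINE_PER_CPU(call_single_data_t, foo_csd);

/* Runs from the IPI on the target CPU, with interrupts disabled. */
static void foo_csd_func(void *info)
{
	pr_info("foo: kicked on CPU%d\n", smp_processor_id());
}

static void foo_csd_init(int cpu)
{
	call_single_data_t *csd = &per_cpu(foo_csd, cpu);

	csd->flags = 0;
	csd->func  = foo_csd_func;
	csd->info  = NULL;
}

/* Queue foo_csd_func() on @cpu and return immediately. */
static void foo_kick(int cpu)
{
	smp_call_function_single_async(cpu, &per_cpu(foo_csd, cpu));
}

In the patch, rq_csd_init() plays the role of foo_csd_init(), while ttwu_queue_remote() and kick_ilb() play the role of foo_kick().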
 

* [tip: core/kprobes] kprobes: Support NOKPROBE_SYMBOL() in modules
  2020-05-05 13:16 ` [patch V4 part 1 17/36] kprobes: Support NOKPROBE_SYMBOL() " Thomas Gleixner
  2020-05-06 15:54   ` Alexandre Chartre
@ 2020-05-12 15:18   ` tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2020-05-12 15:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Masami Hiramatsu, Thomas Gleixner, Alexandre Chartre,
	Peter Zijlstra, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     16db6264c93d2d7df9eb8be5d9eb717ab30105fe
Gitweb:        https://git.kernel.org/tip/16db6264c93d2d7df9eb8be5d9eb717ab30105fe
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Thu, 26 Mar 2020 23:50:00 +09:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 17:15:32 +02:00

kprobes: Support NOKPROBE_SYMBOL() in modules

Support NOKPROBE_SYMBOL() in modules. NOKPROBE_SYMBOL() records only the
symbol address in the "_kprobe_blacklist" section of the module.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.771170126@linutronix.de

---
 include/linux/module.h |  2 ++
 kernel/kprobes.c       | 17 +++++++++++++++++
 kernel/module.c        |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/module.h b/include/linux/module.h
index 369c354..1192097 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -492,6 +492,8 @@ struct module {
 #ifdef CONFIG_KPROBES
 	void *kprobes_text_start;
 	unsigned int kprobes_text_size;
+	unsigned long *kprobe_blacklist;
+	unsigned int num_kprobe_blacklist;
 #endif
 
 #ifdef CONFIG_LIVEPATCH
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index b754999..9eb5acf 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2192,6 +2192,11 @@ static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
 	}
 }
 
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+	kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 int __init __weak arch_populate_kprobe_blacklist(void)
 {
 	return 0;
@@ -2231,6 +2236,12 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 static void add_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
+	int i;
+
+	if (mod->kprobe_blacklist) {
+		for (i = 0; i < mod->num_kprobe_blacklist; i++)
+			kprobe_add_ksym_blacklist(mod->kprobe_blacklist[i]);
+	}
 
 	start = (unsigned long)mod->kprobes_text_start;
 	if (start) {
@@ -2242,6 +2253,12 @@ static void add_module_kprobe_blacklist(struct module *mod)
 static void remove_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
+	int i;
+
+	if (mod->kprobe_blacklist) {
+		for (i = 0; i < mod->num_kprobe_blacklist; i++)
+			kprobe_remove_ksym_blacklist(mod->kprobe_blacklist[i]);
+	}
 
 	start = (unsigned long)mod->kprobes_text_start;
 	if (start) {
diff --git a/kernel/module.c b/kernel/module.c
index 978f3fa..faf7337 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3197,6 +3197,9 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 #ifdef CONFIG_KPROBES
 	mod->kprobes_text_start = section_objs(info, ".kprobes.text", 1,
 						&mod->kprobes_text_size);
+	mod->kprobe_blacklist = section_objs(info, "_kprobe_blacklist",
+						sizeof(unsigned long),
+						&mod->num_kprobe_blacklist);
 #endif
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
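
From the module side nothing beyond the usual annotation is needed; a minimal, hypothetical module (the foo_* names are invented) that blacklists one of its functions could look like the sketch below. The symbol address lands in the module's "_kprobe_blacklist" section, is folded into the kprobe blacklist by add_module_kprobe_blacklist() above, and is dropped again by remove_module_kprobe_blacklist() on unload:

#include <linux/module.h>
#include <linux/kprobes.h>

/* Must never be hit by a kprobe, e.g. because it runs in a fragile context. */
static noinline int foo_critical(int x)
{
	return x + 1;
}
NOKPROBE_SYMBOL(foo_critical);

static int __init foo_init(void)
{
	pr_info("foo: %d\n", foo_critical(41));
	return 0;
}
module_init(foo_init);

static void __exit foo_exit(void)
{
}
module_exit(foo_exit);

MODULE_LICENSE("GPL");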

* [tip: core/kprobes] kprobes: Support __kprobes blacklist in modules
  2020-05-05 13:16 ` [patch V4 part 1 16/36] kprobes: Support __kprobes blacklist in modules Thomas Gleixner
  2020-05-06 15:47   ` Alexandre Chartre
@ 2020-05-12 15:18   ` tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2020-05-12 15:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Masami Hiramatsu, Thomas Gleixner, Alexandre Chartre,
	Peter Zijlstra, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     1e6769b0aece51ea7a3dc3117c37d4a5669e4a21
Gitweb:        https://git.kernel.org/tip/1e6769b0aece51ea7a3dc3117c37d4a5669e4a21
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Thu, 26 Mar 2020 23:49:48 +09:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 17:15:32 +02:00

kprobes: Support __kprobes blacklist in modules

Support the __kprobes attribute for blacklisting functions in modules.
Functions marked with the __kprobes attribute are placed in the
.kprobes.text section.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.678201813@linutronix.de

---
 include/linux/module.h |  4 ++++-
 kernel/kprobes.c       | 42 +++++++++++++++++++++++++++++++++++++++++-
 kernel/module.c        |  4 ++++-
 3 files changed, 50 insertions(+)

diff --git a/include/linux/module.h b/include/linux/module.h
index 1ad393e..369c354 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -489,6 +489,10 @@ struct module {
 	unsigned int num_ftrace_callsites;
 	unsigned long *ftrace_callsites;
 #endif
+#ifdef CONFIG_KPROBES
+	void *kprobes_text_start;
+	unsigned int kprobes_text_size;
+#endif
 
 #ifdef CONFIG_LIVEPATCH
 	bool klp; /* Is this a livepatch module? */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 570d608..b754999 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2179,6 +2179,19 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
 	return 0;
 }
 
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+	struct kprobe_blacklist_entry *ent, *n;
+
+	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+		if (ent->start_addr < start || ent->start_addr >= end)
+			continue;
+		list_del(&ent->list);
+		kfree(ent);
+	}
+}
+
 int __init __weak arch_populate_kprobe_blacklist(void)
 {
 	return 0;
@@ -2215,6 +2228,28 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 	return ret ? : arch_populate_kprobe_blacklist();
 }
 
+static void add_module_kprobe_blacklist(struct module *mod)
+{
+	unsigned long start, end;
+
+	start = (unsigned long)mod->kprobes_text_start;
+	if (start) {
+		end = start + mod->kprobes_text_size;
+		kprobe_add_area_blacklist(start, end);
+	}
+}
+
+static void remove_module_kprobe_blacklist(struct module *mod)
+{
+	unsigned long start, end;
+
+	start = (unsigned long)mod->kprobes_text_start;
+	if (start) {
+		end = start + mod->kprobes_text_size;
+		kprobe_remove_area_blacklist(start, end);
+	}
+}
+
 /* Module notifier call back, checking kprobes on the module */
 static int kprobes_module_callback(struct notifier_block *nb,
 				   unsigned long val, void *data)
@@ -2225,6 +2260,11 @@ static int kprobes_module_callback(struct notifier_block *nb,
 	unsigned int i;
 	int checkcore = (val == MODULE_STATE_GOING);
 
+	if (val == MODULE_STATE_COMING) {
+		mutex_lock(&kprobe_mutex);
+		add_module_kprobe_blacklist(mod);
+		mutex_unlock(&kprobe_mutex);
+	}
 	if (val != MODULE_STATE_GOING && val != MODULE_STATE_LIVE)
 		return NOTIFY_DONE;
 
@@ -2255,6 +2295,8 @@ static int kprobes_module_callback(struct notifier_block *nb,
 				kill_kprobe(p);
 			}
 	}
+	if (val == MODULE_STATE_GOING)
+		remove_module_kprobe_blacklist(mod);
 	mutex_unlock(&kprobe_mutex);
 	return NOTIFY_DONE;
 }
diff --git a/kernel/module.c b/kernel/module.c
index 646f1e2..978f3fa 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3194,6 +3194,10 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 					    sizeof(*mod->ei_funcs),
 					    &mod->num_ei_funcs);
 #endif
+#ifdef CONFIG_KPROBES
+	mod->kprobes_text_start = section_objs(info, ".kprobes.text", 1,
+						&mod->kprobes_text_size);
+#endif
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
 

* [tip: core/kprobes] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers.
  2020-05-05 13:16 ` [patch V4 part 1 18/36] samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers Thomas Gleixner
  2020-05-06 15:57   ` Alexandre Chartre
@ 2020-05-12 15:18   ` tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2020-05-12 15:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Masami Hiramatsu, Thomas Gleixner, Alexandre Chartre,
	Peter Zijlstra, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     d85eaa9411472a99de4b5732cb59c8bae629d5f1
Gitweb:        https://git.kernel.org/tip/d85eaa9411472a99de4b5732cb59c8bae629d5f1
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Thu, 26 Mar 2020 23:50:11 +09:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 17:15:33 +02:00

samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers.

Add __kprobes and NOKPROBE_SYMBOL() for sample kprobe handlers.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.878578033@linutronix.de

---
 samples/kprobes/kprobe_example.c    | 6 ++++--
 samples/kprobes/kretprobe_example.c | 2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/samples/kprobes/kprobe_example.c b/samples/kprobes/kprobe_example.c
index d693c23..501911d 100644
--- a/samples/kprobes/kprobe_example.c
+++ b/samples/kprobes/kprobe_example.c
@@ -25,7 +25,7 @@ static struct kprobe kp = {
 };
 
 /* kprobe pre_handler: called just before the probed instruction is executed */
-static int handler_pre(struct kprobe *p, struct pt_regs *regs)
+static int __kprobes handler_pre(struct kprobe *p, struct pt_regs *regs)
 {
 #ifdef CONFIG_X86
 	pr_info("<%s> pre_handler: p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
@@ -54,7 +54,7 @@ static int handler_pre(struct kprobe *p, struct pt_regs *regs)
 }
 
 /* kprobe post_handler: called after the probed instruction is executed */
-static void handler_post(struct kprobe *p, struct pt_regs *regs,
+static void __kprobes handler_post(struct kprobe *p, struct pt_regs *regs,
 				unsigned long flags)
 {
 #ifdef CONFIG_X86
@@ -90,6 +90,8 @@ static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
 	/* Return 0 because we don't handle the fault. */
 	return 0;
 }
+/* NOKPROBE_SYMBOL() is also available */
+NOKPROBE_SYMBOL(handler_fault);
 
 static int __init kprobe_init(void)
 {
diff --git a/samples/kprobes/kretprobe_example.c b/samples/kprobes/kretprobe_example.c
index 186315c..013e8e6 100644
--- a/samples/kprobes/kretprobe_example.c
+++ b/samples/kprobes/kretprobe_example.c
@@ -48,6 +48,7 @@ static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
 	data->entry_stamp = ktime_get();
 	return 0;
 }
+NOKPROBE_SYMBOL(entry_handler);
 
 /*
  * Return-probe handler: Log the return value and duration. Duration may turn
@@ -67,6 +68,7 @@ static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
 			func_name, retval, (long long)delta);
 	return 0;
 }
+NOKPROBE_SYMBOL(ret_handler);
 
 static struct kretprobe my_kretprobe = {
 	.handler		= ret_handler,

* [tip: core/kprobes] kprobes: Lock kprobe_mutex while showing kprobe_blacklist
  2020-05-05 13:16 ` [patch V4 part 1 15/36] kprobes: Lock kprobe_mutex while showing kprobe_blacklist Thomas Gleixner
  2020-05-06 15:38   ` Alexandre Chartre
@ 2020-05-12 15:18   ` tip-bot2 for Masami Hiramatsu
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Masami Hiramatsu @ 2020-05-12 15:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Masami Hiramatsu, Thomas Gleixner, Alexandre Chartre,
	Peter Zijlstra, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     4fdd88877e5259dcc5e504dca449acc0324d71b9
Gitweb:        https://git.kernel.org/tip/4fdd88877e5259dcc5e504dca449acc0324d71b9
Author:        Masami Hiramatsu <mhiramat@kernel.org>
AuthorDate:    Thu, 26 Mar 2020 23:49:36 +09:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 12 May 2020 17:15:31 +02:00

kprobes: Lock kprobe_mutex while showing kprobe_blacklist

Lock kprobe_mutex while showing kprobe_blacklist to prevent the
kprobe_blacklist from being updated concurrently.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.571125195@linutronix.de

---
 kernel/kprobes.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 2625c24..570d608 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2420,6 +2420,7 @@ static const struct file_operations debugfs_kprobes_operations = {
 /* kprobes/blacklist -- shows which functions can not be probed */
 static void *kprobe_blacklist_seq_start(struct seq_file *m, loff_t *pos)
 {
+	mutex_lock(&kprobe_mutex);
 	return seq_list_start(&kprobe_blacklist, *pos);
 }
 
@@ -2446,10 +2447,15 @@ static int kprobe_blacklist_seq_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static void kprobe_blacklist_seq_stop(struct seq_file *f, void *v)
+{
+	mutex_unlock(&kprobe_mutex);
+}
+
 static const struct seq_operations kprobe_blacklist_seq_ops = {
 	.start = kprobe_blacklist_seq_start,
 	.next  = kprobe_blacklist_seq_next,
-	.stop  = kprobe_seq_stop,	/* Reuse void function */
+	.stop  = kprobe_blacklist_seq_stop,
 	.show  = kprobe_blacklist_seq_show,
 };
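
The pattern here is the usual way to keep a mutex-protected list stable across a seq_file traversal: take the lock in ->start() and drop it in ->stop(), which the seq_file core pairs with every ->start() call. A hypothetical sketch of the same pattern for some other list; the foo_* names are invented for illustration:

#include <linux/seq_file.h>
#include <linux/mutex.h>
#include <linux/list.h>

struct foo_entry {
	struct list_head list;
	int val;
};

static LIST_HEAD(foo_list);
static DEFINE_MUTEX(foo_mutex);

static void *foo_seq_start(struct seq_file *m, loff_t *pos)
{
	mutex_lock(&foo_mutex);		/* held for the whole traversal */
	return seq_list_start(&foo_list, *pos);
}

static void *foo_seq_next(struct seq_file *m, void *v, loff_t *pos)
{
	return seq_list_next(v, &foo_list, pos);
}

static void foo_seq_stop(struct seq_file *m, void *v)
{
	mutex_unlock(&foo_mutex);	/* paired with the lock in ->start() */
}

static int foo_seq_show(struct seq_file *m, void *v)
{
	struct foo_entry *ent = list_entry(v, struct foo_entry, list);

	seq_printf(m, "%d\n", ent->val);
	return 0;
}

static const struct seq_operations foo_seq_ops = {
	.start = foo_seq_start,
	.next  = foo_seq_next,
	.stop  = foo_seq_stop,
	.show  = foo_seq_show,
};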
 

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-05 13:16 ` [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling Thomas Gleixner
                     ` (3 preceding siblings ...)
  2020-05-07 17:35   ` Andy Lutomirski
@ 2020-05-13 20:56   ` Mathieu Desnoyers
  2020-05-13 21:10     ` Steven Rostedt
  4 siblings, 1 reply; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 20:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:

> Make sure task_work runs before any kind of userspace -- very much
> including signals -- is invoked.

What is missing from this patch description is: _why_ is this deemed
useful ?

Also, color me confused: is "do_signal()" actually running any user-space,
or just setting up the user-space stack for eventual return to signal handler ?

Also, it might be OK, but we're changing the order of two things which
have effects on each other: restartable sequences abort fixup for preemption
and do_signal(), which also have effects on rseq abort.

Because those two will cause the abort to trigger, I suspect changing
the order might be OK, but we really need to think this through.

Thanks,

Mathieu

> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/entry/common.c |    8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
> 
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -156,16 +156,16 @@ static void exit_to_usermode_loop(struct
> 		if (cached_flags & _TIF_PATCH_PENDING)
> 			klp_update_patch_state(current);
> 
> -		/* deal with pending signal delivery */
> -		if (cached_flags & _TIF_SIGPENDING)
> -			do_signal(regs);
> -
> 		if (cached_flags & _TIF_NOTIFY_RESUME) {
> 			clear_thread_flag(TIF_NOTIFY_RESUME);
> 			tracehook_notify_resume(regs);
> 			rseq_handle_notify_resume(NULL, regs);
> 		}
> 
> +		/* deal with pending signal delivery */
> +		if (cached_flags & _TIF_SIGPENDING)
> +			do_signal(regs);
> +
> 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>  			fire_user_return_notifiers();

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-13 20:56   ` Mathieu Desnoyers
@ 2020-05-13 21:10     ` Steven Rostedt
  2020-05-13 22:48       ` Mathieu Desnoyers
  2020-05-14  0:12       ` Thomas Gleixner
  0 siblings, 2 replies; 178+ messages in thread
From: Steven Rostedt @ 2020-05-13 21:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra

On Wed, 13 May 2020 16:56:41 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
> 
> > Make sure task_work runs before any kind of userspace -- very much
> > including signals -- is invoked.  
> 
> What is missing from this patch description is: _why_ is this deemed
> useful ?
> 
> Also, color me confused: is "do_signal()" actually running any user-space,
> or just setting up the user-space stack for eventual return to signal handler ?
> 
> Also, it might be OK, but we're changing the order of two things which
> have effects on each other: restartable sequences abort fixup for preemption
> and do_signal(), which also have effects on rseq abort.
> 
> Because those two will cause the abort to trigger, I suspect changing
> the order might be OK, but we really need to think this through.
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> > arch/x86/entry/common.c |    8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -156,16 +156,16 @@ static void exit_to_usermode_loop(struct
> > 		if (cached_flags & _TIF_PATCH_PENDING)
> > 			klp_update_patch_state(current);
> > 
> > -		/* deal with pending signal delivery */
> > -		if (cached_flags & _TIF_SIGPENDING)
> > -			do_signal(regs);
> > -
> > 		if (cached_flags & _TIF_NOTIFY_RESUME) {
> > 			clear_thread_flag(TIF_NOTIFY_RESUME);
> > 			tracehook_notify_resume(regs);
> > 			rseq_handle_notify_resume(NULL, regs);
> > 		}
> > 
> > +		/* deal with pending signal delivery */
> > +		if (cached_flags & _TIF_SIGPENDING)
> > +			do_signal(regs);

Looking deeper into this, it appears that do_signal() can freeze or kill the
task.

That is, it won't go back to user space here, but simply schedule out (being
traced) or even exit (killed).

Before the resume hooks would never be called in such cases, and now they
are.

-- Steve


> > +
> > 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
> >  			fire_user_return_notifiers();  
> 


* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-13 21:10     ` Steven Rostedt
@ 2020-05-13 22:48       ` Mathieu Desnoyers
  2020-05-14  0:12       ` Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 22:48 UTC (permalink / raw)
  To: rostedt
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra

----- On May 13, 2020, at 5:10 PM, rostedt rostedt@goodmis.org wrote:

> On Wed, 13 May 2020 16:56:41 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
>> 
>> > Make sure task_work runs before any kind of userspace -- very much
>> > including signals -- is invoked.
>> 
>> What is missing from this patch description is: _why_ is this deemed
>> useful ?
>> 
>> Also, color me confused: is "do_signal()" actually running any user-space,
>> or just setting up the user-space stack for eventual return to signal handler ?
>> 
>> Also, it might be OK, but we're changing the order of two things which
>> have effects on each other: restartable sequences abort fixup for preemption
>> and do_signal(), which also have effects on rseq abort.
>> 
>> Because those two will cause the abort to trigger, I suspect changing
>> the order might be OK, but we really need to think this through.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> > 
>> > Suggested-by: Andy Lutomirski <luto@kernel.org>
>> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> > ---
>> > arch/x86/entry/common.c |    8 ++++----
>> > 1 file changed, 4 insertions(+), 4 deletions(-)
>> > 
>> > --- a/arch/x86/entry/common.c
>> > +++ b/arch/x86/entry/common.c
>> > @@ -156,16 +156,16 @@ static void exit_to_usermode_loop(struct
>> > 		if (cached_flags & _TIF_PATCH_PENDING)
>> > 			klp_update_patch_state(current);
>> > 
>> > -		/* deal with pending signal delivery */
>> > -		if (cached_flags & _TIF_SIGPENDING)
>> > -			do_signal(regs);
>> > -
>> > 		if (cached_flags & _TIF_NOTIFY_RESUME) {
>> > 			clear_thread_flag(TIF_NOTIFY_RESUME);
>> > 			tracehook_notify_resume(regs);
>> > 			rseq_handle_notify_resume(NULL, regs);
>> > 		}
>> > 
>> > +		/* deal with pending signal delivery */
>> > +		if (cached_flags & _TIF_SIGPENDING)
>> > +			do_signal(regs);
> 
> Looking deeper into this, it appears that do_signal() can freeze or kill the
> task.
> 
> That is, it won't go back to user space here, but simply schedule out (being
> traced) or even exit (killed).
> 
> Before the resume hooks would never be called in such cases, and now they
> are.

Regarding swapping the order of tracehook vs do_signal: is the task really resumed
if it gets frozen or killed ? What is this change trying to accomplish, and why
is it needed in the first place ?

Regarding swapping the order of rseq wrt do_signal: why is it needed, and has
this been tested ?

Thanks,

Mathieu

> 
> -- Steve
> 
> 
>> > +
>> > 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>> >  			fire_user_return_notifiers();

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-05 13:16 ` [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic() Thomas Gleixner
  2020-05-06 15:34   ` Alexandre Chartre
  2020-05-07 17:46   ` Andy Lutomirski
@ 2020-05-13 22:57   ` Mathieu Desnoyers
  2020-05-14  0:13     ` Steven Rostedt
  2020-05-15  9:34     ` Thomas Gleixner
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Thomas Gleixner
  3 siblings, 2 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 22:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:

> This is completely overengineered and definitely not an interface which
> should be made available to anything else than this particular MCE case.

This patch introduces a significant change under the radar (not explained
in the changelog): it turns preempt_enable_no_resched() into preempt_enable().

Why, and why was it a no_resched() in the first place ? Was it for performance
or correctness reasons ?

Thanks,

Mathieu

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/include/asm/traps.h   |    2 --
> arch/x86/kernel/cpu/mce/core.c |    6 ++++--
> arch/x86/kernel/traps.c        |   37 -------------------------------------
> 3 files changed, 4 insertions(+), 41 deletions(-)
> 
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -120,8 +120,6 @@ asmlinkage void smp_irq_move_cleanup_int
> 
> extern void ist_enter(struct pt_regs *regs);
> extern void ist_exit(struct pt_regs *regs);
> -extern void ist_begin_non_atomic(struct pt_regs *regs);
> -extern void ist_end_non_atomic(void);
> 
> #ifdef CONFIG_VMAP_STACK
> void __noreturn handle_stack_overflow(const char *message,
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1352,13 +1352,15 @@ void notrace do_machine_check(struct pt_
> 
> 	/* Fault was in user mode and we need to take some action */
> 	if ((m.cs & 3) == 3) {
> -		ist_begin_non_atomic(regs);
> +		/* If this triggers there is no way to recover. Die hard. */
> +		BUG_ON(!on_thread_stack() || !user_mode(regs));
> 		local_irq_enable();
> +		preempt_enable();
> 
> 		if (kill_it || do_memory_failure(&m))
> 			force_sig(SIGBUS);
> +		preempt_disable();
> 		local_irq_disable();
> -		ist_end_non_atomic();
> 	} else {
> 		if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
> 			mce_panic("Failed kernel mode recovery", &m, msg);
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -117,43 +117,6 @@ void ist_exit(struct pt_regs *regs)
> 		rcu_nmi_exit();
> }
> 
> -/**
> - * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
> - * @regs:	regs passed to the IST exception handler
> - *
> - * IST exception handlers normally cannot schedule.  As a special
> - * exception, if the exception interrupted userspace code (i.e.
> - * user_mode(regs) would return true) and the exception was not
> - * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
> - * begins a non-atomic section within an ist_enter()/ist_exit() region.
> - * Callers are responsible for enabling interrupts themselves inside
> - * the non-atomic section, and callers must call ist_end_non_atomic()
> - * before ist_exit().
> - */
> -void ist_begin_non_atomic(struct pt_regs *regs)
> -{
> -	BUG_ON(!user_mode(regs));
> -
> -	/*
> -	 * Sanity check: we need to be on the normal thread stack.  This
> -	 * will catch asm bugs and any attempt to use ist_preempt_enable
> -	 * from double_fault.
> -	 */
> -	BUG_ON(!on_thread_stack());
> -
> -	preempt_enable_no_resched();
> -}
> -
> -/**
> - * ist_end_non_atomic() - begin a non-atomic section in an IST exception
> - *
> - * Ends a non-atomic section started with ist_begin_non_atomic().
> - */
> -void ist_end_non_atomic(void)
> -{
> -	preempt_disable();
> -}
> -
> int is_valid_bugaddr(unsigned long addr)
> {
>  	unsigned short ud;

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 19/36] x86/entry: Exclude low level entry code from sanitizing
  2020-05-06 16:03   ` Alexandre Chartre
@ 2020-05-13 22:58     ` Mathieu Desnoyers
  0 siblings, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 22:58 UTC (permalink / raw)
  To: Alexandre Chartre, Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, rostedt, Joel Fernandes, Google, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra

----- On May 6, 2020, at 12:03 PM, Alexandre Chartre alexandre.chartre@oracle.com wrote:

> On 5/5/20 3:16 PM, Thomas Gleixner wrote:
>> The sanitizers are not really applicable to the fragile low level entry
>> code. code. Entry code needs to carefully setup a normal 'runtime'
> 
> typo: code. code.

Looking at the updated tree, the Reviewed-by is there, but there is still
way too much "code". ;)

Thanks,

Mathieu

> 
> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> 
> alex.
> 
> 
>> environment.
>> 
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>>   arch/x86/entry/Makefile |    8 ++++++++
>>   1 file changed, 8 insertions(+)
>> 
>> --- a/arch/x86/entry/Makefile
>> +++ b/arch/x86/entry/Makefile
>> @@ -3,6 +3,14 @@
>>   # Makefile for the x86 low level entry code
>>   #
>>   
>> +KASAN_SANITIZE := n
>> +UBSAN_SANITIZE := n
>> +KCOV_INSTRUMENT := n
>> +
>> +CFLAGS_REMOVE_common.o = $(CC_FLAGS_FTRACE) -fstack-protector
>> -fstack-protector-strong
>> +CFLAGS_REMOVE_syscall_32.o = $(CC_FLAGS_FTRACE) -fstack-protector
>> -fstack-protector-strong
>> +CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) -fstack-protector
>> -fstack-protector-strong
>> +
>>   OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
>>   
>>   CFLAGS_syscall_64.o		+= $(call cc-option,-Wno-override-init,)

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 23/36] bug: Annotate WARN/BUG/stackfail as noinstr safe
  2020-05-05 13:16 ` [patch V4 part 1 23/36] bug: Annotate WARN/BUG/stackfail as noinstr safe Thomas Gleixner
@ 2020-05-13 23:12   ` Mathieu Desnoyers
  2020-05-19 19:58   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 23:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:

> Warnings, bugs and stack protection fails from noinstr sections, e.g. low
> level and early entry code, are likely to be fatal.
> 
> Mark them as "safe" to be invoked from noinstr protected code to avoid
> annotating all usage sites. Getting the information out is important.

Why instrument at the x86 level (and miss other architectures) when this
could perhaps be done directly in the macro WARN_ON_ONCE(condition) in
generic code ?
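
For concreteness, a hypothetical (untested) sketch of doing it in the
generic macro instead, wrapping the arch hook:

#define WARN_ON_ONCE(condition) ({				\
	int __ret_warn_on = !!(condition);			\
	if (unlikely(__ret_warn_on)) {				\
		instr_begin();					\
		__WARN_FLAGS(BUGFLAG_ONCE |			\
			     BUGFLAG_TAINT(TAINT_WARN));	\
		instr_end();					\
	}							\
	unlikely(__ret_warn_on);				\
})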

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/include/asm/bug.h |    3 +++
> include/asm-generic/bug.h  |    9 +++++++--
> kernel/panic.c             |    4 +++-
> 3 files changed, 13 insertions(+), 3 deletions(-)
> 
> --- a/arch/x86/include/asm/bug.h
> +++ b/arch/x86/include/asm/bug.h
> @@ -70,13 +70,16 @@ do {									\
> #define HAVE_ARCH_BUG
> #define BUG()							\
> do {								\
> +	instr_begin();						\
> 	_BUG_FLAGS(ASM_UD2, 0);					\
> 	unreachable();						\
> } while (0)
> 
> #define __WARN_FLAGS(flags)					\
> do {								\
> +	instr_begin();						\
> 	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
> +	instr_end();						\
> 	annotate_reachable();					\
> } while (0)

riscv, arm64, s390, powerpc, parisc and sh also have __WARN_FLAGS.

> 
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -83,14 +83,19 @@ extern __printf(4, 5)
> void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
> 		       const char *fmt, ...);
> #define __WARN()		__WARN_printf(TAINT_WARN, NULL)
> -#define __WARN_printf(taint, arg...)					\
> -	warn_slowpath_fmt(__FILE__, __LINE__, taint, arg)
> +#define __WARN_printf(taint, arg...) do {				\
> +		instr_begin();						\
> +		warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);	\
> +		instr_end();						\
> +	} while (0)
> #else
> extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
> #define __WARN()		__WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
> #define __WARN_printf(taint, arg...) do {				\
> +		instr_begin();						\
> 		__warn_printk(arg);					\
> 		__WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
> +		instr_end();						\
> 	} while (0)
> #define WARN_ON_ONCE(condition) ({				\

Moving the instr_begin/end here should fix it ?

Thanks,

Mathieu

> 	int __ret_warn_on = !!(condition);			\
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -662,10 +662,12 @@ device_initcall(register_warn_debugfs);
>  * Called when gcc's -fstack-protector feature is used, and
>  * gcc detects corruption of the on-stack canary value
>  */
> -__visible void __stack_chk_fail(void)
> +__visible noinstr void __stack_chk_fail(void)
> {
> +	instr_begin();
> 	panic("stack-protector: Kernel stack is corrupted in: %pB",
> 		__builtin_return_address(0));
> +	instr_end();
> }
>  EXPORT_SYMBOL(__stack_chk_fail);

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-05 13:16 ` [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion Thomas Gleixner
@ 2020-05-13 23:28   ` Mathieu Desnoyers
  2020-05-15 14:04     ` Frederic Weisbecker
  2020-05-15 21:29   ` Thomas Gleixner
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Frederic Weisbecker
  2 siblings, 1 reply; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 23:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra, Catalin Marinas

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:

> +#define arch_nmi_enter()						\
[...]							\
> +	___hcr = read_sysreg(hcr_el2);					\
> +	if (!(___hcr & HCR_TGE)) {					\
> +		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
> +		isb();							\

Why is there an isb() above ^ ....

> +	}								\
> +	/*								\
[...]
> -#define arch_nmi_exit()								\
[...]
> +	/*								\
> +	 * Make sure ___ctx->cnt release is visible before we		\
> +	 * restore the sysreg. Otherwise a new NMI occurring		\
> +	 * right after write_sysreg() can be fooled and think		\
> +	 * we secured things for it.					\
> +	 */								\
> +	barrier();							\
> +	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
> +		write_sysreg(___hcr, hcr_el2);				\

And not here ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-05 13:16 ` [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work Thomas Gleixner
  2020-05-07 18:02   ` Andy Lutomirski
@ 2020-05-13 23:42   ` Mathieu Desnoyers
  2020-05-14 17:38     ` Thomas Gleixner
  2020-05-14 14:17   ` Borislav Petkov
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 23:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 

Patch title: singal -> signal.

> Convert #MC over to using task_work_add(); it will run the same code
> slightly later, on the return to user path of the same exception.

So I suspect that switching the order between tracehook_notify_resume()
(which ends up calling task_work_run()) and do_signal() done by an
earlier patch in this series intends to ensure the information about the
instruction pointer causing the #MC is not overwritten by do_signal()
(but I'm just guessing).

If it's the case, I think it should be clearly stated as the intent of the
earlier patch.
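
For reference, the conversion itself boils down to roughly the following
(simplified sketch; the callback and the task_struct field name are
assumptions, not taken verbatim from the patch):

/* callback queued from do_machine_check(), runs on return to user mode */
static void kill_me_now(struct callback_head *cb)
{
	force_sig(SIGBUS);
}

	/* in do_machine_check(), for a recoverable user-mode #MC: */
	init_task_work(&current->mce_kill_me, kill_me_now);
	task_work_add(current, &current->mce_kill_me, true);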

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 30/36] lockdep: Always inline lockdep_{off,on}()
  2020-05-05 13:16 ` [patch V4 part 1 30/36] lockdep: Always inline lockdep_{off,on}() Thomas Gleixner
@ 2020-05-13 23:46   ` Mathieu Desnoyers
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-13 23:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
[...]
> + * Split the recrursion counter in two to readily detect 'off' vs recursion.

recrursion -> recursion

> + */
> +#define LOCKDEP_RECURSION_BITS	16
> +#define LOCKDEP_OFF		(1U << LOCKDEP_RECURSION_BITS)
> +#define LOCKDEP_RECURSION_MASK	(LOCKDEP_OFF - 1)
> +
> +/*
> + * lockdep_{off,on}() are macros to avoid tracing and kprobes; not inlines due
> + * to header dependencies.
> + */
> +
> +#define lockdep_off()					\
> +do {							\
> +	current->lockdep_recursion += LOCKDEP_OFF;	\
> +} while (0)
> +
> +#define lockdep_on()					\
> +do {							\
> +	current->lockdep_recursion -= LOCKDEP_OFF;	\
> +} while (0)

Now that those on/off are macros rather than functions, I wonder if
adding compiler barriers would be relevant ?
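
For concreteness, a hypothetical variant with explicit barriers (not what
the patch does):

#define lockdep_off()					\
do {							\
	barrier();					\
	current->lockdep_recursion += LOCKDEP_OFF;	\
	barrier();					\
} while (0)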

Thanks,

Mathieu

> 
> extern void lockdep_register_key(struct lock_class_key *key);
> extern void lockdep_unregister_key(struct lock_class_key *key);
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -393,25 +393,6 @@ void lockdep_init_task(struct task_struc
> 	task->lockdep_recursion = 0;
> }
> 
> -/*
> - * Split the recrursion counter in two to readily detect 'off' vs recursion.
> - */
> -#define LOCKDEP_RECURSION_BITS	16
> -#define LOCKDEP_OFF		(1U << LOCKDEP_RECURSION_BITS)
> -#define LOCKDEP_RECURSION_MASK	(LOCKDEP_OFF - 1)
> -
> -void lockdep_off(void)
> -{
> -	current->lockdep_recursion += LOCKDEP_OFF;
> -}
> -EXPORT_SYMBOL(lockdep_off);
> -
> -void lockdep_on(void)
> -{
> -	current->lockdep_recursion -= LOCKDEP_OFF;
> -}
> -EXPORT_SYMBOL(lockdep_on);
> -
> static inline void lockdep_recursion_finish(void)
> {
>  	if (WARN_ON_ONCE(--current->lockdep_recursion))

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-13 21:10     ` Steven Rostedt
  2020-05-13 22:48       ` Mathieu Desnoyers
@ 2020-05-14  0:12       ` Thomas Gleixner
  2020-05-14  0:37         ` Steven Rostedt
                           ` (2 more replies)
  1 sibling, 3 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-14  0:12 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

Steven, Mathieu

(combo reply)

Steven Rostedt <rostedt@goodmis.org> writes:
> On Wed, 13 May 2020 16:56:41 -0400 (EDT)
>> > +		/* deal with pending signal delivery */
>> > +		if (cached_flags & _TIF_SIGPENDING)
>> > +			do_signal(regs);
>
> Looking deeper into this, it appears that do_signal() can freeze or kill the
> task.
>
> That is, it won't go back to user space here, but simply schedule out (being
> traced) or even exit (killed).
>
> Before the resume hooks would never be called in such cases, and now they
> are.

It theoretically matters because pending task work might kill the
task. That's the concern Andy and Peter had. Assume the following:

usermode

 -> exception
    set not fatal signal

    -> exception
        queue task work to kill task
    <- return

  <- return

The same could happen when the non fatal signal is set from a remote CPU.

So in theory that would result in:

   handle non fatal signal first

   handle task work which kills task

which would be the wrong order.

But that's just illusion.

>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

>> Also, color me confused: is "do_signal()" actually running any user-space,
>> or just setting up the user-space stack for eventual return to signal
>> handler ?

I'm surprised that you can't answer that question yourself. How did you
ever make rseq work and how did rseq_signal_deliver() end up in
setup_rt_frame()?

Hint: Tracing might answer that question

And to cut it short:

    Exit to user space happens only through ONE channel, i.e. leaving
    prepare_exit_to_usermode().

      exit_to_usermode_loop <-prepare_exit_to_usermode
      do_signal <-exit_to_usermode_loop
      get_signal <-do_signal
      setup_sigcontext <-do_signal
      do_syscall_64 <-entry_SYSCALL_64_after_hwframe
      syscall_trace_enter <-do_syscall_64

      sys_rt_sigreturn()
      restore_altstack <-__ia32_sys_rt_sigreturn
      syscall_slow_exit_work <-do_syscall_64
      exit_to_usermode_loop <-do_syscall_64

>> Also, it might be OK, but we're changing the order of two things which
>> have effects on each other: restartable sequences abort fixup for preemption
>> and do_signal(), which also have effects on rseq abort.
>> 
>> Because those two will cause the abort to trigger, I suspect changing
>> the order might be OK, but we really need to think this through.

That's a purely academic problem. The order is completely
irrelevant. You have to handle any order anyway:

usermode

  -> exception / syscall
       sets signal

   <- return

  prepare_exit_to_usermode()
      cached_flags = READ_ONCE(t->flags);
      exit_to_user_mode_loop(regs, cached_flags) {
        while (cached_flags) {
           local_irq_enable();

           handle(cached_flags & RESCHED);
           handle(cached_flags & UPROBE);
           handle(cached_flags & PATCHING);
           handle(cached_flags & SIGNAL);
           handle(cached_flags & NOTIFY_RESUME);
           handle(cached_flags & RETURN_NOTIFY);

           local_irq_disable();
           
           cached_flags = READ_ONCE(t->flags);
         }

cached_flags is a momentary snapshot when attempting to return to user
space.

But after reenabling interrupts any of the relevant flag bits can be set
by an exception/interrupt or from remote. If preemption is enabled the
task can be scheduled out, migrated at any point before disabling
interrupts again. Even after disabling interrupts and before re-reading
cached flags there might be a remote change of flags.

That said, even for the case Andy and Peter were looking at (MCE) the
ordering is completely irrelevant.

I should have thought about this before, so thanks to both of you for
making me look at it again for the very wrong reasons.

Consider the patch dropped.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter() Thomas Gleixner
  2020-05-07 18:04   ` Andy Lutomirski
@ 2020-05-14  0:12   ` Mathieu Desnoyers
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-14  0:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> A few exceptions (like #DB and #BP) can happen at any location in the code,
> this then means that tracers should treat events from these exceptions as
> NMI-like. The interrupted context could be holding locks with interrupts
> disabled for instance.
> 
> Similarly, #MC is an actual NMI-like exception.
> 
> All of them use ist_enter() which only concerns itself with RCU, but does
> not do any of the other setup that NMIs need. This means things like:
> 
>	printk()
>	  raw_spin_lock_irq(&logbuf_lock);
>	  <#DB/#BP/#MC>
>	     printk()
>	       raw_spin_lock_irq(&logbuf_lock);
> 
> are entirely possible (well, not really since printk tries hard to
> play nice, but the concept stands).
> 
> So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter()
> caller must be both notrace and NOKPROBE, or in the noinstr text section.

Are there similar issues with non-x86 architectures, or is this
exception-behaves-like-an-interrupt issue specific to x86 ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-13 22:57   ` Mathieu Desnoyers
@ 2020-05-14  0:13     ` Steven Rostedt
  2020-05-15  9:34     ` Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: Steven Rostedt @ 2020-05-14  0:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon

On Wed, 13 May 2020 18:57:25 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
> 
> > This is completely overengineered and definitely not an interface which
> > should be made available to anything else than this particular MCE case.  
> 
> This patch introduces a significant change under the radar (not explained
> in the changelog): it turns preempt_enable_no_resched() into preempt_enable().
> 
> Why, and why was it a no_resched() in the first place ? Was it for performance
> or correctness reasons ?

I believe the reason for no_resched, is because it was within a
local_irq_disable() section, which means it couldn't schedule anyway.
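
For illustration, the difference boils down to roughly this (simplified
sketch, not the exact kernel definitions):

#define preempt_enable_no_resched()				\
do {								\
	barrier();						\
	preempt_count_dec();	/* drop the count, never reschedule */	\
} while (0)

#define preempt_enable()					\
do {								\
	barrier();						\
	if (unlikely(preempt_count_dec_and_test()))		\
		__preempt_schedule();	/* may reschedule right here */	\
} while (0)

And with interrupts disabled the preempt_schedule() path bails out anyway
(preemptible() is false), so the practical difference is only *when* a
pending reschedule gets acted upon.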

-- Steve


> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -1352,13 +1352,15 @@ void notrace do_machine_check(struct pt_
> > 
> > 	/* Fault was in user mode and we need to take some action */
> > 	if ((m.cs & 3) == 3) {
> > -		ist_begin_non_atomic(regs);
> > +		/* If this triggers there is no way to recover.
> > Die hard. */
> > +		BUG_ON(!on_thread_stack() || !user_mode(regs));
> > 		local_irq_enable();
> > +		preempt_enable();
> > 
> > 		if (kill_it || do_memory_failure(&m))
> > 			force_sig(SIGBUS);
> > +		preempt_disable();
> > 		local_irq_disable();
> > -		ist_end_non_atomic();
> > 	} else {
> > 		if (!fixup_exception(regs, X86_TRAP_MC, error_code,


> > -void ist_begin_non_atomic(struct pt_regs *regs)
> > -{
> > -	BUG_ON(!user_mode(regs));
> > -
> > -	/*
> > -	 * Sanity check: we need to be on the normal thread
> > stack.  This
> > -	 * will catch asm bugs and any attempt to use
> > ist_preempt_enable
> > -	 * from double_fault.
> > -	 */
> > -	BUG_ON(!on_thread_stack());
> > -
> > -	preempt_enable_no_resched();
> > -}
> > -
> > -/**
> > - * ist_end_non_atomic() - begin a non-atomic section in an IST
> > exception
> > - *
> > - * Ends a non-atomic section started with ist_begin_non_atomic().
> > - */
> > -void ist_end_non_atomic(void)
> > -{
> > -	preempt_disable();
> > -}
> > -
> > int is_valid_bugaddr(unsigned long addr)
> > {
> >  	unsigned short ud;  
> 


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-14  0:12       ` Thomas Gleixner
@ 2020-05-14  0:37         ` Steven Rostedt
  2020-05-14  0:49           ` Thomas Gleixner
  2020-05-14  1:22         ` Andy Lutomirski
  2020-05-14  2:51         ` Mathieu Desnoyers
  2 siblings, 1 reply; 178+ messages in thread
From: Steven Rostedt @ 2020-05-14  0:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra

On Thu, 14 May 2020 02:12:21 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> I should have thought about this before, so thanks to both of you for
> making me look at it again for the very wrong reasons.
> 
> Consider the patch dropped.

I'm glad our incompetence could be of service to you :-/

-- Steve

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-14  0:37         ` Steven Rostedt
@ 2020-05-14  0:49           ` Thomas Gleixner
  0 siblings, 0 replies; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-14  0:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra

Steven Rostedt <rostedt@goodmis.org> writes:

> On Thu, 14 May 2020 02:12:21 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> I should have thought about this before, so thanks to both of you for
>> making me look at it again for the very wrong reasons.
>> 
>> Consider the patch dropped.
>
> I'm glad our incompetence could be of service to you :-/

Asking incompetent questions is perfectly fine depending on who is
asking them.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-14  0:12       ` Thomas Gleixner
  2020-05-14  0:37         ` Steven Rostedt
@ 2020-05-14  1:22         ` Andy Lutomirski
  2020-05-14  2:51         ` Mathieu Desnoyers
  2 siblings, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-14  1:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel, x86, paulmck,
	Andy Lutomirski, Alexandre Chartre, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Joel Fernandes, Google, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra



> On May 13, 2020, at 5:12 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Steven, Mathieu
> 
> (combo reply)
> 
> Steven Rostedt <rostedt@goodmis.org> writes:
>> On Wed, 13 May 2020 16:56:41 -0400 (EDT)
>>>> +        /* deal with pending signal delivery */
>>>> +        if (cached_flags & _TIF_SIGPENDING)
>>>> +            do_signal(regs);
>> 
>> Looking deeper into this, it appears that do_signal() can freeze or kill the
>> task.
>> 
>> That is, it won't go back to user space here, but simply schedule out (being
>> traced) or even exit (killed).
>> 
>> Before the resume hooks would never be called in such cases, and now they
>> are.
> 
> It theoretically matters because pending task work might kill the
> task. That's the concern Andy and Peter had. Assume the following:
> 
> usermode
> 
> -> exception
>    set not fatal signal
> 
>    -> exception
>        queue task work to kill task
>    <- return
> 
>  <- return
> 
> The same could happen when the non fatal signal is set from a remote CPU.
> 
> So in theory that would result in:
> 
>   handle non fatal signal first
> 
>   handle task work which kills task
> 
> which would be the wrong order.
> 
> But that's just illusion.
> 
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>> Also, color me confused: is "do_signal()" actually running any user-space,
>>> or just setting up the user-space stack for eventual return to signal
>>> handler ?
> 
> I'm surprised that you can't answer that question yourself. How did you
> ever make rseq work and how did rseq_signal_deliver() end up in
> setup_rt_frame()?
> 
> Hint: Tracing might answer that question
> 
> And to cut it short:
> 
>    Exit to user space happens only through ONE channel, i.e. leaving
>    prepare_exit_to_usermode().
> 
>      exit_to_usermode_loop <-prepare_exit_to_usermode
>      do_signal <-exit_to_usermode_loop
>      get_signal <-do_signal
>      setup_sigcontext <-do_signal
>      do_syscall_64 <-entry_SYSCALL_64_after_hwframe
>      syscall_trace_enter <-do_syscall_64
> 
>      sys_rt_sigreturn()
>      restore_altstack <-__ia32_sys_rt_sigreturn
>      syscall_slow_exit_work <-do_syscall_64
>      exit_to_usermode_loop <-do_syscall_64
> 
>>> Also, it might be OK, but we're changing the order of two things which
>>> have effects on each other: restartable sequences abort fixup for preemption
>>> and do_signal(), which also have effects on rseq abort.
>>> 
>>> Because those two will cause the abort to trigger, I suspect changing
>>> the order might be OK, but we really need to think this through.
> 
> That's a purely academic problem. The order is completely
> irrelevant. You have to handle any order anyway:
> 
> usermode
> 
>  -> exception / syscall
>       sets signal
> 
>   <- return
> 
>  prepare_exit_to_usermode()
>      cached_flags = READ_ONCE(t->flags);
>      exit_to_user_mode_loop(regs, cached_flags) {
>        while (cached_flags) {
>           local_irq_enable();
> 
>           handle(cached_flags & RESCHED);
>           handle(cached_flags & UPROBE);
>           handle(cached_flags & PATCHING);
>           handle(cached_flags & SIGNAL);
>           handle(cached_flags & NOTIFY_RESUME);
>           handle(cached_flags & RETURN_NOTIFY);
> 
>           local_irq_disable();
> 
>           cached_flags = READ_ONCE(t->flags);
>         }
> 
> cached_flags is a momentary snapshot when attempting to return to user
> space.
> 
> But after reenabling interrupts any of the relevant flag bits can be set
> by an exception/interrupt or from remote. If preemption is enabled the
> task can be scheduled out, migrated at any point before disabling
> interrupts again. Even after disabling interrupts and before re-reading
> cached flags there might be a remote change of flags.
> 
> That said, even for the case Andy and Peter were looking at (MCE) the
> ordering is completely irrelevant.
> 
> I should have thought about this before, so thanks to both of you for
> making me look at it again for the very wrong reasons.
> 
> Consider the patch dropped.

I disagree.

There is only one relevant MCE case: #MC hits user code and tries to recover.

Right now, this works via the ist_begin_non_atomic hack. But the series changes it so that it uses task_work(). So now the kernel ends up in prepare_exit_to_usermode() and the machine check work is pending, but a signal might *also* be pending due to another CPU setting the bit.

If we process the signal first, we could block on userfaultfd or whatever, and now we could take forever to finish.

Heck, even with the order changed, we could get preempted, return to another user task, and get a new machine check. Ick.

So I agree that this patch is problematic, and it doesn’t fully fix the problem, but I do believe that the task_work thing is problematic.

Rumor has it that it gets improved a bit farther along in the series, but I’m still plodding through.



> 
> Thanks,
> 
>        tglx

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-14  0:12       ` Thomas Gleixner
  2020-05-14  0:37         ` Steven Rostedt
  2020-05-14  1:22         ` Andy Lutomirski
@ 2020-05-14  2:51         ` Mathieu Desnoyers
  2020-05-14  9:19           ` Peter Zijlstra
  2 siblings, 1 reply; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-14  2:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: rostedt, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra

----- On May 13, 2020, at 8:12 PM, Thomas Gleixner tglx@linutronix.de wrote:
[...]
> 
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>> Also, color me confused: is "do_signal()" actually running any user-space,
>>> or just setting up the user-space stack for eventual return to signal
>>> handler ?
> 
> I'm surprised that you can't answer that question yourself. How did you
> ever make rseq work and how did rseq_signal_deliver() end up in
> setup_rt_frame()?
> 
> Hint: Tracing might answer that question
> 
> And to cut it short:
> 
>    Exit to user space happens only through ONE channel, i.e. leaving
>    prepare_exit_to_usermode().
> 

[...]

Yes, I'm very well aware of this. But the patch commit message states:

"Make sure task_work runs before any kind of userspace -- very much
including signals -- is invoked."

which seems to imply that "userspace" can be "invoked" before the task_work
runs. Which makes no sense whatsoever. Hence my confused state.

>>> Also, it might be OK, but we're changing the order of two things which
>>> have effects on each other: restartable sequences abort fixup for preemption
>>> and do_signal(), which also have effects on rseq abort.
>>> 
>>> Because those two will cause the abort to trigger, I suspect changing
>>> the order might be OK, but we really need to think this through.
> 
> That's a purely academic problem. The order is completely
> irrelevant. You have to handle any order anyway:

Yes indeed, whether a signal handler frame fixup or a return IP
fixup fires first (clearing the rseq_cs pointer in the process) is
irrelevant, because either will have the same effect on the user-space
program's flow. And as you say, given it is run in a loop and can be preempted,
any order can happen here, so we have to be prepared for it. This loop
has caused me tons of headaches when stress-testing on NUMA machines by
the way.

> That said, even for the case Andy and Peter were looking at (MCE) the
> ordering is completely irrelevant.

Not sure about that, see Andy's follow up.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling
  2020-05-14  2:51         ` Mathieu Desnoyers
@ 2020-05-14  9:19           ` Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: Peter Zijlstra @ 2020-05-14  9:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, rostedt, linux-kernel, x86, paulmck,
	Andy Lutomirski, Alexandre Chartre, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Joel Fernandes, Google, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Josh Poimboeuf, Will Deacon

On Wed, May 13, 2020 at 10:51:53PM -0400, Mathieu Desnoyers wrote:

> Yes, I'm very well aware of this. But the patch commit message states:
> 
> "Make sure task_work runs before any kind of userspace -- very much
> including signals -- is invoked."
> 
> which seems to imply that "userspace" can be "invoked" before the task_work
> runs. Which makes no sense whatsoever. Hence my confused state.

I initially missed the task_work_run() call in the signal handling maze.
Then later figured this order still made more sense, but apparently there's
an actual problem.

I've no problem with the patch getting dropped.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-07 18:02   ` Andy Lutomirski
  2020-05-08  8:48     ` Peter Zijlstra
@ 2020-05-14 14:16     ` Borislav Petkov
  1 sibling, 0 replies; 178+ messages in thread
From: Borislav Petkov @ 2020-05-14 14:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra (Intel)

On Thu, May 07, 2020 at 11:02:09AM -0700, Andy Lutomirski wrote:
> IOW, if we want to recover from CPL0 #MC, we will need a different mechanism.

Recovering from CPL0 #MC is mostly doomed to failure. Except this mcsafe
crap with the exception handling:

                /*
                 * Handle an MCE which has happened in kernel space but from
                 * which the kernel can recover: ex_has_fault_handler() has
                 * already verified that the rIP at which the error happened is
                 * a rIP from which the kernel can recover (by jumping to
                 * recovery code specified in _ASM_EXTABLE_FAULT()) and the
                 * corresponding exception handler which would do that is the
                 * proper one.
                 */
                if (m.kflags & MCE_IN_KERNEL_RECOV) {
                        if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
                                mce_panic("Failed kernel mode recovery", &m, msg);


Other than that, we iz done.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-05 13:16 ` [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work Thomas Gleixner
  2020-05-07 18:02   ` Andy Lutomirski
  2020-05-13 23:42   ` Mathieu Desnoyers
@ 2020-05-14 14:17   ` Borislav Petkov
  2020-05-14 16:03     ` Mathieu Desnoyers
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 178+ messages in thread
From: Borislav Petkov @ 2020-05-14 14:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel),
	Tony Luck

+ Tony.

On Tue, May 05, 2020 at 03:16:31PM +0200, Thomas Gleixner wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Convert #MC over to using task_work_add(); it will run the same code
> slightly later, on the return to user path of the same exception.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  arch/x86/kernel/cpu/mce/core.c |   56 ++++++++++++++++++++++-------------------
>  include/linux/sched.h          |    6 ++++
>  2 files changed, 37 insertions(+), 25 deletions(-)

I like this:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-14 14:17   ` Borislav Petkov
@ 2020-05-14 16:03     ` Mathieu Desnoyers
  2020-05-14 16:19       ` Andy Lutomirski
  2020-05-14 16:39       ` Borislav Petkov
  0 siblings, 2 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-14 16:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek, rostedt,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra,
	Tony Luck

----- On May 14, 2020, at 10:17 AM, Borislav Petkov bp@alien8.de wrote:

> + Tony.
> 
> On Tue, May 05, 2020 at 03:16:31PM +0200, Thomas Gleixner wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>> 
>> Convert #MC over to using task_work_add(); it will run the same code
>> slightly later, on the return to user path of the same exception.
>> 
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>> ---
>>  arch/x86/kernel/cpu/mce/core.c |   56 ++++++++++++++++++++++-------------------
>>  include/linux/sched.h          |    6 ++++
>>  2 files changed, 37 insertions(+), 25 deletions(-)
> 
> I like this:
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>

What I am not fully grasping here is whether this patch preserves the instruction
pointer (and possibly other relevant information for siginfo_t) triggering the
exception in a scenario where we have:

- #MC triggered, queuing task work,
- unrelated signal happens to be delivered to task,
- exit to usermode loop handles do_signal first,
- then it runs task work.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-14 16:03     ` Mathieu Desnoyers
@ 2020-05-14 16:19       ` Andy Lutomirski
  2020-05-14 16:39       ` Borislav Petkov
  1 sibling, 0 replies; 178+ messages in thread
From: Andy Lutomirski @ 2020-05-14 16:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Borislav Petkov, Thomas Gleixner, linux-kernel, x86, paulmck,
	Andy Lutomirski, Alexandre Chartre, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, rostedt, Joel Fernandes, Google, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Josh Poimboeuf, Will Deacon,
	Peter Zijlstra, Tony Luck



> On May 14, 2020, at 9:03 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> ----- On May 14, 2020, at 10:17 AM, Borislav Petkov bp@alien8.de wrote:
> 
>> + Tony.
>> 
>>> On Tue, May 05, 2020 at 03:16:31PM +0200, Thomas Gleixner wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>> 
>>> Convert #MC over to using task_work_add(); it will run the same code
>>> slightly later, on the return to user path of the same exception.
>>> 
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>>> ---
>>> arch/x86/kernel/cpu/mce/core.c |   56 ++++++++++++++++++++++-------------------
>>> include/linux/sched.h          |    6 ++++
>>> 2 files changed, 37 insertions(+), 25 deletions(-)
>> 
>> I like this:
>> 
>> Reviewed-by: Borislav Petkov <bp@suse.de>
> 
> What I am not fully grasping here is whether this patch preserves the instruction
> pointer (and possibly other relevant information for siginfo_t) triggering the
> exception in a scenario where we have:
> 
> - #MC triggered, queuing task work,
> - unrelated signal happens to be delivered to task,
> - exit to usermode loop handles do_signal first,
> - then it runs task work.

If anyone wants to ponder this, I suspect that we have lots of delightful bugs in our handling of cr2, trapnr, and error_code in signals. We should move them to the sigcontext, at least in kernel, and fix up ucontext when we deliver the signal.  The current code can’t possibly be correct.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-14 16:03     ` Mathieu Desnoyers
  2020-05-14 16:19       ` Andy Lutomirski
@ 2020-05-14 16:39       ` Borislav Petkov
  2020-05-14 17:05         ` Mathieu Desnoyers
  1 sibling, 1 reply; 178+ messages in thread
From: Borislav Petkov @ 2020-05-14 16:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek, rostedt,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra,
	Tony Luck

On Thu, May 14, 2020 at 12:03:30PM -0400, Mathieu Desnoyers wrote:
> - #MC triggered, queuing task work,
> - unrelated signal happens to be delivered to task,
> - exit to usermode loop handles do_signal first,
> - then it runs task work.

How can that even happen?

exit_to_usermode_loop->do_signal->get_signal and that does:

        if (unlikely(current->task_works))
                task_work_run();

at the top.

So the task work will always run before the signal handler.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-14 16:39       ` Borislav Petkov
@ 2020-05-14 17:05         ` Mathieu Desnoyers
  0 siblings, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-14 17:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek, rostedt,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Will Deacon, Peter Zijlstra,
	Tony Luck

----- On May 14, 2020, at 12:39 PM, Borislav Petkov bp@alien8.de wrote:

> On Thu, May 14, 2020 at 12:03:30PM -0400, Mathieu Desnoyers wrote:
>> - #MC triggered, queuing task work,
>> - unrelated signal happens to be delivered to task,
>> - exit to usermode loop handles do_signal first,
>> - then it runs task work.
> 
> How can that even happen?
> 
> exit_to_usermode_loop->do_signal->get_signal and that does:
> 
>        if (unlikely(current->task_works))
>                task_work_run();
> 
> at the top.
> 
> So the task work will always run before the signal handler.

OK yes, nevermind. I focused on its invocation from tracehook_notify_resume
and missed this invocation in do_signal. My bad.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-13 23:42   ` Mathieu Desnoyers
@ 2020-05-14 17:38     ` Thomas Gleixner
  2020-05-14 17:42       ` Mathieu Desnoyers
  0 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-14 17:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
>
>> From: Peter Zijlstra <peterz@infradead.org>
>> 
>
> Patch title: singal -> signal.
>
>> Convert #MC over to using task_work_add(); it will run the same code
>> slightly later, on the return to user path of the same exception.
>
> So I suspect that switching the order between tracehook_notify_resume()
> (which ends up calling task_work_run()) and do_signal() done by an
> earlier patch in this series intends to ensure the information about the
> instruction pointer causing the #MC is not overwritten by do_signal()
> (but I'm just guessing).

No, it does not. See the ordering discussion.

Aside of that signal never transported any address information. It uses
force_sig(SIGBUS).

Even if a different signal would be sent first then the register frame
of the #MC is still there when the fatal signal is sent later.

But even w/o changing the ordering the taskwork check in do_signal()
runs the pending work before delivering anything.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 29/36] x86/mce: Send #MC singal from task work
  2020-05-14 17:38     ` Thomas Gleixner
@ 2020-05-14 17:42       ` Mathieu Desnoyers
  0 siblings, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-14 17:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra

----- On May 14, 2020, at 1:38 PM, Thomas Gleixner tglx@linutronix.de wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
>>
>>> From: Peter Zijlstra <peterz@infradead.org>
>>> 
>>
>> Patch title: singal -> signal.
>>
>>> Convert #MC over to using task_work_add(); it will run the same code
>>> slightly later, on the return to user path of the same exception.
>>
>> So I suspect that switching the order between tracehook_notify_resume()
>> (which ends up calling task_work_run()) and do_signal() done by an
>> earlier patch in this series intends to ensure the information about the
>> instruction pointer causing the #MC is not overwritten by do_signal()
>> (but I'm just guessing).
> 
> No, it does not. See the ordering discussion.
> 
> Aside of that signal never transported any address information. It uses
> force_sig(SIGBUS).
> 
> Even if a different signal would be sent first then the register frame
> of the #MC is still there when the fatal signal is sent later.
> 
> But even w/o changing the ordering the taskwork check in do_signal()
> runs the pending work before delivering anything.

Yep, that was the key thing I missed,

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-13 22:57   ` Mathieu Desnoyers
  2020-05-14  0:13     ` Steven Rostedt
@ 2020-05-15  9:34     ` Thomas Gleixner
  2020-05-15 13:11       ` Mathieu Desnoyers
  1 sibling, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-15  9:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
>
>> This is completely overengineered and definitely not an interface which
>> should be made available to anything else than this particular MCE case.
>
> This patch introduces a significant change under the radar (not explained
> in the changelog): it turns preempt_enable_no_resched() into preempt_enable().
>
> Why, and why was it a no_resched() in the first place ? Was it for performance
> or correctness reasons ?

_no_resched() is an optimization when in code which cannot schedule
anyway. But #MC is definitely not a performance critical hotpath.

So yes, it's a change but really not significant.

>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
< Remove useless gunk >

Can you please trim your replies?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-15  9:34     ` Thomas Gleixner
@ 2020-05-15 13:11       ` Mathieu Desnoyers
  0 siblings, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-15 13:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, paulmck, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Will Deacon

----- On May 15, 2020, at 5:34 AM, Thomas Gleixner tglx@linutronix.de wrote:
[...]
> 
> Can you please trim your replies?

Sorry, will do,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-13 23:28   ` Mathieu Desnoyers
@ 2020-05-15 14:04     ` Frederic Weisbecker
  2020-05-15 15:45       ` Will Deacon
  0 siblings, 1 reply; 178+ messages in thread
From: Frederic Weisbecker @ 2020-05-15 14:04 UTC (permalink / raw)
  To: Mathieu Desnoyers, Will Deacon
  Cc: Thomas Gleixner, linux-kernel, x86, paulmck, Andy Lutomirski,
	Alexandre Chartre, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, rostedt, Joel Fernandes, Google,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Josh Poimboeuf,
	Peter Zijlstra, Catalin Marinas

On Wed, May 13, 2020 at 07:28:34PM -0400, Mathieu Desnoyers wrote:
> ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
> 
> > +#define arch_nmi_enter()						\
> [...]							\
> > +	___hcr = read_sysreg(hcr_el2);					\
> > +	if (!(___hcr & HCR_TGE)) {					\
> > +		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
> > +		isb();							\
> 
> Why is there an isb() above ^ ....
> 
> > +	}								\
> > +	/*								\
> [...]
> > -#define arch_nmi_exit()								\
> [...]
> > +	/*								\
> > +	 * Make sure ___ctx->cnt release is visible before we		\
> > +	 * restore the sysreg. Otherwise a new NMI occurring		\
> > +	 * right after write_sysreg() can be fooled and think		\
> > +	 * we secured things for it.					\
> > +	 */								\
> > +	barrier();							\
> > +	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
> > +		write_sysreg(___hcr, hcr_el2);				\
> 
> And not here ?

I have to defer to Will on this detail...

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-15 14:04     ` Frederic Weisbecker
@ 2020-05-15 15:45       ` Will Deacon
  2020-05-15 16:01         ` Mathieu Desnoyers
  0 siblings, 1 reply; 178+ messages in thread
From: Will Deacon @ 2020-05-15 15:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Thomas Gleixner, linux-kernel, x86, paulmck,
	Andy Lutomirski, Alexandre Chartre, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek, rostedt,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Peter Zijlstra, Catalin Marinas,
	maz

On Fri, May 15, 2020 at 04:04:39PM +0200, Frederic Weisbecker wrote:
> On Wed, May 13, 2020 at 07:28:34PM -0400, Mathieu Desnoyers wrote:
> > ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
> > 
> > > +#define arch_nmi_enter()						\
> > [...]							\
> > > +	___hcr = read_sysreg(hcr_el2);					\
> > > +	if (!(___hcr & HCR_TGE)) {					\
> > > +		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
> > > +		isb();							\
> > 
> > Why is there an isb() above ^ ....
> > 
> > > +	}								\
> > > +	/*								\
> > [...]
> > > -#define arch_nmi_exit()								\
> > [...]
> > > +	/*								\
> > > +	 * Make sure ___ctx->cnt release is visible before we		\
> > > +	 * restore the sysreg. Otherwise a new NMI occurring		\
> > > +	 * right after write_sysreg() can be fooled and think		\
> > > +	 * we secured things for it.					\
> > > +	 */								\
> > > +	barrier();							\
> > > +	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
> > > +		write_sysreg(___hcr, hcr_el2);				\
> > 
> > And not here ?
> 
> I have to defer to Will on this detail...

I think it's because we have to make sure that the register update has
taken effect before we can safely run the NMI handler (and so an ISB is
needed), but on the return path the exception return to the interrupted
context has an implicit ISB, so there's no need for an extra one here.
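
To make the asymmetry concrete, here is a minimal sketch simplified from the
macros quoted above (illustrative only; the nesting-counter handling is
omitted):

	/* entry: ensure HCR_EL2.TGE is in effect before the NMI handler runs */
	hcr = read_sysreg(hcr_el2);
	if (!(hcr & HCR_TGE)) {
		write_sysreg(hcr | HCR_TGE, hcr_el2);
		isb();		/* synchronize the sysreg update with later instructions */
	}

	/* ... NMI handler ... */

	/* exit: restore the original value; the ERET back to the interrupted
	 * context is itself context-synchronizing, so no explicit isb() here */
	if (!(hcr & HCR_TGE))
		write_sysreg(hcr, hcr_el2);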

Make sense?

Will

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-15 15:45       ` Will Deacon
@ 2020-05-15 16:01         ` Mathieu Desnoyers
  0 siblings, 0 replies; 178+ messages in thread
From: Mathieu Desnoyers @ 2020-05-15 16:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: Frederic Weisbecker, Thomas Gleixner, linux-kernel, x86, paulmck,
	Andy Lutomirski, Alexandre Chartre, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek, rostedt,
	Joel Fernandes, Google, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Josh Poimboeuf, Peter Zijlstra, Catalin Marinas,
	maz

----- On May 15, 2020, at 11:45 AM, Will Deacon will@kernel.org wrote:

> On Fri, May 15, 2020 at 04:04:39PM +0200, Frederic Weisbecker wrote:
>> On Wed, May 13, 2020 at 07:28:34PM -0400, Mathieu Desnoyers wrote:
>> > ----- On May 5, 2020, at 9:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
>> > 
>> > > +#define arch_nmi_enter()						\
>> > [...]							\
>> > > +	___hcr = read_sysreg(hcr_el2);					\
>> > > +	if (!(___hcr & HCR_TGE)) {					\
>> > > +		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
>> > > +		isb();							\
>> > 
>> > Why is there an isb() above ^ ....
>> > 
>> > > +	}								\
>> > > +	/*								\
>> > [...]
>> > > -#define arch_nmi_exit()								\
>> > [...]
>> > > +	/*								\
>> > > +	 * Make sure ___ctx->cnt release is visible before we		\
>> > > +	 * restore the sysreg. Otherwise a new NMI occurring		\
>> > > +	 * right after write_sysreg() can be fooled and think		\
>> > > +	 * we secured things for it.					\
>> > > +	 */								\
>> > > +	barrier();							\
>> > > +	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
>> > > +		write_sysreg(___hcr, hcr_el2);				\
>> > 
>> > And not here ?
>> 
>> I have to defer to Will on this detail...
> 
> I think it's because we have to make sure that the register update has
> taken effect before we can safely run the NMI handler (and so an ISB is
> needed), but on the return path the exception return back to the interrupted
> context has an implicit ISB so there's no need for an extra one here.
> 
> Make sense?

Sure, as long as instructions executed between write_sysreg() and return
from exception do not care, which I think should be the case.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-05 13:16 ` [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion Thomas Gleixner
  2020-05-13 23:28   ` Mathieu Desnoyers
@ 2020-05-15 21:29   ` Thomas Gleixner
  2020-05-15 21:31     ` Frederic Weisbecker
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Frederic Weisbecker
  2 siblings, 1 reply; 178+ messages in thread
From: Thomas Gleixner @ 2020-05-15 21:29 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Peter Zijlstra (Intel),
	Catalin Marinas

Thomas Gleixner <tglx@linutronix.de> writes:

> From: Frederic Weisbecker <frederic@kernel.org>

This changelog was very empty. Here is what Peter provided:

 When using nmi_enter() recursively, arch_nmi_enter() must also be recursion
 safe. In particular, it must be ensured that HCR_TGE is always set while in
 NMI context when in HYP mode, and restored to its former state when done.

 The current code fails this when calls are interleaved the wrong way;
 notably, it overwrites the original HCR state on nesting.

 Introduce a nesting counter to make sure the original value is preserved.
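
A rough sketch of the nesting scheme (illustrative C only -- the names and
layout are simplified, not the actual arm64 macros, which keep this state in
a per-CPU context):

	struct nmi_ctx {
		u64 hcr;		/* HCR_EL2 value seen by the outermost NMI */
		unsigned int cnt;	/* nesting depth */
	};

	static void nmi_enter_sketch(struct nmi_ctx *ctx)
	{
		u64 hcr;

		if (ctx->cnt) {			/* nested entry: state already saved */
			ctx->cnt++;
			return;
		}

		hcr = read_sysreg(hcr_el2);
		if (!(hcr & HCR_TGE)) {
			write_sysreg(hcr | HCR_TGE, hcr_el2);
			isb();
		}
		ctx->cnt = 1;
		ctx->hcr = hcr;			/* only the outermost entry records the original value */
	}

	static void nmi_exit_sketch(struct nmi_ctx *ctx)
	{
		if (--ctx->cnt)			/* still nested: nothing to restore */
			return;
		if (!(ctx->hcr & HCR_TGE))
			write_sysreg(ctx->hcr, hcr_el2);
	}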


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-15 21:29   ` Thomas Gleixner
@ 2020-05-15 21:31     ` Frederic Weisbecker
  0 siblings, 0 replies; 178+ messages in thread
From: Frederic Weisbecker @ 2020-05-15 21:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Peter Zijlstra (Intel),
	Catalin Marinas

On Fri, May 15, 2020 at 11:29:13PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> 
> > From: Frederic Weisbecker <frederic@kernel.org>
> 
> This changelog was very empty. Here is what Peter provided:
> 
>  When using nmi_enter() recursively, arch_nmi_enter() must also be recursion
>  safe. In particular, it must be ensured that HCR_TGE is always set while in
>  NMI context when in HYP mode, and restored to its former state when done.
> 
>  The current code fails this when calls are interleaved the wrong way;
>  notably, it overwrites the original HCR state on nesting.
> 
>  Introduce a nesting counter to make sure the original value is preserved.

Nice!

Thanks.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr
  2020-05-05 13:16 ` [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr Thomas Gleixner
  2020-05-05 18:07   ` Paul E. McKenney
@ 2020-05-19 19:48   ` Joel Fernandes
  2020-05-19 19:52   ` [tip: core/rcu] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 178+ messages in thread
From: Joel Fernandes @ 2020-05-19 19:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, vineethrp

Hi Thomas,

On Tue, May 05, 2020 at 03:16:27PM +0200, Thomas Gleixner wrote:
> These functions are invoked from context tracking and other places in the
> low level entry code. Move them into the .noinstr.text section to exclude
> them from instrumentation.
> 
> Mark the places which are safe to invoke traceable functions with
> instr_begin/end() so objtool won't complain.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/rcu/tree.c        |   85 +++++++++++++++++++++++++----------------------
>  kernel/rcu/tree_plugin.h |    4 +-
>  kernel/rcu/update.c      |    7 +--
>  3 files changed, 52 insertions(+), 44 deletions(-)
[...]

Just a few nits/questions but otherwise LGTM.

> @@ -299,7 +294,7 @@ static void rcu_dynticks_eqs_online(void
>   *
>   * No ordering, as we are sampling CPU-local information.
>   */
> -static bool rcu_dynticks_curr_cpu_in_eqs(void)
> +static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  
> @@ -566,7 +561,7 @@ EXPORT_SYMBOL_GPL(rcutorture_get_gp_data
>   * the possibility of usermode upcalls having messed up our count
>   * of interrupt nesting level during the prior busy period.
>   */
> -static void rcu_eqs_enter(bool user)
> +static noinstr void rcu_eqs_enter(bool user)
>  {
>  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  
> @@ -581,12 +576,14 @@ static void rcu_eqs_enter(bool user)
>  	}
>  
>  	lockdep_assert_irqs_disabled();
> +	instr_begin();
>  	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	rdp = this_cpu_ptr(&rcu_data);
>  	do_nocb_deferred_wakeup(rdp);
>  	rcu_prepare_for_idle();
>  	rcu_preempt_deferred_qs(current);
> +	instr_end();
>  	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
>  	// RCU is watching here ...
>  	rcu_dynticks_eqs_enter();
> @@ -623,7 +620,7 @@ void rcu_idle_enter(void)
>   * If you add or remove a call to rcu_user_enter(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_user_enter(void)
> +noinstr void rcu_user_enter(void)
>  {
>  	lockdep_assert_irqs_disabled();
>  	rcu_eqs_enter(true);
> @@ -656,19 +653,23 @@ static __always_inline void rcu_nmi_exit
>  	 * leave it in non-RCU-idle state.
>  	 */
>  	if (rdp->dynticks_nmi_nesting != 1) {
> +		instr_begin();
>  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
>  				  atomic_read(&rdp->dynticks));
>  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
>  			   rdp->dynticks_nmi_nesting - 2);
> +		instr_end();

Can instr_end() be moved before the WRITE_ONCE()?

> @@ -737,7 +738,7 @@ void rcu_irq_exit_irqson(void)
>   * allow for the possibility of usermode upcalls messing up our count of
>   * interrupt nesting level during the busy period that is just now starting.
>   */
> -static void rcu_eqs_exit(bool user)
> +static void noinstr rcu_eqs_exit(bool user)
>  {
>  	struct rcu_data *rdp;
>  	long oldval;
> @@ -755,12 +756,14 @@ static void rcu_eqs_exit(bool user)
>  	// RCU is not watching here ...
>  	rcu_dynticks_eqs_exit();
>  	// ... but is watching here.
> +	instr_begin();
>  	rcu_cleanup_after_idle();
>  	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
>  	WRITE_ONCE(rdp->dynticks_nesting, 1);
>  	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
> +	instr_end();

And here I think instr_end() can be moved after the trace_rcu_dyntick() call?

>  }
>  
>  /**
> @@ -791,7 +794,7 @@ void rcu_idle_exit(void)
>   * If you add or remove a call to rcu_user_exit(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_user_exit(void)
> +void noinstr rcu_user_exit(void)
>  {
>  	rcu_eqs_exit(1);
>  }
> @@ -839,28 +842,35 @@ static __always_inline void rcu_nmi_ente
>  			rcu_cleanup_after_idle();
>  
>  		incby = 1;
> -	} else if (irq && tick_nohz_full_cpu(rdp->cpu) &&
> -		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
> -		   READ_ONCE(rdp->rcu_urgent_qs) &&
> -		   !READ_ONCE(rdp->rcu_forced_tick)) {
> -		// We get here only if we had already exited the extended
> -		// quiescent state and this was an interrupt (not an NMI).
> -		// Therefore, (1) RCU is already watching and (2) The fact
> -		// that we are in an interrupt handler and that the rcu_node
> -		// lock is an irq-disabled lock prevents self-deadlock.
> -		// So we can safely recheck under the lock.
> -		raw_spin_lock_rcu_node(rdp->mynode);
> -		if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> -			// A nohz_full CPU is in the kernel and RCU
> -			// needs a quiescent state.  Turn on the tick!
> -			WRITE_ONCE(rdp->rcu_forced_tick, true);
> -			tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
> +	} else if (irq) {
> +		instr_begin();
> +		if (tick_nohz_full_cpu(rdp->cpu) &&

Not a huge fan of the extra indentation but don't see a better way :)

> +		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
> +		    READ_ONCE(rdp->rcu_urgent_qs) &&
> +		    !READ_ONCE(rdp->rcu_forced_tick)) {
> +			// We get here only if we had already exited the
> +			// extended quiescent state and this was an
> +			// interrupt (not an NMI).  Therefore, (1) RCU is
> +			// already watching and (2) The fact that we are in
> +			// an interrupt handler and that the rcu_node lock
> +			// is an irq-disabled lock prevents self-deadlock.
> +			// So we can safely recheck under the lock.
> +			raw_spin_lock_rcu_node(rdp->mynode);
> +			if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> +				// A nohz_full CPU is in the kernel and RCU
> +				// needs a quiescent state.  Turn on the tick!
> +				WRITE_ONCE(rdp->rcu_forced_tick, true);
> +				tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
> +			}
> +			raw_spin_unlock_rcu_node(rdp->mynode);
>  		}
> -		raw_spin_unlock_rcu_node(rdp->mynode);
> +		instr_end();
>  	}
> +	instr_begin();
>  	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
>  			  rdp->dynticks_nmi_nesting,
>  			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
> +	instr_end();
>  	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
>  		   rdp->dynticks_nmi_nesting + incby);
>  	barrier();
> @@ -869,11 +879,10 @@ static __always_inline void rcu_nmi_ente
>  /**
>   * rcu_nmi_enter - inform RCU of entry to NMI context
>   */
> -void rcu_nmi_enter(void)
> +noinstr void rcu_nmi_enter(void)
>  {
>  	rcu_nmi_enter_common(false);
>  }
> -NOKPROBE_SYMBOL(rcu_nmi_enter);
>  
>  /**
>   * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
> @@ -897,7 +906,7 @@ NOKPROBE_SYMBOL(rcu_nmi_enter);
>   * If you add or remove a call to rcu_irq_enter(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_irq_enter(void)
> +noinstr void rcu_irq_enter(void)

Just checking if rcu_irq_enter_irqson() needs noinstr too?

Would the noinstr checking be enforced by the kernel build as well, or is the
developer required to run it themselves?

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


>  {
>  	lockdep_assert_irqs_disabled();
>  	rcu_nmi_enter_common(true);
> @@ -942,7 +951,7 @@ static void rcu_disable_urgency_upon_qs(
>   * if the current CPU is not in its idle loop or is in an interrupt or
>   * NMI handler, return true.
>   */
> -bool notrace rcu_is_watching(void)
> +noinstr bool rcu_is_watching(void)
>  {
>  	bool ret;
>  
> @@ -986,7 +995,7 @@ void rcu_request_urgent_qs_task(struct t
>   * RCU on an offline processor during initial boot, hence the check for
>   * rcu_scheduler_fully_active.
>   */
> -bool rcu_lockdep_current_cpu_online(void)
> +noinstr bool rcu_lockdep_current_cpu_online(void)
>  {
>  	struct rcu_data *rdp;
>  	struct rcu_node *rnp;
> @@ -994,12 +1003,12 @@ bool rcu_lockdep_current_cpu_online(void
>  
>  	if (in_nmi() || !rcu_scheduler_fully_active)
>  		return true;
> -	preempt_disable();
> +	preempt_disable_notrace();
>  	rdp = this_cpu_ptr(&rcu_data);
>  	rnp = rdp->mynode;
>  	if (rdp->grpmask & rcu_rnp_online_cpus(rnp))
>  		ret = true;
> -	preempt_enable();
> +	preempt_enable_notrace();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2553,7 +2553,7 @@ static void rcu_bind_gp_kthread(void)
>  }
>  
>  /* Record the current task on dyntick-idle entry. */
> -static void rcu_dynticks_task_enter(void)
> +static void noinstr rcu_dynticks_task_enter(void)
>  {
>  #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
>  	WRITE_ONCE(current->rcu_tasks_idle_cpu, smp_processor_id());
> @@ -2561,7 +2561,7 @@ static void rcu_dynticks_task_enter(void
>  }
>  
>  /* Record no current task on dyntick-idle exit. */
> -static void rcu_dynticks_task_exit(void)
> +static void noinstr rcu_dynticks_task_exit(void)
>  {
>  #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
>  	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -95,7 +95,7 @@ module_param(rcu_normal_after_boot, int,
>   * Similarly, we avoid claiming an RCU read lock held if the current
>   * CPU is offline.
>   */
> -static bool rcu_read_lock_held_common(bool *ret)
> +static noinstr bool rcu_read_lock_held_common(bool *ret)
>  {
>  	if (!debug_lockdep_rcu_enabled()) {
>  		*ret = 1;
> @@ -112,7 +112,7 @@ static bool rcu_read_lock_held_common(bo
>  	return false;
>  }
>  
> -int rcu_read_lock_sched_held(void)
> +noinstr int rcu_read_lock_sched_held(void)
>  {
>  	bool ret;
>  
> @@ -270,13 +270,12 @@ struct lockdep_map rcu_callback_map =
>  	STATIC_LOCKDEP_MAP_INIT("rcu_callback", &rcu_callback_key);
>  EXPORT_SYMBOL_GPL(rcu_callback_map);
>  
> -int notrace debug_lockdep_rcu_enabled(void)
> +noinstr int notrace debug_lockdep_rcu_enabled(void)
>  {
>  	return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
>  	       current->lockdep_recursion == 0;
>  }
>  EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
> -NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled);
>  
>  /**
>   * rcu_read_lock_held() - might we be in RCU read-side critical section?
> 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
  2020-05-05 13:16 ` [patch V4 part 1 36/36] rcu: Make RCU IRQ enter/exit functions rely on in_nmi() Thomas Gleixner
  2020-05-05 18:13   ` Paul E. McKenney
  2020-05-06 17:09   ` Alexandre Chartre
@ 2020-05-19 19:52   ` tip-bot2 for Paul E. McKenney
  2 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Paul E. McKenney @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Paul E. McKenney, Peter Zijlstra (Intel),
	Thomas Gleixner, Alexandre Chartre, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     9ea366f669ded353ae49754216c042e7d2f72ba6
Gitweb:        https://git.kernel.org/tip/9ea366f669ded353ae49754216c042e7d2f72ba6
Author:        Paul E. McKenney <paulmck@kernel.org>
AuthorDate:    Thu, 13 Feb 2020 12:31:16 -08:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:21 +02:00

rcu: Make RCU IRQ enter/exit functions rely on in_nmi()

The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
"irq" parameter that indicates whether these functions have been invoked from
an irq handler (irq==true) or an NMI handler (irq==false).

However, recent changes have applied notrace to a few critical functions
such that rcu_nmi_enter_common() and rcu_nmi_exit_common() may now rely on
in_nmi().  Note that in_nmi() works no differently than before; rather,
tracing is now prohibited in code regions where in_nmi() would incorrectly
report the NMI state.

Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
respectively.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de


---
 kernel/rcu/tree.c | 47 ++++++++++++++--------------------------------
 1 file changed, 15 insertions(+), 32 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0713ef3..9454016 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -664,16 +664,18 @@ noinstr void rcu_user_enter(void)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
-/*
+/**
+ * rcu_nmi_exit - inform RCU of exit from NMI context
+ *
  * If we are returning from the outermost NMI handler that interrupted an
  * RCU-idle period, update rdp->dynticks and rdp->dynticks_nmi_nesting
  * to let the RCU grace-period handling know that the CPU is back to
  * being RCU-idle.
  *
- * If you add or remove a call to rcu_nmi_exit_common(), be sure to test
+ * If you add or remove a call to rcu_nmi_exit(), be sure to test
  * with CONFIG_RCU_EQS_DEBUG=y.
  */
-static __always_inline void rcu_nmi_exit_common(bool irq)
+noinstr void rcu_nmi_exit(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
@@ -704,7 +706,7 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
 	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
 
-	if (irq)
+	if (!in_nmi())
 		rcu_prepare_for_idle();
 	instrumentation_end();
 
@@ -712,22 +714,11 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
 	rcu_dynticks_eqs_enter();
 	// ... but is no longer watching here.
 
-	if (irq)
+	if (!in_nmi())
 		rcu_dynticks_task_enter();
 }
 
 /**
- * rcu_nmi_exit - inform RCU of exit from NMI context
- *
- * If you add or remove a call to rcu_nmi_exit(), be sure to test
- * with CONFIG_RCU_EQS_DEBUG=y.
- */
-void noinstr rcu_nmi_exit(void)
-{
-	rcu_nmi_exit_common(false);
-}
-
-/**
  * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle
  *
  * Exit from an interrupt handler, which might possibly result in entering
@@ -749,7 +740,7 @@ void noinstr rcu_nmi_exit(void)
 void noinstr rcu_irq_exit(void)
 {
 	lockdep_assert_irqs_disabled();
-	rcu_nmi_exit_common(true);
+	rcu_nmi_exit();
 }
 
 /*
@@ -838,7 +829,7 @@ void noinstr rcu_user_exit(void)
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
- * rcu_nmi_enter_common - inform RCU of entry to NMI context
+ * rcu_nmi_enter - inform RCU of entry to NMI context
  * @irq: Is this call from rcu_irq_enter?
  *
  * If the CPU was idle from RCU's viewpoint, update rdp->dynticks and
@@ -847,10 +838,10 @@ void noinstr rcu_user_exit(void)
  * long as the nesting level does not overflow an int.  (You will probably
  * run out of stack space first.)
  *
- * If you add or remove a call to rcu_nmi_enter_common(), be sure to test
+ * If you add or remove a call to rcu_nmi_enter(), be sure to test
  * with CONFIG_RCU_EQS_DEBUG=y.
  */
-static __always_inline void rcu_nmi_enter_common(bool irq)
+noinstr void rcu_nmi_enter(void)
 {
 	long incby = 2;
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
@@ -868,18 +859,18 @@ static __always_inline void rcu_nmi_enter_common(bool irq)
 	 */
 	if (rcu_dynticks_curr_cpu_in_eqs()) {
 
-		if (irq)
+		if (!in_nmi())
 			rcu_dynticks_task_exit();
 
 		// RCU is not watching here ...
 		rcu_dynticks_eqs_exit();
 		// ... but is watching here.
 
-		if (irq)
+		if (!in_nmi())
 			rcu_cleanup_after_idle();
 
 		incby = 1;
-	} else if (irq) {
+	} else if (!in_nmi()) {
 		instrumentation_begin();
 		if (tick_nohz_full_cpu(rdp->cpu) &&
 		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
@@ -914,14 +905,6 @@ static __always_inline void rcu_nmi_enter_common(bool irq)
 }
 
 /**
- * rcu_nmi_enter - inform RCU of entry to NMI context
- */
-noinstr void rcu_nmi_enter(void)
-{
-	rcu_nmi_enter_common(false);
-}
-
-/**
  * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
  *
  * Enter an interrupt handler, which might possibly result in exiting
@@ -946,7 +929,7 @@ noinstr void rcu_nmi_enter(void)
 noinstr void rcu_irq_enter(void)
 {
 	lockdep_assert_irqs_disabled();
-	rcu_nmi_enter_common(true);
+	rcu_nmi_enter();
 }
 
 /*

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] rcu/tree: Mark the idle relevant functions noinstr
  2020-05-05 13:16 ` [patch V4 part 1 25/36] rcu/tree: Mark the idle relevant functions noinstr Thomas Gleixner
  2020-05-05 18:07   ` Paul E. McKenney
  2020-05-19 19:48   ` Joel Fernandes
@ 2020-05-19 19:52   ` tip-bot2 for Thomas Gleixner
  2020-09-28 22:22     ` Kim Phillips
  2020-09-29 11:25     ` Peter Zijlstra
  2 siblings, 2 replies; 178+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra,
	Paul E. McKenney, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     ff5c4f5cad33061b07c3fb9187506783c0f3cb66
Gitweb:        https://git.kernel.org/tip/ff5c4f5cad33061b07c3fb9187506783c0f3cb66
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 13 Mar 2020 17:32:17 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:20 +02:00

rcu/tree: Mark the idle relevant functions noinstr

These functions are invoked from context tracking and other places in the
low level entry code. Move them into the .noinstr.text section to exclude
them from instrumentation.

Mark the places which are safe to invoke traceable functions with
instrumentation_begin/end() so objtool won't complain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de


---
 kernel/rcu/tree.c        | 83 +++++++++++++++++++++------------------
 kernel/rcu/tree_plugin.h |  4 +-
 kernel/rcu/update.c      |  3 +-
 3 files changed, 49 insertions(+), 41 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f288477..0713ef3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -88,9 +88,6 @@
  */
 #define RCU_DYNTICK_CTRL_MASK 0x1
 #define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)
-#ifndef rcu_eqs_special_exit
-#define rcu_eqs_special_exit() do { } while (0)
-#endif
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
 	.dynticks_nesting = 1,
@@ -242,7 +239,7 @@ void rcu_softirq_qs(void)
  * RCU is watching prior to the call to this function and is no longer
  * watching upon return.
  */
-static void rcu_dynticks_eqs_enter(void)
+static noinstr void rcu_dynticks_eqs_enter(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int seq;
@@ -267,7 +264,7 @@ static void rcu_dynticks_eqs_enter(void)
  * called from an extended quiescent state, that is, RCU is not watching
  * prior to the call to this function and is watching upon return.
  */
-static void rcu_dynticks_eqs_exit(void)
+static noinstr void rcu_dynticks_eqs_exit(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int seq;
@@ -285,8 +282,6 @@ static void rcu_dynticks_eqs_exit(void)
 	if (seq & RCU_DYNTICK_CTRL_MASK) {
 		atomic_andnot(RCU_DYNTICK_CTRL_MASK, &rdp->dynticks);
 		smp_mb__after_atomic(); /* _exit after clearing mask. */
-		/* Prefer duplicate flushes to losing a flush. */
-		rcu_eqs_special_exit();
 	}
 }
 
@@ -314,7 +309,7 @@ static void rcu_dynticks_eqs_online(void)
  *
  * No ordering, as we are sampling CPU-local information.
  */
-static bool rcu_dynticks_curr_cpu_in_eqs(void)
+static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
@@ -603,7 +598,7 @@ EXPORT_SYMBOL_GPL(rcutorture_get_gp_data);
  * the possibility of usermode upcalls having messed up our count
  * of interrupt nesting level during the prior busy period.
  */
-static void rcu_eqs_enter(bool user)
+static noinstr void rcu_eqs_enter(bool user)
 {
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
@@ -618,12 +613,14 @@ static void rcu_eqs_enter(bool user)
 	}
 
 	lockdep_assert_irqs_disabled();
+	instrumentation_begin();
 	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	rdp = this_cpu_ptr(&rcu_data);
 	do_nocb_deferred_wakeup(rdp);
 	rcu_prepare_for_idle();
 	rcu_preempt_deferred_qs(current);
+	instrumentation_end();
 	WRITE_ONCE(rdp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
 	// RCU is watching here ...
 	rcu_dynticks_eqs_enter();
@@ -660,7 +657,7 @@ void rcu_idle_enter(void)
  * If you add or remove a call to rcu_user_enter(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_user_enter(void)
+noinstr void rcu_user_enter(void)
 {
 	lockdep_assert_irqs_disabled();
 	rcu_eqs_enter(true);
@@ -693,19 +690,23 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
 	 * leave it in non-RCU-idle state.
 	 */
 	if (rdp->dynticks_nmi_nesting != 1) {
+		instrumentation_begin();
 		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2,
 				  atomic_read(&rdp->dynticks));
 		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
 			   rdp->dynticks_nmi_nesting - 2);
+		instrumentation_end();
 		return;
 	}
 
+	instrumentation_begin();
 	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
 	trace_rcu_dyntick(TPS("Startirq"), rdp->dynticks_nmi_nesting, 0, atomic_read(&rdp->dynticks));
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
 
 	if (irq)
 		rcu_prepare_for_idle();
+	instrumentation_end();
 
 	// RCU is watching here ...
 	rcu_dynticks_eqs_enter();
@@ -721,7 +722,7 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
  * If you add or remove a call to rcu_nmi_exit(), be sure to test
  * with CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_nmi_exit(void)
+void noinstr rcu_nmi_exit(void)
 {
 	rcu_nmi_exit_common(false);
 }
@@ -745,7 +746,7 @@ void rcu_nmi_exit(void)
  * If you add or remove a call to rcu_irq_exit(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_irq_exit(void)
+void noinstr rcu_irq_exit(void)
 {
 	lockdep_assert_irqs_disabled();
 	rcu_nmi_exit_common(true);
@@ -774,7 +775,7 @@ void rcu_irq_exit_irqson(void)
  * allow for the possibility of usermode upcalls messing up our count of
  * interrupt nesting level during the busy period that is just now starting.
  */
-static void rcu_eqs_exit(bool user)
+static void noinstr rcu_eqs_exit(bool user)
 {
 	struct rcu_data *rdp;
 	long oldval;
@@ -792,12 +793,14 @@ static void rcu_eqs_exit(bool user)
 	// RCU is not watching here ...
 	rcu_dynticks_eqs_exit();
 	// ... but is watching here.
+	instrumentation_begin();
 	rcu_cleanup_after_idle();
 	trace_rcu_dyntick(TPS("End"), rdp->dynticks_nesting, 1, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
 	WRITE_ONCE(rdp->dynticks_nesting, 1);
 	WARN_ON_ONCE(rdp->dynticks_nmi_nesting);
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
+	instrumentation_end();
 }
 
 /**
@@ -828,7 +831,7 @@ void rcu_idle_exit(void)
  * If you add or remove a call to rcu_user_exit(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_user_exit(void)
+void noinstr rcu_user_exit(void)
 {
 	rcu_eqs_exit(1);
 }
@@ -876,28 +879,35 @@ static __always_inline void rcu_nmi_enter_common(bool irq)
 			rcu_cleanup_after_idle();
 
 		incby = 1;
-	} else if (irq && tick_nohz_full_cpu(rdp->cpu) &&
-		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
-		   READ_ONCE(rdp->rcu_urgent_qs) &&
-		   !READ_ONCE(rdp->rcu_forced_tick)) {
-		// We get here only if we had already exited the extended
-		// quiescent state and this was an interrupt (not an NMI).
-		// Therefore, (1) RCU is already watching and (2) The fact
-		// that we are in an interrupt handler and that the rcu_node
-		// lock is an irq-disabled lock prevents self-deadlock.
-		// So we can safely recheck under the lock.
-		raw_spin_lock_rcu_node(rdp->mynode);
-		if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
-			// A nohz_full CPU is in the kernel and RCU
-			// needs a quiescent state.  Turn on the tick!
-			WRITE_ONCE(rdp->rcu_forced_tick, true);
-			tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+	} else if (irq) {
+		instrumentation_begin();
+		if (tick_nohz_full_cpu(rdp->cpu) &&
+		    rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
+		    READ_ONCE(rdp->rcu_urgent_qs) &&
+		    !READ_ONCE(rdp->rcu_forced_tick)) {
+			// We get here only if we had already exited the
+			// extended quiescent state and this was an
+			// interrupt (not an NMI).  Therefore, (1) RCU is
+			// already watching and (2) The fact that we are in
+			// an interrupt handler and that the rcu_node lock
+			// is an irq-disabled lock prevents self-deadlock.
+			// So we can safely recheck under the lock.
+			raw_spin_lock_rcu_node(rdp->mynode);
+			if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+				// A nohz_full CPU is in the kernel and RCU
+				// needs a quiescent state.  Turn on the tick!
+				WRITE_ONCE(rdp->rcu_forced_tick, true);
+				tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+			}
+			raw_spin_unlock_rcu_node(rdp->mynode);
 		}
-		raw_spin_unlock_rcu_node(rdp->mynode);
+		instrumentation_end();
 	}
+	instrumentation_begin();
 	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
 			  rdp->dynticks_nmi_nesting,
 			  rdp->dynticks_nmi_nesting + incby, atomic_read(&rdp->dynticks));
+	instrumentation_end();
 	WRITE_ONCE(rdp->dynticks_nmi_nesting, /* Prevent store tearing. */
 		   rdp->dynticks_nmi_nesting + incby);
 	barrier();
@@ -906,11 +916,10 @@ static __always_inline void rcu_nmi_enter_common(bool irq)
 /**
  * rcu_nmi_enter - inform RCU of entry to NMI context
  */
-void rcu_nmi_enter(void)
+noinstr void rcu_nmi_enter(void)
 {
 	rcu_nmi_enter_common(false);
 }
-NOKPROBE_SYMBOL(rcu_nmi_enter);
 
 /**
  * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle
@@ -934,7 +943,7 @@ NOKPROBE_SYMBOL(rcu_nmi_enter);
  * If you add or remove a call to rcu_irq_enter(), be sure to test with
  * CONFIG_RCU_EQS_DEBUG=y.
  */
-void rcu_irq_enter(void)
+noinstr void rcu_irq_enter(void)
 {
 	lockdep_assert_irqs_disabled();
 	rcu_nmi_enter_common(true);
@@ -979,7 +988,7 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp)
  * if the current CPU is not in its idle loop or is in an interrupt or
  * NMI handler, return true.
  */
-bool notrace rcu_is_watching(void)
+bool rcu_is_watching(void)
 {
 	bool ret;
 
@@ -1031,12 +1040,12 @@ bool rcu_lockdep_current_cpu_online(void)
 
 	if (in_nmi() || !rcu_scheduler_fully_active)
 		return true;
-	preempt_disable();
+	preempt_disable_notrace();
 	rdp = this_cpu_ptr(&rcu_data);
 	rnp = rdp->mynode;
 	if (rdp->grpmask & rcu_rnp_online_cpus(rnp))
 		ret = true;
-	preempt_enable();
+	preempt_enable_notrace();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 50caa3f..3522236 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2539,7 +2539,7 @@ static void rcu_bind_gp_kthread(void)
 }
 
 /* Record the current task on dyntick-idle entry. */
-static void rcu_dynticks_task_enter(void)
+static void noinstr rcu_dynticks_task_enter(void)
 {
 #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
 	WRITE_ONCE(current->rcu_tasks_idle_cpu, smp_processor_id());
@@ -2547,7 +2547,7 @@ static void rcu_dynticks_task_enter(void)
 }
 
 /* Record no current task on dyntick-idle exit. */
-static void rcu_dynticks_task_exit(void)
+static void noinstr rcu_dynticks_task_exit(void)
 {
 #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
 	WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 3ce63a9..84843ad 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -284,13 +284,12 @@ struct lockdep_map rcu_callback_map =
 	STATIC_LOCKDEP_MAP_INIT("rcu_callback", &rcu_callback_key);
 EXPORT_SYMBOL_GPL(rcu_callback_map);
 
-int notrace debug_lockdep_rcu_enabled(void)
+noinstr int notrace debug_lockdep_rcu_enabled(void)
 {
 	return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
 	       current->lockdep_recursion == 0;
 }
 EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
-NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled);
 
 /**
  * rcu_read_lock_held() - might we be in RCU read-side critical section?

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] x86: Replace ist_enter() with nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 35/36] x86: Replace ist_enter() with nmi_enter() Thomas Gleixner
  2020-05-07 18:04   ` Andy Lutomirski
  2020-05-14  0:12   ` Mathieu Desnoyers
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, Alexandre Chartre, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     0d00449c7a28a1514595630735df383dec606812
Gitweb:        https://git.kernel.org/tip/0d00449c7a28a1514595630735df383dec606812
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 19 Feb 2020 09:46:43 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:20 +02:00

x86: Replace ist_enter() with nmi_enter()

A few exceptions (like #DB and #BP) can happen at any location in the code;
this means that tracers should treat events from these exceptions as
NMI-like. The interrupted context could, for instance, be holding locks with
interrupts disabled.

Similarly, #MC is an actual NMI-like exception.

All of them use ist_enter() which only concerns itself with RCU, but does
not do any of the other setup that NMIs need. This means things like:

	printk()
	  raw_spin_lock_irq(&logbuf_lock);
	  <#DB/#BP/#MC>
	     printk()
	       raw_spin_lock_irq(&logbuf_lock);

are entirely possible (well, not really since printk tries hard to
play nice, but the concept stands).

So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter()
caller must be both notrace and NOKPROBE, or in the noinstr text section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de


---
 arch/x86/include/asm/traps.h      |  3 +-
 arch/x86/kernel/cpu/mce/core.c    |  5 +-
 arch/x86/kernel/cpu/mce/p5.c      |  5 +-
 arch/x86/kernel/cpu/mce/winchip.c |  5 +-
 arch/x86/kernel/traps.c           | 71 ++++++------------------------
 5 files changed, 24 insertions(+), 65 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index fe109fc..6f6c417 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -118,9 +118,6 @@ void smp_spurious_interrupt(struct pt_regs *regs);
 void smp_error_interrupt(struct pt_regs *regs);
 asmlinkage void smp_irq_move_cleanup_interrupt(void);
 
-extern void ist_enter(struct pt_regs *regs);
-extern void ist_exit(struct pt_regs *regs);
-
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(const char *message,
 				      struct pt_regs *regs,
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2f0ef95..e9265e2 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -43,6 +43,7 @@
 #include <linux/jump_label.h>
 #include <linux/set_memory.h>
 #include <linux/task_work.h>
+#include <linux/hardirq.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -1266,7 +1267,7 @@ void noinstr do_machine_check(struct pt_regs *regs, long error_code)
 	if (__mc_check_crashing_cpu(cpu))
 		return;
 
-	ist_enter(regs);
+	nmi_enter();
 
 	this_cpu_inc(mce_exception_count);
 
@@ -1374,7 +1375,7 @@ void noinstr do_machine_check(struct pt_regs *regs, long error_code)
 	}
 
 out_ist:
-	ist_exit(regs);
+	nmi_exit();
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
diff --git a/arch/x86/kernel/cpu/mce/p5.c b/arch/x86/kernel/cpu/mce/p5.c
index 4ae6df5..5ee94aa 100644
--- a/arch/x86/kernel/cpu/mce/p5.c
+++ b/arch/x86/kernel/cpu/mce/p5.c
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/smp.h>
+#include <linux/hardirq.h>
 
 #include <asm/processor.h>
 #include <asm/traps.h>
@@ -24,7 +25,7 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
 {
 	u32 loaddr, hi, lotype;
 
-	ist_enter(regs);
+	nmi_enter();
 
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
 	rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
@@ -39,7 +40,7 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
 
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
-	ist_exit(regs);
+	nmi_exit();
 }
 
 /* Set up machine check reporting for processors with Intel style MCE: */
diff --git a/arch/x86/kernel/cpu/mce/winchip.c b/arch/x86/kernel/cpu/mce/winchip.c
index a30ea13..b3938c1 100644
--- a/arch/x86/kernel/cpu/mce/winchip.c
+++ b/arch/x86/kernel/cpu/mce/winchip.c
@@ -6,6 +6,7 @@
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
+#include <linux/hardirq.h>
 
 #include <asm/processor.h>
 #include <asm/traps.h>
@@ -18,12 +19,12 @@
 /* Machine check handler for WinChip C6: */
 static void winchip_machine_check(struct pt_regs *regs, long error_code)
 {
-	ist_enter(regs);
+	nmi_enter();
 
 	pr_emerg("CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
-	ist_exit(regs);
+	nmi_exit();
 }
 
 /* Set up machine check reporting on the Winchip C6 series */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6740e83..f7cfb9d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -37,10 +37,12 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/io.h>
+#include <linux/hardirq.h>
+#include <linux/atomic.h>
+
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
 #include <asm/debugreg.h>
-#include <linux/atomic.h>
 #include <asm/text-patching.h>
 #include <asm/ftrace.h>
 #include <asm/traps.h>
@@ -82,41 +84,6 @@ static inline void cond_local_irq_disable(struct pt_regs *regs)
 		local_irq_disable();
 }
 
-/*
- * In IST context, we explicitly disable preemption.  This serves two
- * purposes: it makes it much less likely that we would accidentally
- * schedule in IST context and it will force a warning if we somehow
- * manage to schedule by accident.
- */
-void ist_enter(struct pt_regs *regs)
-{
-	if (user_mode(regs)) {
-		RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
-	} else {
-		/*
-		 * We might have interrupted pretty much anything.  In
-		 * fact, if we're a machine check, we can even interrupt
-		 * NMI processing.  We don't want in_nmi() to return true,
-		 * but we need to notify RCU.
-		 */
-		rcu_nmi_enter();
-	}
-
-	preempt_disable();
-
-	/* This code is a bit fragile.  Test it. */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
-}
-NOKPROBE_SYMBOL(ist_enter);
-
-void ist_exit(struct pt_regs *regs)
-{
-	preempt_enable_no_resched();
-
-	if (!user_mode(regs))
-		rcu_nmi_exit();
-}
-
 int is_valid_bugaddr(unsigned long addr)
 {
 	unsigned short ud;
@@ -326,7 +293,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsign
 	 * The net result is that our #GP handler will think that we
 	 * entered from usermode with the bad user context.
 	 *
-	 * No need for ist_enter here because we don't use RCU.
+	 * No need for nmi_enter() here because we don't use RCU.
 	 */
 	if (((long)regs->sp >> P4D_SHIFT) == ESPFIX_PGD_ENTRY &&
 		regs->cs == __KERNEL_CS &&
@@ -361,7 +328,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code, unsign
 	}
 #endif
 
-	ist_enter(regs);
+	nmi_enter();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
 	tsk->thread.error_code = error_code;
@@ -555,19 +522,13 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 		return;
 
 	/*
-	 * Unlike any other non-IST entry, we can be called from a kprobe in
-	 * non-CONTEXT_KERNEL kernel mode or even during context tracking
-	 * state changes.  Make sure that we wake up RCU even if we're coming
-	 * from kernel code.
-	 *
-	 * This means that we can't schedule even if we came from a
-	 * preemptible kernel context.  That's okay.
+	 * Unlike any other non-IST entry, we can be called from pretty much
+	 * any location in the kernel through kprobes -- text_poke() will most
+	 * likely be handled by poke_int3_handler() above. This means this
+	 * handler is effectively NMI-like.
 	 */
-	if (!user_mode(regs)) {
-		rcu_nmi_enter();
-		preempt_disable();
-	}
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+	if (!user_mode(regs))
+		nmi_enter();
 
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
@@ -589,10 +550,8 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	cond_local_irq_disable(regs);
 
 exit:
-	if (!user_mode(regs)) {
-		preempt_enable_no_resched();
-		rcu_nmi_exit();
-	}
+	if (!user_mode(regs))
+		nmi_exit();
 }
 NOKPROBE_SYMBOL(do_int3);
 
@@ -696,7 +655,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	unsigned long dr6;
 	int si_code;
 
-	ist_enter(regs);
+	nmi_enter();
 
 	get_debugreg(dr6, 6);
 	/*
@@ -789,7 +748,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	ist_exit(regs);
+	nmi_exit();
 }
 NOKPROBE_SYMBOL(do_debug);
 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] x86/mce: Send #MC signal from task work
  2020-05-05 13:16 ` [patch V4 part 1 29/36] x86/mce: Send #MC signal from task work Thomas Gleixner
                     ` (2 preceding siblings ...)
  2020-05-14 14:17   ` Borislav Petkov
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Thomas Gleixner, Frederic Weisbecker, Alexandre Chartre, x86,
	LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     5567d11c21a1d508a91a8cb64a819783a0835d9f
Gitweb:        https://git.kernel.org/tip/5567d11c21a1d508a91a8cb64a819783a0835d9f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 19 Feb 2020 10:22:06 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:19 +02:00

x86/mce: Send #MC signal from task work

Convert #MC over to using task_work_add(); it will run the same code
slightly later, on the return to user path of the same exception.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de


---
 arch/x86/kernel/cpu/mce/core.c | 56 ++++++++++++++++++---------------
 include/linux/sched.h          |  6 ++++-
 2 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 98bf91c..2f0ef95 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -42,6 +42,7 @@
 #include <linux/export.h>
 #include <linux/jump_label.h>
 #include <linux/set_memory.h>
+#include <linux/task_work.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -1086,23 +1087,6 @@ static void mce_clear_state(unsigned long *toclear)
 	}
 }
 
-static int do_memory_failure(struct mce *m)
-{
-	int flags = MF_ACTION_REQUIRED;
-	int ret;
-
-	pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
-	if (!(m->mcgstatus & MCG_STATUS_RIPV))
-		flags |= MF_MUST_KILL;
-	ret = memory_failure(m->addr >> PAGE_SHIFT, flags);
-	if (ret)
-		pr_err("Memory error not recovered");
-	else
-		set_mce_nospec(m->addr >> PAGE_SHIFT);
-	return ret;
-}
-
-
 /*
  * Cases where we avoid rendezvous handler timeout:
  * 1) If this CPU is offline.
@@ -1204,6 +1188,29 @@ static void __mc_scan_banks(struct mce *m, struct mce *final,
 	*m = *final;
 }
 
+static void kill_me_now(struct callback_head *ch)
+{
+	force_sig(SIGBUS);
+}
+
+static void kill_me_maybe(struct callback_head *cb)
+{
+	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
+	int flags = MF_ACTION_REQUIRED;
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	if (!(p->mce_status & MCG_STATUS_RIPV))
+		flags |= MF_MUST_KILL;
+
+	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags)) {
+		set_mce_nospec(p->mce_addr >> PAGE_SHIFT);
+		return;
+	}
+
+	pr_err("Memory error not recovered");
+	kill_me_now(cb);
+}
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
@@ -1222,7 +1229,7 @@ static void __mc_scan_banks(struct mce *m, struct mce *final,
  * backing the user stack, tracing that reads the user stack will cause
  * potentially infinite recursion.
  */
-void notrace do_machine_check(struct pt_regs *regs, long error_code)
+void noinstr do_machine_check(struct pt_regs *regs, long error_code)
 {
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
@@ -1354,13 +1361,13 @@ void notrace do_machine_check(struct pt_regs *regs, long error_code)
 	if ((m.cs & 3) == 3) {
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));
-		local_irq_enable();
-		preempt_enable();
 
-		if (kill_it || do_memory_failure(&m))
-			force_sig(SIGBUS);
-		preempt_disable();
-		local_irq_disable();
+		current->mce_addr = m.addr;
+		current->mce_status = m.mcgstatus;
+		current->mce_kill_me.func = kill_me_maybe;
+		if (kill_it)
+			current->mce_kill_me.func = kill_me_now;
+		task_work_add(current, &current->mce_kill_me, true);
 	} else {
 		if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
 			mce_panic("Failed kernel mode recovery", &m, msg);
@@ -1370,7 +1377,6 @@ out_ist:
 	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
-NOKPROBE_SYMBOL(do_machine_check);
 
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int flags)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9437b53..57d0ed0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1297,6 +1297,12 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_X86_MCE
+	u64				mce_addr;
+	u64				mce_status;
+	struct callback_head		mce_kill_me;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] sched,rcu,tracing: Avoid tracing before in_nmi() is correct
  2020-05-05 13:16 ` [patch V4 part 1 34/36] sched,rcu,tracing: Avoid tracing before in_nmi() is correct Thomas Gleixner
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steven Rostedt (VMware), Peter Zijlstra (Intel),
	Thomas Gleixner, Steven Rostedt (VMware),
	Alexandre Chartre, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     f93524eb9c54f49be150167918f6546b0a2e09b1
Gitweb:        https://git.kernel.org/tip/f93524eb9c54f49be150167918f6546b0a2e09b1
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 12 Feb 2020 21:01:16 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:18 +02:00

sched,rcu,tracing: Avoid tracing before in_nmi() is correct

If a tracer is invoked before in_nmi() becomes true, the tracer cannot
detect that it is called from NMI context and therefore cannot behave
correctly.

Therefore change nmi_{enter,exit}() to use __preempt_count_{add,sub}()
as the normal preempt_count_{add,sub}() have a (desired) function
trace entry.

This fixes a potential issue with the current code: when the function tracer
has stack tracing enabled, __trace_stack() will malfunction when it hits the
preempt_count_add() function entry from NMI context.

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.434193525@linutronix.de


---
 include/linux/hardirq.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index a043ad8..621556e 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -66,6 +66,15 @@ extern void irq_exit(void);
 #endif
 
 /*
+ * NMI vs Tracing
+ * --------------
+ *
+ * We must not land in a tracer until (or after) we've changed preempt_count
+ * such that in_nmi() becomes true. To that effect all NMI C entry points must
+ * be marked 'notrace' and call nmi_enter() as soon as possible.
+ */
+
+/*
  * nmi_enter() can nest up to 15 times; see NMI_BITS.
  */
 #define nmi_enter()						\
@@ -75,7 +84,7 @@ extern void irq_exit(void);
 		lockdep_off();					\
 		ftrace_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
-		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
+		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
 	} while (0)
@@ -85,7 +94,7 @@ extern void irq_exit(void);
 		lockdep_hardirq_exit();				\
 		rcu_nmi_exit();					\
 		BUG_ON(!in_nmi());				\
-		preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
+		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		ftrace_nmi_exit();				\
 		lockdep_on();					\
 		printk_nmi_exit();				\

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
  2020-05-05 13:16 ` [patch V4 part 1 32/36] sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception Thomas Gleixner
  2020-05-08  0:34   ` Steven Rostedt
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Thomas Gleixner, Alexandre Chartre, Steven Rostedt (VMware),
	Rich Felker, Yoshinori Sato, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     178ba00c354eb15cec6806a812771e60a5ae3ea1
Gitweb:        https://git.kernel.org/tip/178ba00c354eb15cec6806a812771e60a5ae3ea1
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 24 Feb 2020 22:26:21 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:18 +02:00

sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception

SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(), so
remove it from the generic code and move it into the SuperH code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de


---
 Documentation/trace/ftrace-design.rst |  8 --------
 arch/sh/Kconfig                       |  1 -
 arch/sh/kernel/traps.c                | 12 ++++++++++++
 include/linux/ftrace_irq.h            | 11 -----------
 kernel/trace/Kconfig                  | 10 ----------
 5 files changed, 12 insertions(+), 30 deletions(-)

diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index a8e22e0..6893399 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -229,14 +229,6 @@ Adding support for it is easy: just define the macro in asm/ftrace.h and
 pass the return address pointer as the 'retp' argument to
 ftrace_push_return_trace().
 
-HAVE_FTRACE_NMI_ENTER
----------------------
-
-If you can't trace NMI functions, then skip this option.
-
-<details to be filled>
-
-
 HAVE_SYSCALL_TRACEPOINTS
 ------------------------
 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index b4f0e37..97656d2 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -71,7 +71,6 @@ config SUPERH32
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_DYNAMIC_FTRACE
-	select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_ARCH_KGDB
diff --git a/arch/sh/kernel/traps.c b/arch/sh/kernel/traps.c
index 63cf17b..2130381 100644
--- a/arch/sh/kernel/traps.c
+++ b/arch/sh/kernel/traps.c
@@ -170,11 +170,21 @@ BUILD_TRAP_HANDLER(bug)
 	force_sig(SIGTRAP);
 }
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+extern void arch_ftrace_nmi_enter(void);
+extern void arch_ftrace_nmi_exit(void);
+#else
+static inline void arch_ftrace_nmi_enter(void) { }
+static inline void arch_ftrace_nmi_exit(void) { }
+#endif
+
 BUILD_TRAP_HANDLER(nmi)
 {
 	unsigned int cpu = smp_processor_id();
 	TRAP_HANDLER_DECL;
 
+	arch_ftrace_nmi_enter();
+
 	nmi_enter();
 	nmi_count(cpu)++;
 
@@ -190,4 +200,6 @@ BUILD_TRAP_HANDLER(nmi)
 	}
 
 	nmi_exit();
+
+	arch_ftrace_nmi_exit();
 }
diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
index ccda97d..0abd9a1 100644
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -2,15 +2,6 @@
 #ifndef _LINUX_FTRACE_IRQ_H
 #define _LINUX_FTRACE_IRQ_H
 
-
-#ifdef CONFIG_FTRACE_NMI_ENTER
-extern void arch_ftrace_nmi_enter(void);
-extern void arch_ftrace_nmi_exit(void);
-#else
-static inline void arch_ftrace_nmi_enter(void) { }
-static inline void arch_ftrace_nmi_exit(void) { }
-#endif
-
 #ifdef CONFIG_HWLAT_TRACER
 extern bool trace_hwlat_callback_enabled;
 extern void trace_hwlat_callback(bool enter);
@@ -22,12 +13,10 @@ static inline void ftrace_nmi_enter(void)
 	if (trace_hwlat_callback_enabled)
 		trace_hwlat_callback(true);
 #endif
-	arch_ftrace_nmi_enter();
 }
 
 static inline void ftrace_nmi_exit(void)
 {
-	arch_ftrace_nmi_exit();
 #ifdef CONFIG_HWLAT_TRACER
 	if (trace_hwlat_callback_enabled)
 		trace_hwlat_callback(false);
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 1e9a8f9..24876fa 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -10,11 +10,6 @@ config USER_STACKTRACE_SUPPORT
 config NOP_TRACER
 	bool
 
-config HAVE_FTRACE_NMI_ENTER
-	bool
-	help
-	  See Documentation/trace/ftrace-design.rst
-
 config HAVE_FUNCTION_TRACER
 	bool
 	help
@@ -72,11 +67,6 @@ config RING_BUFFER
 	select TRACE_CLOCK
 	select IRQ_WORK
 
-config FTRACE_NMI_ENTER
-       bool
-       depends on HAVE_FTRACE_NMI_ENTER
-       default y
-
 config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
 	select GLOB

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] x86/entry: Get rid of ist_begin/end_non_atomic()
  2020-05-05 13:16 ` [patch V4 part 1 14/36] x86/entry: Get rid of ist_begin/end_non_atomic() Thomas Gleixner
                     ` (2 preceding siblings ...)
  2020-05-13 22:57   ` Mathieu Desnoyers
@ 2020-05-19 19:52   ` tip-bot2 for Thomas Gleixner
  3 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     b052df3da821adfd6be26a6eb16624fb50e90e56
Gitweb:        https://git.kernel.org/tip/b052df3da821adfd6be26a6eb16624fb50e90e56
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 05 Mar 2020 00:52:41 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:19 +02:00

x86/entry: Get rid of ist_begin/end_non_atomic()

This is completely overengineered and definitely not an interface which
should be made available to anything other than this particular MCE case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de


---
 arch/x86/include/asm/traps.h   |  2 +--
 arch/x86/kernel/cpu/mce/core.c |  6 +++--
 arch/x86/kernel/traps.c        | 37 +---------------------------------
 3 files changed, 4 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c26a7e1..fe109fc 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -120,8 +120,6 @@ asmlinkage void smp_irq_move_cleanup_interrupt(void);
 
 extern void ist_enter(struct pt_regs *regs);
 extern void ist_exit(struct pt_regs *regs);
-extern void ist_begin_non_atomic(struct pt_regs *regs);
-extern void ist_end_non_atomic(void);
 
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(const char *message,
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 54165f3..98bf91c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1352,13 +1352,15 @@ void notrace do_machine_check(struct pt_regs *regs, long error_code)
 
 	/* Fault was in user mode and we need to take some action */
 	if ((m.cs & 3) == 3) {
-		ist_begin_non_atomic(regs);
+		/* If this triggers there is no way to recover. Die hard. */
+		BUG_ON(!on_thread_stack() || !user_mode(regs));
 		local_irq_enable();
+		preempt_enable();
 
 		if (kill_it || do_memory_failure(&m))
 			force_sig(SIGBUS);
+		preempt_disable();
 		local_irq_disable();
-		ist_end_non_atomic();
 	} else {
 		if (!fixup_exception(regs, X86_TRAP_MC, error_code, 0))
 			mce_panic("Failed kernel mode recovery", &m, msg);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d54cffd..6740e83 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -117,43 +117,6 @@ void ist_exit(struct pt_regs *regs)
 		rcu_nmi_exit();
 }
 
-/**
- * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
- * @regs:	regs passed to the IST exception handler
- *
- * IST exception handlers normally cannot schedule.  As a special
- * exception, if the exception interrupted userspace code (i.e.
- * user_mode(regs) would return true) and the exception was not
- * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
- * begins a non-atomic section within an ist_enter()/ist_exit() region.
- * Callers are responsible for enabling interrupts themselves inside
- * the non-atomic section, and callers must call ist_end_non_atomic()
- * before ist_exit().
- */
-void ist_begin_non_atomic(struct pt_regs *regs)
-{
-	BUG_ON(!user_mode(regs));
-
-	/*
-	 * Sanity check: we need to be on the normal thread stack.  This
-	 * will catch asm bugs and any attempt to use ist_preempt_enable
-	 * from double_fault.
-	 */
-	BUG_ON(!on_thread_stack());
-
-	preempt_enable_no_resched();
-}
-
-/**
- * ist_end_non_atomic() - begin a non-atomic section in an IST exception
- *
- * Ends a non-atomic section started with ist_begin_non_atomic().
- */
-void ist_end_non_atomic(void)
-{
-	preempt_disable();
-}
-
 int is_valid_bugaddr(unsigned long addr)
 {
 	unsigned short ud;

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] lockdep: Always inline lockdep_{off,on}()
  2020-05-05 13:16 ` [patch V4 part 1 30/36] lockdep: Always inline lockdep_{off,on}() Thomas Gleixner
  2020-05-13 23:46   ` Mathieu Desnoyers
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steven Rostedt, Peter Zijlstra (Intel),
	Thomas Gleixner, Alexandre Chartre, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     e616cb8daadf637175af4fe53138a94c190c4816
Gitweb:        https://git.kernel.org/tip/e616cb8daadf637175af4fe53138a94c190c4816
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 24 Feb 2020 22:14:51 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:18 +02:00

lockdep: Always inline lockdep_{off,on}()

These functions are called {early,late} in nmi_{enter,exit} and should
not be traced or probed. They are also puny, so 'inline' them.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de


---
 include/linux/lockdep.h  | 23 +++++++++++++++++++++--
 kernel/locking/lockdep.c | 19 -------------------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 206774a..8fce5c9 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -308,8 +308,27 @@ extern void lockdep_set_selftest_task(struct task_struct *task);
 
 extern void lockdep_init_task(struct task_struct *task);
 
-extern void lockdep_off(void);
-extern void lockdep_on(void);
+/*
+ * Split the recrursion counter in two to readily detect 'off' vs recursion.
+ */
+#define LOCKDEP_RECURSION_BITS	16
+#define LOCKDEP_OFF		(1U << LOCKDEP_RECURSION_BITS)
+#define LOCKDEP_RECURSION_MASK	(LOCKDEP_OFF - 1)
+
+/*
+ * lockdep_{off,on}() are macros to avoid tracing and kprobes; not inlines due
+ * to header dependencies.
+ */
+
+#define lockdep_off()					\
+do {							\
+	current->lockdep_recursion += LOCKDEP_OFF;	\
+} while (0)
+
+#define lockdep_on()					\
+do {							\
+	current->lockdep_recursion -= LOCKDEP_OFF;	\
+} while (0)
 
 extern void lockdep_register_key(struct lock_class_key *key);
 extern void lockdep_unregister_key(struct lock_class_key *key);
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index ac10db6..6f1c8cb 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -393,25 +393,6 @@ void lockdep_init_task(struct task_struct *task)
 	task->lockdep_recursion = 0;
 }
 
-/*
- * Split the recrursion counter in two to readily detect 'off' vs recursion.
- */
-#define LOCKDEP_RECURSION_BITS	16
-#define LOCKDEP_OFF		(1U << LOCKDEP_RECURSION_BITS)
-#define LOCKDEP_RECURSION_MASK	(LOCKDEP_OFF - 1)
-
-void lockdep_off(void)
-{
-	current->lockdep_recursion += LOCKDEP_OFF;
-}
-EXPORT_SYMBOL(lockdep_off);
-
-void lockdep_on(void)
-{
-	current->lockdep_recursion -= LOCKDEP_OFF;
-}
-EXPORT_SYMBOL(lockdep_on);
-
 static inline void lockdep_recursion_finish(void)
 {
 	if (WARN_ON_ONCE(--current->lockdep_recursion))

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] hardirq/nmi: Allow nested nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 28/36] hardirq/nmi: Allow nested nmi_enter() Thomas Gleixner
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Thomas Gleixner, Marc Zyngier, Will Deacon, Michael Ellerman,
	x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     69ea03b56ed2c7189ccd0b5910ad39f3cad1df21
Gitweb:        https://git.kernel.org/tip/69ea03b56ed2c7189ccd0b5910ad39f3cad1df21
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 19 Feb 2020 09:46:47 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:17 +02:00

hardirq/nmi: Allow nested nmi_enter()

Since there are already a number of sites (ARM64, PowerPC) that effectively
nest nmi_enter(), make the primitive support this before adding even more.
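
To make the nesting budget concrete, here is a small, self-contained
user-space sketch (an illustration, not part of the patch) of how the
widened 4-bit NMI field in preempt_count counts nested nmi_enter()/nmi_exit()
pairs; the BUG_ON() in the real nmi_enter() corresponds to the assert() here:

#include <assert.h>
#include <stdio.h>

/* Mirrors the new layout: a 4-bit NMI field at shift 20 (illustrative only). */
#define NMI_SHIFT	20
#define NMI_BITS	4
#define NMI_OFFSET	(1u << NMI_SHIFT)
#define NMI_MASK	(((1u << NMI_BITS) - 1) << NMI_SHIFT)

static unsigned int pc;			/* stand-in for preempt_count */

#define in_nmi()	(pc & NMI_MASK)

static void nmi_enter(void)
{
	assert(in_nmi() != NMI_MASK);	/* the BUG_ON() in the real nmi_enter() */
	pc += NMI_OFFSET;
}

static void nmi_exit(void)
{
	pc -= NMI_OFFSET;
}

int main(void)
{
	int i;

	for (i = 0; i < 15; i++)	/* up to 15 nested entries are fine */
		nmi_enter();
	for (i = 0; i < 15; i++)
		nmi_exit();
	printf("nested 15 levels and unwound, in_nmi() = %u\n", in_nmi());
	return 0;
}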

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lkml.kernel.org/r/20200505134100.864179229@linutronix.de


---
 arch/arm64/kernel/sdei.c    | 14 ++------------
 arch/arm64/kernel/traps.c   |  8 ++------
 arch/powerpc/kernel/traps.c | 22 ++++++----------------
 include/linux/hardirq.h     |  5 ++++-
 include/linux/preempt.h     |  4 ++--
 5 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
index d6259da..e396e69 100644
--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -251,22 +251,12 @@ asmlinkage __kprobes notrace unsigned long
 __sdei_handler(struct pt_regs *regs, struct sdei_registered_event *arg)
 {
 	unsigned long ret;
-	bool do_nmi_exit = false;
 
-	/*
-	 * nmi_enter() deals with printk() re-entrance and use of RCU when
-	 * RCU believed this CPU was idle. Because critical events can
-	 * interrupt normal events, we may already be in_nmi().
-	 */
-	if (!in_nmi()) {
-		nmi_enter();
-		do_nmi_exit = true;
-	}
+	nmi_enter();
 
 	ret = _sdei_handler(regs, arg);
 
-	if (do_nmi_exit)
-		nmi_exit();
+	nmi_exit();
 
 	return ret;
 }
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index cf402be..c728f16 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -906,17 +906,13 @@ bool arm64_is_fatal_ras_serror(struct pt_regs *regs, unsigned int esr)
 
 asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr)
 {
-	const bool was_in_nmi = in_nmi();
-
-	if (!was_in_nmi)
-		nmi_enter();
+	nmi_enter();
 
 	/* non-RAS errors are not containable */
 	if (!arm64_is_ras_serror(esr) || arm64_is_fatal_ras_serror(regs, esr))
 		arm64_serror_panic(regs, esr);
 
-	if (!was_in_nmi)
-		nmi_exit();
+	nmi_exit();
 }
 
 asmlinkage void enter_from_user_mode(void)
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 3fca222..b44dd75 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -441,15 +441,9 @@ nonrecoverable:
 void system_reset_exception(struct pt_regs *regs)
 {
 	unsigned long hsrr0, hsrr1;
-	bool nested = in_nmi();
 	bool saved_hsrrs = false;
 
-	/*
-	 * Avoid crashes in case of nested NMI exceptions. Recoverability
-	 * is determined by RI and in_nmi
-	 */
-	if (!nested)
-		nmi_enter();
+	nmi_enter();
 
 	/*
 	 * System reset can interrupt code where HSRRs are live and MSR[RI]=1.
@@ -521,8 +515,7 @@ out:
 		mtspr(SPRN_HSRR1, hsrr1);
 	}
 
-	if (!nested)
-		nmi_exit();
+	nmi_exit();
 
 	/* What should we do here? We could issue a shutdown or hard reset. */
 }
@@ -823,9 +816,8 @@ int machine_check_generic(struct pt_regs *regs)
 void machine_check_exception(struct pt_regs *regs)
 {
 	int recover = 0;
-	bool nested = in_nmi();
-	if (!nested)
-		nmi_enter();
+
+	nmi_enter();
 
 	__this_cpu_inc(irq_stat.mce_exceptions);
 
@@ -851,8 +843,7 @@ void machine_check_exception(struct pt_regs *regs)
 	if (check_io_access(regs))
 		goto bail;
 
-	if (!nested)
-		nmi_exit();
+	nmi_exit();
 
 	die("Machine check", regs, SIGBUS);
 
@@ -863,8 +854,7 @@ void machine_check_exception(struct pt_regs *regs)
 	return;
 
 bail:
-	if (!nested)
-		nmi_exit();
+	nmi_exit();
 }
 
 void SMIException(struct pt_regs *regs)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 7c8b82f..a043ad8 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -65,13 +65,16 @@ extern void irq_exit(void);
 #define arch_nmi_exit()		do { } while (0)
 #endif
 
+/*
+ * nmi_enter() can nest up to 15 times; see NMI_BITS.
+ */
 #define nmi_enter()						\
 	do {							\
 		arch_nmi_enter();				\
 		printk_nmi_enter();				\
 		lockdep_off();					\
 		ftrace_nmi_enter();				\
-		BUG_ON(in_nmi());				\
+		BUG_ON(in_nmi() == NMI_MASK);			\
 		preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index bc3f1ae..7d9c1c0 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -26,13 +26,13 @@
  *         PREEMPT_MASK:	0x000000ff
  *         SOFTIRQ_MASK:	0x0000ff00
  *         HARDIRQ_MASK:	0x000f0000
- *             NMI_MASK:	0x00100000
+ *             NMI_MASK:	0x00f00000
  * PREEMPT_NEED_RESCHED:	0x80000000
  */
 #define PREEMPT_BITS	8
 #define SOFTIRQ_BITS	8
 #define HARDIRQ_BITS	4
-#define NMI_BITS	1
+#define NMI_BITS	4
 
 #define PREEMPT_SHIFT	0
 #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] arm64: Prepare arch_nmi_enter() for recursion
  2020-05-05 13:16 ` [patch V4 part 1 27/36] arm64: Prepare arch_nmi_enter() for recursion Thomas Gleixner
  2020-05-13 23:28   ` Mathieu Desnoyers
  2020-05-15 21:29   ` Thomas Gleixner
@ 2020-05-19 19:52   ` tip-bot2 for Frederic Weisbecker
  2 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Frederic Weisbecker @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Frederic Weisbecker, Peter Zijlstra (Intel),
	Thomas Gleixner, Alexandre Chartre, Will Deacon, Catalin Marinas,
	x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     28f6bf9e247fe23d177cfdbf7e709270e8cc7fa6
Gitweb:        https://git.kernel.org/tip/28f6bf9e247fe23d177cfdbf7e709270e8cc7fa6
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Thu, 27 Feb 2020 09:51:40 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:17 +02:00

arm64: Prepare arch_nmi_enter() for recursion

When using nmi_enter() recursively, arch_nmi_enter() must also be recursion
safe. In particular, it must be ensured that HCR_TGE is always set while in
NMI context when in HYP mode, and restored to its former state when done.

The current code fails this when invocations are interleaved incorrectly;
notably, it overwrites the original HCR state on nesting.

Introduce a nesting counter to make sure to store the original value.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lkml.kernel.org/r/20200505134100.771491291@linutronix.de


---
 arch/arm64/include/asm/hardirq.h | 78 +++++++++++++++++++++++--------
 1 file changed, 59 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h
index 87ad961..985493a 100644
--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -32,30 +32,70 @@ u64 smp_irq_stat_cpu(unsigned int cpu);
 
 struct nmi_ctx {
 	u64 hcr;
+	unsigned int cnt;
 };
 
 DECLARE_PER_CPU(struct nmi_ctx, nmi_contexts);
 
-#define arch_nmi_enter()							\
-	do {									\
-		if (is_kernel_in_hyp_mode()) {					\
-			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
-			nmi_ctx->hcr = read_sysreg(hcr_el2);			\
-			if (!(nmi_ctx->hcr & HCR_TGE)) {			\
-				write_sysreg(nmi_ctx->hcr | HCR_TGE, hcr_el2);	\
-				isb();						\
-			}							\
-		}								\
-	} while (0)
+#define arch_nmi_enter()						\
+do {									\
+	struct nmi_ctx *___ctx;						\
+	u64 ___hcr;							\
+									\
+	if (!is_kernel_in_hyp_mode())					\
+		break;							\
+									\
+	___ctx = this_cpu_ptr(&nmi_contexts);				\
+	if (___ctx->cnt) {						\
+		___ctx->cnt++;						\
+		break;							\
+	}								\
+									\
+	___hcr = read_sysreg(hcr_el2);					\
+	if (!(___hcr & HCR_TGE)) {					\
+		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
+		isb();							\
+	}								\
+	/*								\
+	 * Make sure the sysreg write is performed before ___ctx->cnt	\
+	 * is set to 1. NMIs that see cnt == 1 will rely on us.		\
+	 */								\
+	barrier();							\
+	___ctx->cnt = 1;                                                \
+	/*								\
+	 * Make sure ___ctx->cnt is set before we save ___hcr. We	\
+	 * don't want ___ctx->hcr to be overwritten.			\
+	 */								\
+	barrier();							\
+	___ctx->hcr = ___hcr;						\
+} while (0)
 
-#define arch_nmi_exit()								\
-	do {									\
-		if (is_kernel_in_hyp_mode()) {					\
-			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
-			if (!(nmi_ctx->hcr & HCR_TGE))				\
-				write_sysreg(nmi_ctx->hcr, hcr_el2);		\
-		}								\
-	} while (0)
+#define arch_nmi_exit()							\
+do {									\
+	struct nmi_ctx *___ctx;						\
+	u64 ___hcr;							\
+									\
+	if (!is_kernel_in_hyp_mode())					\
+		break;							\
+									\
+	___ctx = this_cpu_ptr(&nmi_contexts);				\
+	___hcr = ___ctx->hcr;						\
+	/*								\
+	 * Make sure we read ___ctx->hcr before we release		\
+	 * ___ctx->cnt as it makes ___ctx->hcr updatable again.		\
+	 */								\
+	barrier();							\
+	___ctx->cnt--;							\
+	/*								\
+	 * Make sure ___ctx->cnt release is visible before we		\
+	 * restore the sysreg. Otherwise a new NMI occurring		\
+	 * right after write_sysreg() can be fooled and think		\
+	 * we secured things for it.					\
+	 */								\
+	barrier();							\
+	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
+		write_sysreg(___hcr, hcr_el2);				\
+} while (0)
 
 static inline void ack_bad_irq(unsigned int irq)
 {

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] printk: Disallow instrumenting print_nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 31/36] printk: Disallow instrumenting print_nmi_enter() Thomas Gleixner
@ 2020-05-19 19:52   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner, Alexandre Chartre, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     b0f51883f551b900a04a80f49fb0886caf7e9a12
Gitweb:        https://git.kernel.org/tip/b0f51883f551b900a04a80f49fb0886caf7e9a12
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 24 Feb 2020 22:25:03 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:16 +02:00

printk: Disallow instrumenting print_nmi_enter()

printk_nmi_enter() is called early in nmi_enter(); no tracing, probing or
other funnies are allowed there. This matters specifically because
nmi_enter() will be used in do_debug(), which would cause recursive
exceptions when kprobed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.139720912@linutronix.de


---
 kernel/printk/printk_safe.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index e8791f2..4242403 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -10,6 +10,7 @@
 #include <linux/cpumask.h>
 #include <linux/irq_work.h>
 #include <linux/printk.h>
+#include <linux/kprobes.h>
 
 #include "internal.h"
 
@@ -293,12 +294,12 @@ static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)
 	return printk_safe_log_store(s, fmt, args);
 }
 
-void notrace printk_nmi_enter(void)
+void noinstr printk_nmi_enter(void)
 {
 	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }
 
-void notrace printk_nmi_exit(void)
+void noinstr printk_nmi_exit(void)
 {
 	this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] printk: Prepare for nested printk_nmi_enter()
  2020-05-05 13:16 ` [patch V4 part 1 26/36] printk: Prepare for nested printk_nmi_enter() Thomas Gleixner
@ 2020-05-19 19:52   ` tip-bot2 for Petr Mladek
  0 siblings, 0 replies; 178+ messages in thread
From: tip-bot2 for Petr Mladek @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Petr Mladek, Peter Zijlstra (Intel),
	Thomas Gleixner, Alexandre Chartre, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     8c4e93c362ff114def211d4629b120af86eb1275
Gitweb:        https://git.kernel.org/tip/8c4e93c362ff114def211d4629b120af86eb1275
Author:        Petr Mladek <pmladek@suse.com>
AuthorDate:    Mon, 24 Feb 2020 13:13:31 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:51:16 +02:00

printk: Prepare for nested printk_nmi_enter()

There is plenty of space in the printk_context variable. Reserve one byte
there for the NMI context to be on the safe side.

It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter()
will trigger much earlier.
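
For illustration (not part of the patch): with the old set/clear bitmask, a
nested printk_nmi_exit() would have cleared the NMI flag for the still-active
outer context, while the add/sub counter keeps the context marked until the
outermost exit. A minimal user-space sketch of the counter behaviour, using
the new constants:

#include <assert.h>

#define PRINTK_NMI_CONTEXT_OFFSET	0x010000000ULL
#define PRINTK_NMI_CONTEXT_MASK		0xff0000000ULL

static unsigned long long printk_context;	/* stand-in for the per-CPU variable */

static void printk_nmi_enter(void) { printk_context += PRINTK_NMI_CONTEXT_OFFSET; }
static void printk_nmi_exit(void)  { printk_context -= PRINTK_NMI_CONTEXT_OFFSET; }

int main(void)
{
	printk_nmi_enter();		/* outer NMI */
	printk_nmi_enter();		/* nested NMI */
	printk_nmi_exit();		/* nested exit */
	/* Outer context is still marked as being in NMI. */
	assert(printk_context & PRINTK_NMI_CONTEXT_MASK);
	printk_nmi_exit();
	assert(!(printk_context & PRINTK_NMI_CONTEXT_MASK));
	return 0;
}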

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de



---
 kernel/printk/internal.h    | 8 +++++---
 kernel/printk/printk_safe.c | 4 ++--
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index b2b0f52..660f9a6 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -6,9 +6,11 @@
 
 #ifdef CONFIG_PRINTK
 
-#define PRINTK_SAFE_CONTEXT_MASK	 0x3fffffff
-#define PRINTK_NMI_DIRECT_CONTEXT_MASK	 0x40000000
-#define PRINTK_NMI_CONTEXT_MASK		 0x80000000
+#define PRINTK_SAFE_CONTEXT_MASK	0x007ffffff
+#define PRINTK_NMI_DIRECT_CONTEXT_MASK	0x008000000
+#define PRINTK_NMI_CONTEXT_MASK		0xff0000000
+
+#define PRINTK_NMI_CONTEXT_OFFSET	0x010000000
 
 extern raw_spinlock_t logbuf_lock;
 
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index d9a659a..e8791f2 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -295,12 +295,12 @@ static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)
 
 void notrace printk_nmi_enter(void)
 {
-	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
+	this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }
 
 void notrace printk_nmi_exit(void)
 {
-	this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);
+	this_cpu_sub(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
 }
 
 /*

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/rcu] vmlinux.lds.h: Create section for protection against instrumentation
  2020-05-05 13:16 ` [patch V4 part 1 20/36] vmlinux.lds.h: Create section for protection against instrumentation Thomas Gleixner
  2020-05-06 16:08   ` Sean Christopherson
@ 2020-05-19 19:52   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra, x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     6553896666433e7efec589838b400a2a652b3ffa
Gitweb:        https://git.kernel.org/tip/6553896666433e7efec589838b400a2a652b3ffa
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 09 Mar 2020 22:47:17 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:47:20 +02:00

vmlinux.lds.h: Create section for protection against instrumentation

Some code paths, especially the low level entry code, must be protected
against instrumentation for various reasons:

 - Low level entry code can be a fragile beast, especially on x86.

 - With NO_HZ_FULL, RCU state needs to be established before using it.

Having a dedicated section for such code allows tooling to validate that
no unsafe functions are invoked.

Add the .noinstr.text section and the noinstr attribute to mark
functions. noinstr implies notrace. Kprobes will gain a section check
later.

Also provide a set of markers: instrumentation_begin()/end()

These are used to mark code inside a noinstr function which calls into the
regular, instrumentable text section as safe.

The instrumentation markers are only active when CONFIG_DEBUG_ENTRY is
enabled as the end marker emits a NOP to prevent the compiler from merging
the annotation points. This means the objtool verification requires a
kernel compiled with this option.
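
A rough usage sketch of the intended annotation pattern (a hypothetical
example, not taken from this patch; handle_fragile_entry() and
do_regular_work() are made-up names):

noinstr void handle_fragile_entry(struct pt_regs *regs)
{
	/* Fragile work first: nothing here may be traced, probed or patched. */

	instrumentation_begin();
	/*
	 * From here until instrumentation_end(), calls into the regular,
	 * instrumentable .text section are considered safe by objtool.
	 */
	do_regular_work(regs);
	instrumentation_end();

	/* Back to non-instrumentable code until the function returns. */
}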

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134100.075416272@linutronix.de
---
 arch/powerpc/kernel/vmlinux.lds.S |  1 +-
 include/asm-generic/sections.h    |  3 ++-
 include/asm-generic/vmlinux.lds.h | 10 ++++++-
 include/linux/compiler.h          | 53 ++++++++++++++++++++++++++++++-
 include/linux/compiler_types.h    |  4 ++-
 scripts/mod/modpost.c             |  2 +-
 6 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 31a0f20..a1706b6 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -90,6 +90,7 @@ SECTIONS
 #ifdef CONFIG_PPC64
 		*(.tramp.ftrace.text);
 #endif
+		NOINSTR_TEXT
 		SCHED_TEXT
 		CPUIDLE_TEXT
 		LOCK_TEXT
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index d1779d4..66397ed 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -53,6 +53,9 @@ extern char __ctors_start[], __ctors_end[];
 /* Start and end of .opd section - used for function descriptors. */
 extern char __start_opd[], __end_opd[];
 
+/* Start and end of instrumentation protected text section */
+extern char __noinstr_text_start[], __noinstr_text_end[];
+
 extern __visible const void __nosave_begin, __nosave_end;
 
 /* Function descriptor handling (if any).  Override in asm/sections.h */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 71e387a..db600ef 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -541,6 +541,15 @@
 	__end_rodata = .;
 
 /*
+ * Non-instrumentable text section
+ */
+#define NOINSTR_TEXT							\
+		ALIGN_FUNCTION();					\
+		__noinstr_text_start = .;				\
+		*(.noinstr.text)					\
+		__noinstr_text_end = .;
+
+/*
  * .text section. Map to function alignment to avoid address changes
  * during second ld run in second ld pass when generating System.map
  *
@@ -551,6 +560,7 @@
 #define TEXT_TEXT							\
 		ALIGN_FUNCTION();					\
 		*(.text.hot TEXT_MAIN .text.fixup .text.unlikely)	\
+		NOINSTR_TEXT						\
 		*(.text..refcount)					\
 		*(.ref.text)						\
 	MEM_KEEP(init.text*)						\
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 034b0a6..e9ead05 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -120,12 +120,65 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 /* Annotate a C jump table to allow objtool to follow the code flow */
 #define __annotate_jump_table __section(.rodata..c_jump_table)
 
+#ifdef CONFIG_DEBUG_ENTRY
+/* Begin/end of an instrumentation safe region */
+#define instrumentation_begin() ({					\
+	asm volatile("%c0:\n\t"						\
+		     ".pushsection .discard.instr_begin\n\t"		\
+		     ".long %c0b - .\n\t"				\
+		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+})
+
+/*
+ * Because instrumentation_{begin,end}() can nest, objtool validation considers
+ * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
+ * When the value is greater than 0, we consider instrumentation allowed.
+ *
+ * There is a problem with code like:
+ *
+ * noinstr void foo()
+ * {
+ *	instrumentation_begin();
+ *	...
+ *	if (cond) {
+ *		instrumentation_begin();
+ *		...
+ *		instrumentation_end();
+ *	}
+ *	bar();
+ *	instrumentation_end();
+ * }
+ *
+ * If instrumentation_end() would be an empty label, like all the other
+ * annotations, the inner _end(), which is at the end of a conditional block,
+ * would land on the instruction after the block.
+ *
+ * If we then consider the sum of the !cond path, we'll see that the call to
+ * bar() is with a 0-value, even though, we meant it to happen with a positive
+ * value.
+ *
+ * To avoid this, have _end() be a NOP instruction, this ensures it will be
+ * part of the condition block and does not escape.
+ */
+#define instrumentation_end() ({					\
+	asm volatile("%c0: nop\n\t"					\
+		     ".pushsection .discard.instr_end\n\t"		\
+		     ".long %c0b - .\n\t"				\
+		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+})
+#endif /* CONFIG_DEBUG_ENTRY */
+
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
 #define __annotate_jump_table
 #endif
 
+#ifndef instrumentation_begin
+#define instrumentation_begin()		do { } while(0)
+#define instrumentation_end()		do { } while(0)
+#endif
+
 #ifndef ASM_UNREACHABLE
 # define ASM_UNREACHABLE
 #endif
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index e970f97..5da257c 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -118,6 +118,10 @@ struct ftrace_likely_data {
 #define notrace			__attribute__((__no_instrument_function__))
 #endif
 
+/* Section for code which can't be instrumented at all */
+#define noinstr								\
+	noinline notrace __attribute((__section__(".noinstr.text")))
+
 /*
  * it doesn't make sense on ARM (currently the only user of __naked)
  * to trace naked functions because then mcount is called without
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 5c3c50c..0053d4f 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -948,7 +948,7 @@ static void check_section(const char *modname, struct elf_info *elf,
 
 #define DATA_SECTIONS ".data", ".data.rel"
 #define TEXT_SECTIONS ".text", ".text.unlikely", ".sched.text", \
-		".kprobes.text", ".cpuidle.text"
+		".kprobes.text", ".cpuidle.text", ".noinstr.text"
 #define OTHER_TEXT_SECTIONS ".ref.text", ".head.text", ".spinlock.text", \
 		".fixup", ".entry.text", ".exception.text", ".text.*", \
 		".coldtext"

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [tip: core/kprobes] kprobes: Prevent probes in .noinstr.text section
  2020-05-05 13:16 ` [patch V4 part 1 21/36] kprobes: Prevent probes in .noinstr.text section Thomas Gleixner
  2020-05-08  6:30   ` Masami Hiramatsu
@ 2020-05-19 19:52   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 178+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-19 19:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Alexandre Chartre, Peter Zijlstra,
	Masami Hiramatsu, x86, LKML

The following commit has been merged into the core/kprobes branch of tip:

Commit-ID:     66e9b0717102507e64f638790eaece88765cc9e5
Gitweb:        https://git.kernel.org/tip/66e9b0717102507e64f638790eaece88765cc9e5
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 10 Mar 2020 14:04:34 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 19 May 2020 15:56:20 +02:00

kprobes: Prevent probes in .noinstr.text section

Instrumentation is forbidden in the .noinstr.text section. Make kprobes
respect this.
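
A rough sketch of the kind of address-range check this implies (an
assumption, not the literal patch), reusing the __noinstr_text_start/end
section markers introduced earlier in the series:

/* Sketch only: refuse to plant a probe inside the .noinstr.text range. */
extern char __noinstr_text_start[], __noinstr_text_end[];

static bool addr_in_noinstr_text(unsigned long addr)
{
	return addr >= (unsigned long)__noinstr_text_start &&
	       addr <  (unsigned long)__noinstr_text_end;
}

/*
 * A probe registration path could then reject such addresses, e.g.:
 *
 *	if (addr_in_noinstr_text((unsigned long)p->addr))
 *		return -EINVAL;
 */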

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.ke