* [patch V6 00/37] x86/entry: Rework leftovers and merge plan
@ 2020-05-15 23:45 Thomas Gleixner
  2020-05-15 23:45 ` [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns() Thomas Gleixner
                   ` (39 more replies)
  0 siblings, 40 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Folks!

This is V6 of the rework series. V5 can be found here:

  https://lore.kernel.org/r/20200512210059.056244513@linutronix.de

The V6 leftover series is based on:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6

which is the reworked base series from parts 1-4 of the original 5 part
series with a few changes which are described in detail below in the merge
plan section.

V6 has the following changes vs. V5:

    - Rebased on top of entry-base-v6

    - Addressed Steven's request to split up the hardware latency detector.
      These are 3 patches now as I couldn't resist cleaning up the
      timestamping mess in that code before splitting it up.
    
    - Dropped the KVM/SVM change as that is going to be routed
      differently. See below.

The full series is available from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v6-the-rest

On top of that the kvm changes are applied for completeness and available
from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-v6-full


Merge plan:
-----------

After figuring out that the entry pile and next are not really happy with
each other, I spent quite some time coming up with a plan.

The goal was to:

    - not let Stephen Rothwell grow more grey hair when trying to resolve
      the conflicts

    - allow the affected trees (RCU and KVM) to take a small part of the
      series into their trees while making sure that the x86/entry branch
      is complete and contains the required RCU and KVM changes as well.

About 10 hours of patch tetris later the solution looks like this:

  I've reshuffled the patches so that they are grouped by subsystem instead
  of having the cross tree/subsystem patches close to the actual usage site
  in the x86/entry series.

  This allowed me to tag these subsystem parts and they contain just the
  minimal subset of changes needed to build and boot.

The resulting tag list is:

  - noinstr-lds-2020-05-15

    A single commit containing the vmlinux.lds.h change which introduces
    the noinstr.text section.

  - noinstr-core-2020-05-15

    Based on noinstr-lds-2020-05-15 and contains the core changes

  - noinstr-core-for-kvm-2020-05-15

    Subset of noinstr-core-2020-05-15 which is required to let KVM pull
    the KVM async pagefault cleanup and base the guest_enter/exit() and
    noinstr changes on top.

  - noinstr-rcu-nmi-2020-05-15

    Based on the core/rcu branch in the tip tree. It has merged in
    noinstr-lds-2020-05-15 and contains the nmi_enter/exit() changes along
    with the noinstr section changes on top.

    This tag is intended to be pulled by Paul into his rcu/next branch so
    he can sort the conflicts and base further work on top.

  - noinstr-x86-kvm-2020-05-15

    Based on noinstr-core-for-kvm-2020-05-15 and contains the async page
    fault cleanup which goes into x86/entry so the IDTENTRY conversion of
    #PF which also touches the async pagefault code can be applied on top

    This tag is intended to be pulled by Paolo into his next branch so he
    can work against these changes and the merged result is also target for
    the rebased version of the KVM guest_enter/exit() changes. These are
    not part of the entry-v6-base tag. I'm going to post them as a separate
    series because the original ones are conflicting with work in that area
    in the KVM tree.

  - noinstr-kcsan-2020-05-15, noinstr-kprobes-2020-05-15,
    noinstr-objtool-2020-05-15

    TIP tree internal tags which I added to reduce the brain-melt.

The x86/entry branch is based on the TIP x86/entry branch and has the
following branches and tags merged and patches from part 1-4 applied:

    - x86/asm because this has conflicting changes vs. #DF

    - A small set of preparatory changes and fixes which are independent
      of the noinstr mechanics

    - noinstr-objtool-2020-05-15
    - noinstr-core-2020-05-15
    - noinstr-kprobes-2020-05-15
    - noinstr-rcu-nmi-2020-05-15
    - noinstr-kcsan-2020-05-15
    - noinstr-x86-kvm-2020-05-15
    
    - The part 1-4 patches up to

        51336ff8b658 ("x86/entry: Convert double fault exception to IDTENTRY_DF")

      This is tagged as entry-v6-base

The remaining patches in this leftover series will be applied on top.

If this works for all maintainers involved, then I'm going to pull the tags
and branches into the tip-tree which makes them immutable.

If not, please tell me ASAP that I should restart the patch tetris session
after hiding in a brown paper bag for some time to recover from brain melt.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 159+ messages in thread

* [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-19 21:26   ` Steven Rostedt
  2020-05-20 20:14   ` Peter Zijlstra
  2020-05-15 23:45 ` [patch V6 02/37] tracing/hwlat: Split ftrace_nmi_enter/exit() Thomas Gleixner
                   ` (38 subsequent siblings)
  39 siblings, 2 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Timestamping in the hardware latency detector uses sched_clock() underneath
and depends on CONFIG_GENERIC_SCHED_CLOCK=n because sched clocks from that
subsystem are not NMI safe.

ktime_get_mono_fast_ns() is NMI safe and available on all architectures.

Replace the time getter, get rid of the CONFIG_GENERIC_SCHED_CLOCK=n
dependency and clean up the horrible macro maze which encapsulates u64 math
in u64 macros.
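
The conversion is mechanical. For example, the inner loop delta computation

	diff = time_to_us(time_sub(t2, t1));

becomes plain u64 math:

	diff = div_u64(t2 - t1, NSEC_PER_USEC);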

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/trace/trace_hwlat.c |   59 +++++++++++++++++++--------------------------
 1 file changed, 25 insertions(+), 34 deletions(-)

--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -131,29 +131,19 @@ static void trace_hwlat_sample(struct hw
 		trace_buffer_unlock_commit_nostack(buffer, event);
 }
 
-/* Macros to encapsulate the time capturing infrastructure */
-#define time_type	u64
-#define time_get()	trace_clock_local()
-#define time_to_us(x)	div_u64(x, 1000)
-#define time_sub(a, b)	((a) - (b))
-#define init_time(a, b)	(a = b)
-#define time_u64(a)	a
-
+/*
+ * Timestamping uses ktime_get_mono_fast(), the NMI safe access to
+ * CLOCK_MONOTONIC.
+ */
 void trace_hwlat_callback(bool enter)
 {
 	if (smp_processor_id() != nmi_cpu)
 		return;
 
-	/*
-	 * Currently trace_clock_local() calls sched_clock() and the
-	 * generic version is not NMI safe.
-	 */
-	if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
-		if (enter)
-			nmi_ts_start = time_get();
-		else
-			nmi_total_ts += time_get() - nmi_ts_start;
-	}
+	if (enter)
+		nmi_ts_start = ktime_get_mono_fast_ns();
+	else
+		nmi_total_ts += ktime_get_mono_fast_ns() - nmi_ts_start;
 
 	if (enter)
 		nmi_count++;
@@ -165,20 +155,22 @@ void trace_hwlat_callback(bool enter)
  * Used to repeatedly capture the CPU TSC (or similar), looking for potential
  * hardware-induced latency. Called with interrupts disabled and with
  * hwlat_data.lock held.
+ *
+ * Use ktime_get_mono_fast() here as well because it does not wait on the
+ * timekeeping seqcount like ktime_get_mono().
  */
 static int get_sample(void)
 {
 	struct trace_array *tr = hwlat_trace;
 	struct hwlat_sample s;
-	time_type start, t1, t2, last_t2;
+	u64 start, t1, t2, last_t2, thresh;
 	s64 diff, outer_diff, total, last_total = 0;
 	u64 sample = 0;
-	u64 thresh = tracing_thresh;
 	u64 outer_sample = 0;
 	int ret = -1;
 	unsigned int count = 0;
 
-	do_div(thresh, NSEC_PER_USEC); /* modifies interval value */
+	thresh = div_u64(tracing_thresh, NSEC_PER_USEC);
 
 	nmi_cpu = smp_processor_id();
 	nmi_total_ts = 0;
@@ -188,18 +180,20 @@ static int get_sample(void)
 
 	trace_hwlat_callback_enabled = true;
 
-	init_time(last_t2, 0);
-	start = time_get(); /* start timestamp */
+	/* start timestamp */
+	start = ktime_get_mono_fast_ns();
 	outer_diff = 0;
+	last_t2 = 0;
 
 	do {
 
-		t1 = time_get();	/* we'll look for a discontinuity */
-		t2 = time_get();
+		/* we'll look for a discontinuity */
+		t1 = ktime_get_mono_fast_ns();
+		t2 = ktime_get_mono_fast_ns();
 
-		if (time_u64(last_t2)) {
+		if (last_t2) {
 			/* Check the delta from outer loop (t2 to next t1) */
-			outer_diff = time_to_us(time_sub(t1, last_t2));
+			outer_diff = div_u64(t1 - last_t2, NSEC_PER_USEC);
 			/* This shouldn't happen */
 			if (outer_diff < 0) {
 				pr_err(BANNER "time running backwards\n");
@@ -210,7 +204,8 @@ static int get_sample(void)
 		}
 		last_t2 = t2;
 
-		total = time_to_us(time_sub(t2, start)); /* sample width */
+		/* sample width */
+		total = div_u64(t2 - start, NSEC_PER_USEC);
 
 		/* Check for possible overflows */
 		if (total < last_total) {
@@ -220,7 +215,7 @@ static int get_sample(void)
 		last_total = total;
 
 		/* This checks the inner loop (t1 to t2) */
-		diff = time_to_us(time_sub(t2, t1));     /* current diff */
+		diff = div_u64(t2 - t1, NSEC_PER_USEC);
 
 		if (diff > thresh || outer_diff > thresh) {
 			if (!count)
@@ -251,15 +246,11 @@ static int get_sample(void)
 
 		ret = 1;
 
-		/* We read in microseconds */
-		if (nmi_total_ts)
-			do_div(nmi_total_ts, NSEC_PER_USEC);
-
 		hwlat_data.count++;
 		s.seqnum = hwlat_data.count;
 		s.duration = sample;
 		s.outer_duration = outer_sample;
-		s.nmi_total_ts = nmi_total_ts;
+		s.nmi_total_ts = div_u64(nmi_total_ts, NSEC_PER_USEC);
 		s.nmi_count = nmi_count;
 		s.count = count;
 		trace_hwlat_sample(&s);


^ permalink raw reply	[flat|nested] 159+ messages in thread

* [patch V6 02/37] tracing/hwlat: Split ftrace_nmi_enter/exit()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
  2020-05-15 23:45 ` [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns() Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-19 22:23   ` Steven Rostedt
  2020-05-15 23:45 ` [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace() Thomas Gleixner
                   ` (37 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

The hardware latency tracer calls into timekeeping and ends up in
various instrumentable functions which is problematic vs. the kprobe
handling, especially the text poke machinery. It's invoked from
nmi_enter/exit(), i.e. non-instrumentable code.

Split it into two parts:

 1) NMI counter, only invoked on nmi_enter() and noinstr safe

 2) NMI timestamping, to be invoked from instrumentable code

Move it into the RCU-is-watching regions of nmi_enter/exit(). There is no
actual RCU dependency right now, but there is also no point in having it
any earlier.

The actual split of nmi_enter/exit() is done in a separate step.
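
Condensed, nmi_enter() then ends up with both hooks, the counter being
noinstr safe and the timestamping hook sitting after the RCU/lockdep setup
(see the hardirq.h hunk below):

	rcu_nmi_enter();
	lockdep_hardirq_enter();
	ftrace_count_nmi();		/* counter only, noinstr safe */
	ftrace_nmi_handler_enter();	/* timestamping, instrumentable */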

Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/ftrace_irq.h |   31 +++++++++++++++++++------------
 include/linux/hardirq.h    |    5 +++--
 kernel/trace/trace_hwlat.c |   19 ++++++++++++-------
 3 files changed, 34 insertions(+), 21 deletions(-)

--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -4,23 +4,30 @@
 
 #ifdef CONFIG_HWLAT_TRACER
 extern bool trace_hwlat_callback_enabled;
-extern void trace_hwlat_callback(bool enter);
-#endif
+extern void trace_hwlat_count_nmi(void);
+extern void trace_hwlat_timestamp(bool enter);
 
-static inline void ftrace_nmi_enter(void)
+static __always_inline void ftrace_count_nmi(void)
 {
-#ifdef CONFIG_HWLAT_TRACER
-	if (trace_hwlat_callback_enabled)
-		trace_hwlat_callback(true);
-#endif
+	if (unlikely(trace_hwlat_callback_enabled))
+		trace_hwlat_count_nmi();
 }
 
-static inline void ftrace_nmi_exit(void)
+static __always_inline void ftrace_nmi_handler_enter(void)
 {
-#ifdef CONFIG_HWLAT_TRACER
-	if (trace_hwlat_callback_enabled)
-		trace_hwlat_callback(false);
-#endif
+	if (unlikely(trace_hwlat_callback_enabled))
+		trace_hwlat_timestamp(true);
 }
 
+static __always_inline void ftrace_nmi_handler_exit(void)
+{
+	if (unlikely(trace_hwlat_callback_enabled))
+		trace_hwlat_timestamp(false);
+}
+#else /* CONFIG_HWLAT_TRACER */
+static inline void ftrace_count_nmi(void) {}
+static inline void ftrace_nmi_handler_enter(void) {}
+static inline void ftrace_nmi_handler_exit(void) {}
+#endif
+
 #endif /* _LINUX_FTRACE_IRQ_H */
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -82,20 +82,21 @@ extern void irq_exit(void);
 		arch_nmi_enter();				\
 		printk_nmi_enter();				\
 		lockdep_off();					\
-		ftrace_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
 		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
+		ftrace_count_nmi();				\
+		ftrace_nmi_handler_enter();			\
 	} while (0)
 
 #define nmi_exit()						\
 	do {							\
+		ftrace_nmi_handler_exit();			\
 		lockdep_hardirq_exit();				\
 		rcu_nmi_exit();					\
 		BUG_ON(!in_nmi());				\
 		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
-		ftrace_nmi_exit();				\
 		lockdep_on();					\
 		printk_nmi_exit();				\
 		arch_nmi_exit();				\
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -132,21 +132,26 @@ static void trace_hwlat_sample(struct hw
 }
 
 /*
+ * Count NMIs in nmi_enter(). Does not take timestamps
+ * because the timestamping callchain cannot be invoked
+ * from noinstr sections.
+ */
+noinstr void trace_hwlat_count_nmi(void)
+{
+	if (smp_processor_id() == nmi_cpu)
+		nmi_count++;
+}
+
+/*
  * Timestamping uses ktime_get_mono_fast(), the NMI safe access to
  * CLOCK_MONOTONIC.
  */
-void trace_hwlat_callback(bool enter)
+void trace_hwlat_timestamp(bool enter)
 {
-	if (smp_processor_id() != nmi_cpu)
-		return;
-
 	if (enter)
 		nmi_ts_start = ktime_get_mono_fast_ns();
 	else
 		nmi_total_ts += ktime_get_mono_fast_ns() - nmi_ts_start;
-
-	if (enter)
-		nmi_count++;
 }
 
 /**


^ permalink raw reply	[flat|nested] 159+ messages in thread

* [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
  2020-05-15 23:45 ` [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns() Thomas Gleixner
  2020-05-15 23:45 ` [patch V6 02/37] tracing/hwlat: Split ftrace_nmi_enter/exit() Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-17  5:12   ` Andy Lutomirski
  2020-05-19 22:24   ` Steven Rostedt
  2020-05-15 23:45 ` [patch V6 04/37] x86: Make hardware latency tracing explicit Thomas Gleixner
                   ` (36 subsequent siblings)
  39 siblings, 2 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


To fully isolate #DB and #BP from instrumentable code it's necessary to
avoid invoking the hardware latency tracer on nmi_enter/exit().

Provide nmi_enter/exit() variants which do not invoke the hardware latency
tracer. That allows placing the calls explicitly at the call sites outside
of the kprobe handling.
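
A call site which wants the hardware latency timestamps then looks roughly
like this (the actual placement is done in the next patch):

	nmi_enter_notrace();
	instrumentation_begin();
	ftrace_nmi_handler_enter();
	/* ... the actual NMI/exception work ... */
	ftrace_nmi_handler_exit();
	instrumentation_end();
	nmi_exit_notrace();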

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/hardirq.h |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -76,8 +76,16 @@ extern void irq_exit(void);
 
 /*
  * nmi_enter() can nest up to 15 times; see NMI_BITS.
+ *
+ * ftrace_count_nmi() only increments a counter and is noinstr safe so it
+ * can be invoked in nmi_enter_notrace(). ftrace_nmi_handler_enter/exit()
+ * does time stamping and will be invoked in the actual NMI handling after
+ * an instrumentable section has been reached.
+ *
+ * nmi_enter/exit() still calls into the tracer so existing callers
+ * won't break.
  */
-#define nmi_enter()						\
+#define nmi_enter_notrace()					\
 	do {							\
 		arch_nmi_enter();				\
 		printk_nmi_enter();				\
@@ -87,10 +95,15 @@ extern void irq_exit(void);
 		rcu_nmi_enter();				\
 		lockdep_hardirq_enter();			\
 		ftrace_count_nmi();				\
+	} while (0)
+
+#define nmi_enter()						\
+	do {							\
+		nmi_enter_notrace();				\
 		ftrace_nmi_handler_enter();			\
 	} while (0)
 
-#define nmi_exit()						\
+#define nmi_exit_notrace()					\
 	do {							\
 		ftrace_nmi_handler_exit();			\
 		lockdep_hardirq_exit();				\
@@ -102,4 +115,10 @@ extern void irq_exit(void);
 		arch_nmi_exit();				\
 	} while (0)
 
+#define nmi_exit()						\
+	do {							\
+		ftrace_nmi_handler_exit();			\
+		nmi_exit_notrace();				\
+	} while (0)
+
 #endif /* LINUX_HARDIRQ_H */


^ permalink raw reply	[flat|nested] 159+ messages in thread

* [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (2 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace() Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-17  5:36   ` Andy Lutomirski
  2020-05-18  8:01   ` Peter Zijlstra
  2020-05-15 23:45 ` [patch V6 05/37] genirq: Provide irq_enter/exit_rcu() Thomas Gleixner
                   ` (35 subsequent siblings)
  39 siblings, 2 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


The hardware latency tracer calls into trace_sched_clock and ends up in
various instrumentable functions which is problematic vs. the kprobe
handling, especially the text poke machinery. It's invoked from
nmi_enter/exit(), i.e. non-instrumentable code.

Use nmi_enter/exit_notrace() instead. These variants do not invoke the
hardware latency tracer which avoids chasing down complex callchains to
make them non-instrumentable.

The really interesting measurement is the actual NMI handler. Add an explicit
invocation for the hardware latency tracer to it.

#DB and #BP are uninteresting as they really should not be in use when
analyzing hardware induced latencies.

If #DF hits, hardware latency is definitely not interesting anymore and in
case of a machine check the hardware latency is not the most troublesome
issue either.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kernel/cpu/mce/core.c |    4 ++--
 arch/x86/kernel/nmi.c          |    6 ++++--
 arch/x86/kernel/traps.c        |   12 +++++++-----
 3 files changed, 13 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1916,7 +1916,7 @@ static __always_inline void exc_machine_
 	    mce_check_crashing_cpu())
 		return;
 
-	nmi_enter();
+	nmi_enter_notrace();
 	/*
 	 * The call targets are marked noinstr, but objtool can't figure
 	 * that out because it's an indirect call. Annotate it.
@@ -1924,7 +1924,7 @@ static __always_inline void exc_machine_
 	instrumentation_begin();
 	machine_check_vector(regs);
 	instrumentation_end();
-	nmi_exit();
+	nmi_exit_notrace();
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
 	__this_cpu_write(last_nmi_rip, regs->ip);
 
 	instrumentation_begin();
+	ftrace_nmi_handler_enter();
 
 	handled = nmi_handle(NMI_LOCAL, regs);
 	__this_cpu_add(nmi_stats.normal, handled);
@@ -420,6 +421,7 @@ static noinstr void default_do_nmi(struc
 		unknown_nmi_error(reason, regs);
 
 out:
+	ftrace_nmi_handler_exit();
 	instrumentation_end();
 }
 
@@ -536,14 +538,14 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 	}
 #endif
 
-	nmi_enter();
+	nmi_enter_notrace();
 
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
-	nmi_exit();
+	nmi_exit_notrace();
 
 #ifdef CONFIG_X86_64
 	if (unlikely(this_cpu_read(update_debug_stack))) {
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -387,7 +387,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
-	nmi_enter();
+	nmi_enter_notrace();
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -632,12 +632,14 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		instrumentation_end();
 		idtentry_exit(regs);
 	} else {
-		nmi_enter();
+		nmi_enter_notrace();
 		instrumentation_begin();
+		ftrace_nmi_handler_enter();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
+		ftrace_nmi_handler_exit();
 		instrumentation_end();
-		nmi_exit();
+		nmi_exit_notrace();
 	}
 }
 
@@ -849,7 +851,7 @@ static void noinstr handle_debug(struct
 static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 					     unsigned long dr6)
 {
-	nmi_enter();
+	nmi_enter_notrace();
 	/*
 	 * The SDM says "The processor clears the BTF flag when it
 	 * generates a debug exception."  Clear TIF_BLOCKSTEP to keep
@@ -871,7 +873,7 @@ static __always_inline void exc_debug_ke
 	if (dr6)
 		handle_debug(regs, dr6, false);
 
-	nmi_exit();
+	nmi_exit_notrace();
 }
 
 static __always_inline void exc_debug_user(struct pt_regs *regs,


^ permalink raw reply	[flat|nested] 159+ messages in thread

* [patch V6 05/37] genirq: Provide irq_enter/exit_rcu()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (3 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 04/37] x86: Make hardware latency tracing explicit Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-18 23:06   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 06/37] genirq: Provide __irq_enter/exit_raw() Thomas Gleixner
                   ` (34 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


irq_enter()/exit() include the RCU handling. To properly separate the RCU
handling, provide variants which contain only the non-RCU related
functionality.
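
A handler whose entry code has already established RCU, e.g. via
idtentry_enter(), can then do the interrupt accounting without touching RCU
again. Roughly (the XEN/PV upcall later in this series is the first user;
the body is just a placeholder):

	static void __handle_sysvec(void)
	{
		irq_enter_rcu();
		/* ... device/system vector handling ... */
		irq_exit_rcu();
	}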

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index a4c5a1df067e..f6f25fab34cb 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -43,7 +43,11 @@ extern void rcu_nmi_exit(void);
 /*
  * Enter irq context (on NO_HZ, update jiffies):
  */
-extern void irq_enter(void);
+void irq_enter(void);
+/*
+ * Like irq_enter(), but RCU is already watching.
+ */
+void irq_enter_rcu(void);
 
 /*
  * Exit irq context without processing softirqs:
@@ -58,7 +62,12 @@ extern void irq_enter(void);
 /*
  * Exit irq context and process softirqs if needed:
  */
-extern void irq_exit(void);
+void irq_exit(void);
+
+/*
+ * Like irq_exit(), but return with RCU watching.
+ */
+void irq_exit_rcu(void);
 
 #ifndef arch_nmi_enter
 #define arch_nmi_enter()	do { } while (0)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index a47c6dd57452..beb8e3a66c7c 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -339,12 +339,11 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
-/*
- * Enter an interrupt context.
+/**
+ * irq_enter_rcu - Enter an interrupt context with RCU watching
  */
-void irq_enter(void)
+void irq_enter_rcu(void)
 {
-	rcu_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
 		 * Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -354,10 +353,18 @@ void irq_enter(void)
 		tick_irq_enter();
 		_local_bh_enable();
 	}
-
 	__irq_enter();
 }
 
+/**
+ * irq_enter - Enter an interrupt context including RCU update
+ */
+void irq_enter(void)
+{
+	rcu_irq_enter();
+	irq_enter_rcu();
+}
+
 static inline void invoke_softirq(void)
 {
 	if (ksoftirqd_running(local_softirq_pending()))
@@ -397,10 +404,12 @@ static inline void tick_irq_exit(void)
 #endif
 }
 
-/*
- * Exit an interrupt context. Process softirqs if needed and possible:
+/**
+ * irq_exit_rcu() - Exit an interrupt context without updating RCU
+ *
+ * Also processes softirqs if needed and possible.
  */
-void irq_exit(void)
+void irq_exit_rcu(void)
 {
 #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
 	local_irq_disable();
@@ -413,6 +422,16 @@ void irq_exit(void)
 		invoke_softirq();
 
 	tick_irq_exit();
+}
+
+/**
+ * irq_exit - Exit an interrupt context, update RCU and lockdep
+ *
+ * Also processes softirqs if needed and possible.
+ */
+void irq_exit(void)
+{
+	irq_exit_rcu();
 	rcu_irq_exit();
 	 /* must be last! */
 	lockdep_hardirq_exit();


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 06/37] genirq: Provide __irq_enter/exit_raw()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (4 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 05/37] genirq: Provide irq_enter/exit_rcu() Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-18 23:07   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack Thomas Gleixner
                   ` (33 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)
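
A lightweight handler for which the time accounting would be more expensive
than the actual work, e.g. the reschedule IPI, can then use the raw variants
roughly like this (the handler name is just an example):

	static void handle_resched_ipi(void)
	{
		__irq_enter_raw();
		scheduler_ipi();
		__irq_exit_raw();
	}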


Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index f6f25fab34cb..adfd98b8a468 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -41,6 +41,17 @@ extern void rcu_nmi_exit(void);
 	} while (0)
 
 /*
+ * Like __irq_enter() without time accounting for fast
+ * interrupts, e.g. reschedule IPI where time accounting
+ * is more expensive than the actual interrupt.
+ */
+#define __irq_enter_raw()				\
+	do {						\
+		preempt_count_add(HARDIRQ_OFFSET);	\
+		lockdep_hardirq_enter();		\
+	} while (0)
+
+/*
  * Enter irq context (on NO_HZ, update jiffies):
  */
 void irq_enter(void);
@@ -60,6 +71,15 @@ void irq_enter_rcu(void);
 	} while (0)
 
 /*
+ * Like __irq_exit() without time accounting
+ */
+#define __irq_exit_raw()				\
+	do {						\
+		lockdep_hardirq_exit();			\
+		preempt_count_sub(HARDIRQ_OFFSET);	\
+	} while (0)
+
+/*
  * Exit irq context and process softirqs if needed:
  */
 void irq_exit(void);


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (5 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 06/37] genirq: Provide __irq_enter/exit_raw() Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-18 23:11   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 08/37] x86/entry/64: Move do_softirq_own_stack() to C Thomas Gleixner
                   ` (32 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Device interrupt handlers and system vector handlers are executed on the
interrupt stack. The stack switch happens in the low level assembly entry
code. This conflicts with the efforts to consolidate the exit code in C to
ensure correctness vs. RCU and tracing.

As there is no way to move #DB away from IST due to the MOV SS issue, the
requirements vs. #DB and NMI for switching to the interrupt stack do not
exist anymore. The only requirement is that interrupts are disabled.

That allows moving the stack switching to C code, which simplifies the
entry/exit handling further because stacks can be switched after handling
the entry and on exit before handling RCU, return to user mode and kernel
preemption in the same way as for regular exceptions.

The initial attempt of having the stack switching in inline ASM caused too
much headache vs. objtool and the unwinder. After analysing the use cases
it was agreed that having the stack switch in ASM for the price of an
indirect call is acceptable, as the main users are indirect call heavy
anyway and the few system vectors which are empty shells (scheduler IPI and
KVM posted interrupt vectors) can run from the regular stack.

Provide helper functions to check whether the interrupt stack is already
active and whether stack switching is required.

64 bit only for now. 32 bit has a variant of that already. Once this is
cleaned up the two implementations might be consolidated as a cleanup on
top.
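
The resulting pattern at the C entry points is then roughly (handle_irq_fn
stands in for the actual handler):

	if (irq_needs_irq_stack(regs))
		run_on_irqstack(handle_irq_fn, regs);
	else
		handle_irq_fn(regs);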

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 902691b35e7e..3b8da9f09297 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1106,6 +1106,45 @@ SYM_CODE_START_LOCAL_NOALIGN(.Lbad_gs)
 SYM_CODE_END(.Lbad_gs)
 	.previous
 
+/*
+ * rdi: New stack pointer points to the top word of the stack
+ * rsi: Function pointer
+ * rdx: Function argument (can be NULL if none)
+ */
+SYM_FUNC_START(asm_call_on_stack)
+	/*
+	 * Save the frame pointer unconditionally. This allows the ORC
+	 * unwinder to handle the stack switch.
+	 */
+	pushq		%rbp
+	mov		%rsp, %rbp
+
+	/*
+	 * The unwinder relies on the word at the top of the new stack
+	 * page linking back to the previous RSP.
+	 */
+	mov		%rsp, (%rdi)
+	mov		%rdi, %rsp
+	/* Move the argument to the right place */
+	mov		%rdx, %rdi
+
+1:
+	.pushsection .discard.instr_begin
+	.long 1b - .
+	.popsection
+
+	CALL_NOSPEC	rsi
+
+2:
+	.pushsection .discard.instr_end
+	.long 2b - .
+	.popsection
+
+	/* Restore the previous stack pointer from RBP. */
+	leaveq
+	ret
+SYM_FUNC_END(asm_call_on_stack)
+
 /* Call softirq on interrupt stack. Interrupts are off. */
 .pushsection .text, "ax"
 SYM_FUNC_START(do_softirq_own_stack)
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
new file mode 100644
index 000000000000..d58135152ee8
--- /dev/null
+++ b/arch/x86/include/asm/irq_stack.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IRQ_STACK_H
+#define _ASM_X86_IRQ_STACK_H
+
+#include <linux/ptrace.h>
+
+#include <asm/processor.h>
+
+#ifdef CONFIG_X86_64
+static __always_inline bool irqstack_active(void)
+{
+	return __this_cpu_read(irq_count) != -1;
+}
+
+void asm_call_on_stack(void *sp, void *func, void *arg);
+
+static __always_inline void run_on_irqstack(void *func, void *arg)
+{
+	void *tos = __this_cpu_read(hardirq_stack_ptr);
+
+	lockdep_assert_irqs_disabled();
+
+	__this_cpu_add(irq_count, 1);
+	asm_call_on_stack(tos - 8, func, arg);
+	__this_cpu_sub(irq_count, 1);
+}
+
+#else /* CONFIG_X86_64 */
+static inline bool irqstack_active(void) { return false; }
+static inline void run_on_irqstack(void *func, void *arg) { }
+#endif /* !CONFIG_X86_64 */
+
+static __always_inline bool irq_needs_irq_stack(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_X86_32))
+		return false;
+	return !user_mode(regs) && !irqstack_active();
+}
+
+#endif


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 08/37] x86/entry/64: Move do_softirq_own_stack() to C
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (6 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-18 23:48   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 09/37] x86/entry: Split idtentry_enter/exit() Thomas Gleixner
                   ` (31 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


The first step to get rid of the ENTER/LEAVE_IRQ_STACK ASM macro maze.  Use
the new C code helpers to move do_softirq_own_stack() out of ASM code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3b8da9f09297..bdf8391b2f95 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1145,19 +1145,6 @@ SYM_FUNC_START(asm_call_on_stack)
 	ret
 SYM_FUNC_END(asm_call_on_stack)
 
-/* Call softirq on interrupt stack. Interrupts are off. */
-.pushsection .text, "ax"
-SYM_FUNC_START(do_softirq_own_stack)
-	pushq	%rbp
-	mov	%rsp, %rbp
-	ENTER_IRQ_STACK regs=0 old_rsp=%r11
-	call	__do_softirq
-	LEAVE_IRQ_STACK regs=0
-	leaveq
-	ret
-SYM_FUNC_END(do_softirq_own_stack)
-.popsection
-
 #ifdef CONFIG_XEN_PV
 /*
  * A note on the "critical region" in our callback handler.
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 12df3a4abfdd..62cff52e03c5 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -20,6 +20,7 @@
 #include <linux/sched/task_stack.h>
 
 #include <asm/cpu_entry_area.h>
+#include <asm/irq_stack.h>
 #include <asm/io_apic.h>
 #include <asm/apic.h>
 
@@ -70,3 +71,11 @@ int irq_init_percpu_irqstack(unsigned int cpu)
 		return 0;
 	return map_irq_stack(cpu);
 }
+
+void do_softirq_own_stack(void)
+{
+	if (irqstack_active())
+		__do_softirq();
+	else
+		run_on_irqstack(__do_softirq, NULL);
+}


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 09/37] x86/entry: Split idtentry_enter/exit()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (7 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 08/37] x86/entry/64: Move do_softirq_own_stack() to C Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-18 23:49   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY Thomas Gleixner
                   ` (30 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Split the implementation of idtentry_enter/exit() out into inline functions
so that variants of idtentry_enter/exit() can be implemented without
duplicating code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a1950aa90223..882ada245bd5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -539,22 +539,7 @@ void noinstr idtentry_enter(struct pt_regs *regs)
 	}
 }
 
-/**
- * idtentry_exit - Common code to handle return from exceptions
- * @regs:	Pointer to pt_regs (exception entry regs)
- *
- * Depending on the return target (kernel/user) this runs the necessary
- * preemption and work checks if possible and required and returns to
- * the caller with interrupts disabled and no further work pending.
- *
- * This is the last action before returning to the low level ASM code which
- * just needs to return to the appropriate context.
- *
- * Invoked by all exception/interrupt IDTENTRY handlers which are not
- * returning through the paranoid exit path (all except NMI, #DF and the IST
- * variants of #MC and #DB) and are therefore on the thread stack.
- */
-void noinstr idtentry_exit(struct pt_regs *regs)
+static __always_inline void __idtentry_exit(struct pt_regs *regs)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -609,3 +594,23 @@ void noinstr idtentry_exit(struct pt_regs *regs)
 		rcu_irq_exit();
 	}
 }
+
+/**
+ * idtentry_exit - Common code to handle return from exceptions
+ * @regs:	Pointer to pt_regs (exception entry regs)
+ *
+ * Depending on the return target (kernel/user) this runs the necessary
+ * preemption and work checks if possible and required and returns to
+ * the caller with interrupts disabled and no further work pending.
+ *
+ * This is the last action before returning to the low level ASM code which
+ * just needs to return to the appropriate context.
+ *
+ * Invoked by all exception/interrupt IDTENTRY handlers which are not
+ * returning through the paranoid exit path (all except NMI, #DF and the IST
+ * variants of #MC and #DB) and are therefore on the thread stack.
+ */
+void noinstr idtentry_exit(struct pt_regs *regs)
+{
+	__idtentry_exit(regs);
+}


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (8 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 09/37] x86/entry: Split idtentry_enter/exit() Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-19 17:06   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 11/37] x86/entry/64: Simplify idtentry_body Thomas Gleixner
                   ` (29 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert the XEN/PV hypercall to IDTENTRY:

  - Emit the ASM stub with DECLARE_IDTENTRY
  - Remove the ASM idtentry in 64bit
  - Remove the open coded ASM entry code in 32bit
  - Remove the old prototypes

The handler stubs need to stay in ASM code as they need corner case handling
and adjustment of the stack pointer.

Provide a new C function which invokes the entry/exit handling and calls
into the XEN handler on the interrupt stack.

The exit code is slightly different from the regular idtentry_exit() on
non-preemptible kernels. If the hypercall is preemptible and need_resched()
is set then XEN provides a preempt hypercall scheduling function. Add it as
conditional path to __idtentry_exit() so the function can be reused.

__idtentry_exit() is forced inlined so on the regular idtentry_exit() path
the extra condition is optimized out by the compiler.
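
For reference, the preemptible hypercall window which the exit path checks
for is opened by the calling code, e.g. in the privcmd driver, roughly like
this (the concrete hypercall is just an example):

	xen_preemptible_hcall_begin();
	ret = HYPERVISOR_dm_op(domid, num, bufs);	/* long running hypercall */
	xen_preemptible_hcall_end();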

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 882ada245bd5..34caf3849632 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -27,6 +27,9 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 
+#include <xen/xen-ops.h>
+#include <xen/events.h>
+
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
@@ -35,6 +38,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/io_bitmap.h>
 #include <asm/syscall.h>
+#include <asm/irq_stack.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -539,7 +543,8 @@ void noinstr idtentry_enter(struct pt_regs *regs)
 	}
 }
 
-static __always_inline void __idtentry_exit(struct pt_regs *regs)
+static __always_inline void __idtentry_exit(struct pt_regs *regs,
+					    bool preempt_hcall)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
 				instrumentation_end();
 				return;
 			}
+		} else if (IS_ENABLED(CONFIG_XEN_PV)) {
+			if (preempt_hcall) {
+				/* See CONFIG_PREEMPTION above */
+				instrumentation_begin();
+				rcu_irq_exit_preempt();
+				xen_maybe_preempt_hcall();
+				trace_hardirqs_on();
+				instrumentation_end();
+				return;
+			}
 		}
 		/*
 		 * If preemption is disabled then this needs to be done
@@ -612,5 +627,43 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
  */
 void noinstr idtentry_exit(struct pt_regs *regs)
 {
-	__idtentry_exit(regs);
+	__idtentry_exit(regs, false);
+}
+
+#ifdef CONFIG_XEN_PV
+static void __xen_pv_evtchn_do_upcall(void)
+{
+	irq_enter_rcu();
+	inc_irq_stat(irq_hv_callback_count);
+
+	xen_hvm_evtchn_do_upcall();
+
+	irq_exit_rcu();
+}
+
+__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs;
+
+	idtentry_enter(regs);
+	old_regs = set_irq_regs(regs);
+
+	if (!irq_needs_irq_stack(regs)) {
+		instrumentation_begin();
+		__xen_pv_evtchn_do_upcall();
+		instrumentation_end();
+	} else {
+		run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL);
+	}
+
+	set_irq_regs(old_regs);
+
+	if (IS_ENABLED(CONFIG_PREEMPTION)) {
+		__idtentry_exit(regs, false);
+	} else {
+		bool inhcall = __this_cpu_read(xen_in_preemptible_hcall);
+
+		__idtentry_exit(regs, inhcall && need_resched());
+	}
 }
+#endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7563a87d7539..6ac890d5c9d8 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1298,7 +1298,10 @@ SYM_CODE_END(native_iret)
 #endif
 
 #ifdef CONFIG_XEN_PV
-SYM_FUNC_START(xen_hypervisor_callback)
+/*
+ * See comment in entry_64.S for further explanation
+ */
+SYM_FUNC_START(exc_xen_hypervisor_callback)
 	/*
 	 * Check to see if we got the event in the critical
 	 * region in xen_iret_direct, after we've reenabled
@@ -1315,14 +1318,11 @@ SYM_FUNC_START(xen_hypervisor_callback)
 	pushl	$-1				/* orig_ax = -1 => not a system call */
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
-	TRACE_IRQS_OFF
+
 	mov	%esp, %eax
-	call	xen_evtchn_do_upcall
-#ifndef CONFIG_PREEMPTION
-	call	xen_maybe_preempt_hcall
-#endif
-	jmp	ret_from_intr
-SYM_FUNC_END(xen_hypervisor_callback)
+	call	xen_pv_evtchn_do_upcall
+	jmp	handle_exception_return
+SYM_FUNC_END(exc_xen_hypervisor_callback)
 
 /*
  * Hypervisor uses this for application faults while it executes.
@@ -1464,6 +1464,7 @@ SYM_CODE_START_LOCAL_NOALIGN(handle_exception)
 	movl	%esp, %eax			# pt_regs pointer
 	CALL_NOSPEC edi
 
+handle_exception_return:
 #ifdef CONFIG_VM86
 	movl	PT_EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb	PT_CS(%esp), %al
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index bdf8391b2f95..3eddf7c6c530 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1067,10 +1067,6 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 
 idtentry	X86_TRAP_PF		page_fault		do_page_fault			has_error_code=1
 
-#ifdef CONFIG_XEN_PV
-idtentry	512 /* dummy */		hypervisor_callback	xen_do_hypervisor_callback	has_error_code=0
-#endif
-
 /*
  * Reload gs selector with exception handling
  * edi:  new selector
@@ -1158,9 +1154,10 @@ SYM_FUNC_END(asm_call_on_stack)
  * So, on entry to the handler we detect whether we interrupted an
  * existing activation in its critical region -- if so, we pop the current
  * activation and restart the handler using the previous one.
+ *
+ * C calling convention: exc_xen_hypervisor_callback(struct *pt_regs)
  */
-/* do_hypervisor_callback(struct *pt_regs) */
-SYM_CODE_START_LOCAL(xen_do_hypervisor_callback)
+SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
 
 /*
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
@@ -1170,15 +1167,10 @@ SYM_CODE_START_LOCAL(xen_do_hypervisor_callback)
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
 	UNWIND_HINT_REGS
 
-	ENTER_IRQ_STACK old_rsp=%r10
-	call	xen_evtchn_do_upcall
-	LEAVE_IRQ_STACK
+	call	xen_pv_evtchn_do_upcall
 
-#ifndef CONFIG_PREEMPTION
-	call	xen_maybe_preempt_hcall
-#endif
-	jmp	error_exit
-SYM_CODE_END(xen_do_hypervisor_callback)
+	jmp	error_return
+SYM_CODE_END(exc_xen_hypervisor_callback)
 
 /*
  * Hypervisor uses this for application faults while it executes.
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index e27731092999..fac73bb3577f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -332,6 +332,13 @@ __visible noinstr void func(struct pt_regs *regs,			\
  * This avoids duplicate defines and ensures that everything is consistent.
  */
 
+/*
+ * Dummy trap number so the low level ASM macro vector number checks do not
+ * match which results in emitting plain IDTENTRY stubs without bells and
+ * whistles.
+ */
+#define X86_TRAP_OTHER		0xFFFF
+
 /* Simple exception entry points. No hardware error code */
 DECLARE_IDTENTRY(X86_TRAP_DE,		exc_divide_error);
 DECLARE_IDTENTRY(X86_TRAP_OF,		exc_overflow);
@@ -373,4 +380,10 @@ DECLARE_IDTENTRY_XEN(X86_TRAP_DB,	debug);
 DECLARE_IDTENTRY_DF(X86_TRAP_DF,	exc_double_fault);
 #endif
 
+#ifdef CONFIG_XEN_PV
+DECLARE_IDTENTRY(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
+#endif
+
+#undef X86_TRAP_OTHER
+
 #endif
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 1a2d8a50dac4..3566e37241d7 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -20,6 +20,7 @@
 #include <asm/setup.h>
 #include <asm/acpi.h>
 #include <asm/numa.h>
+#include <asm/idtentry.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
 
@@ -993,7 +994,8 @@ static void __init xen_pvmmu_arch_setup(void)
 	HYPERVISOR_vm_assist(VMASST_CMD_enable,
 			     VMASST_TYPE_pae_extended_cr3);
 
-	if (register_callback(CALLBACKTYPE_event, xen_hypervisor_callback) ||
+	if (register_callback(CALLBACKTYPE_event,
+			      xen_asm_exc_xen_hypervisor_callback) ||
 	    register_callback(CALLBACKTYPE_failsafe, xen_failsafe_callback))
 		BUG();
 
diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c
index 8fb8a50a28b4..a92259d701c1 100644
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -27,6 +27,7 @@
 #include <asm/paravirt.h>
 #include <asm/desc.h>
 #include <asm/pgtable.h>
+#include <asm/idtentry.h>
 #include <asm/cpu.h>
 
 #include <xen/interface/xen.h>
@@ -347,7 +348,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
 	ctxt->gs_base_kernel = per_cpu_offset(cpu);
 #endif
 	ctxt->event_callback_eip    =
-		(unsigned long)xen_hypervisor_callback;
+		(unsigned long)xen_asm_exc_xen_hypervisor_callback;
 	ctxt->failsafe_callback_eip =
 		(unsigned long)xen_failsafe_callback;
 	per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
diff --git a/arch/x86/xen/xen-asm_32.S b/arch/x86/xen/xen-asm_32.S
index bd06ac473170..0f7ff3088065 100644
--- a/arch/x86/xen/xen-asm_32.S
+++ b/arch/x86/xen/xen-asm_32.S
@@ -93,7 +93,7 @@ xen_iret_start_crit:
 
 	/*
 	 * If there's something pending, mask events again so we can
-	 * jump back into xen_hypervisor_callback. Otherwise do not
+	 * jump back into exc_xen_hypervisor_callback. Otherwise do not
 	 * touch XEN_vcpu_info_mask.
 	 */
 	jne 1f
@@ -113,7 +113,7 @@ iret_restore_end:
 	 * Events are masked, so jumping out of the critical region is
 	 * OK.
 	 */
-	je xen_hypervisor_callback
+	je asm_exc_xen_hypervisor_callback
 
 1:	iret
 xen_iret_end_crit:
@@ -127,7 +127,7 @@ SYM_CODE_END(xen_iret)
 	.globl xen_iret_start_crit, xen_iret_end_crit
 
 /*
- * This is called by xen_hypervisor_callback in entry_32.S when it sees
+ * This is called by exc_xen_hypervisor_callback in entry_32.S when it sees
  * that the EIP at the time of interrupt was between
  * xen_iret_start_crit and xen_iret_end_crit.
  *
@@ -144,7 +144,7 @@ SYM_CODE_END(xen_iret)
  *	 eflags		}
  *	 cs		}  nested exception info
  *	 eip		}
- *	 return address	: (into xen_hypervisor_callback)
+ *	 return address	: (into asm_exc_xen_hypervisor_callback)
  *
  * In order to deliver the nested exception properly, we need to discard the
  * nested exception frame such that when we handle the exception, we do it
@@ -152,7 +152,8 @@ SYM_CODE_END(xen_iret)
  *
  * The only caveat is that if the outer eax hasn't been restored yet (i.e.
  * it's still on stack), we need to restore its value here.
- */
+*/
+.pushsection .noinstr.text, "ax"
 SYM_CODE_START(xen_iret_crit_fixup)
 	/*
 	 * Paranoia: Make sure we're really coming from kernel space.
@@ -181,3 +182,4 @@ SYM_CODE_START(xen_iret_crit_fixup)
 2:
 	ret
 SYM_CODE_END(xen_iret_crit_fixup)
+.popsection
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index e46d863bcaa4..19fbbdbcbde9 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -54,7 +54,7 @@ xen_pv_trap asm_exc_simd_coprocessor_error
 #ifdef CONFIG_IA32_EMULATION
 xen_pv_trap entry_INT80_compat
 #endif
-xen_pv_trap hypervisor_callback
+xen_pv_trap asm_exc_xen_hypervisor_callback
 
 	__INIT
 SYM_CODE_START(xen_early_idt_handler_array)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 45a441c33d6d..4eff29ed375e 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -8,7 +8,6 @@
 #include <xen/xen-ops.h>
 
 /* These are code, but not functions.  Defined in entry.S */
-extern const char xen_hypervisor_callback[];
 extern const char xen_failsafe_callback[];
 
 void xen_sysenter_target(void);
diff --git a/drivers/xen/preempt.c b/drivers/xen/preempt.c
index 17240c5325a3..287171a9dc01 100644
--- a/drivers/xen/preempt.c
+++ b/drivers/xen/preempt.c
@@ -24,7 +24,7 @@
 DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
 EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
 
-asmlinkage __visible void xen_maybe_preempt_hcall(void)
+void xen_maybe_preempt_hcall(void)
 {
 	if (unlikely(__this_cpu_read(xen_in_preemptible_hcall)
 		     && need_resched())) {
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 095be1d66f31..0ed975cf6f79 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -214,6 +214,7 @@ bool xen_running_on_version_or_later(unsigned int major, unsigned int minor);
 
 void xen_efi_runtime_setup(void);
 
+DECLARE_PER_CPU(bool, xen_in_preemptible_hcall);
 
 #ifdef CONFIG_PREEMPTION
 
@@ -225,9 +226,9 @@ static inline void xen_preemptible_hcall_end(void)
 {
 }
 
-#else
+static inline void xen_maybe_preempt_hcall(void) { }
 
-DECLARE_PER_CPU(bool, xen_in_preemptible_hcall);
+#else
 
 static inline void xen_preemptible_hcall_begin(void)
 {
@@ -239,6 +240,8 @@ static inline void xen_preemptible_hcall_end(void)
 	__this_cpu_write(xen_in_preemptible_hcall, false);
 }
 
+void xen_maybe_preempt_hcall(void);
+
 #endif /* CONFIG_PREEMPTION */
 
 #endif /* INCLUDE_XEN_OPS_H */


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 11/37] x86/entry/64: Simplify idtentry_body
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (9 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-19 17:06   ` Andy Lutomirski
  2020-05-15 23:45 ` [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu() Thomas Gleixner
                   ` (28 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


All C functions which do not have an error code have been converted to the
new IDTENTRY interface which does not expect an error code in the
arguments. Spare the XORL.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3eddf7c6c530..1d700bde232b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -531,8 +531,6 @@ SYM_CODE_END(spurious_entries_start)
 	.if \has_error_code == 1
 		movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
 		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
-	.else
-		xorl	%esi, %esi		/* Clear the error code */
 	.endif
 
 	.if \vector == X86_TRAP_PF


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (10 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 11/37] x86/entry/64: Simplify idtentry_body Thomas Gleixner
@ 2020-05-15 23:45 ` Thomas Gleixner
  2020-05-19 17:08   ` Andy Lutomirski
  2020-05-27  8:12   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  2020-05-15 23:46 ` [patch V6 13/37] x86/entry: Switch page fault exception to IDTENTRY_RAW Thomas Gleixner
                   ` (27 subsequent siblings)
  39 siblings, 2 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:45 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


The pagefault handler cannot use the regular idtentry_enter() because that
invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
problem per se, but kernel side page faults can schedule which is not
possible without invoking rcu_irq_exit().

Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
pagefault handling code would be possible, but not pretty either.

Provide idtentry_enter/exit_cond_rcu() which calls rcu_irq_enter() only
when RCU is not watching. The conditional RCU enabling is not a correctness
issue: a kernel page fault which hits a RCU idle region can neither
schedule nor is it likely to survive anyway. But avoiding RCU warnings or
RCU side effects at least increases the chance of useful debug output.

The function is also useful for implementing lightweight reschedule IPI and
KVM posted interrupt IPI entry handling later.
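
As an illustration only (not part of this patch), a raw idtentry handler
built on these helpers is expected to pair the calls and feed the entry's
return value into the exit; the handler name below is a made-up
placeholder:

DEFINE_IDTENTRY_RAW(exc_example_fault)
{
	bool rcu_exit = idtentry_enter_cond_rcu(regs);

	instrumentation_begin();
	/* handler body; may sleep when the entry came from kernel mode */
	instrumentation_end();

	idtentry_exit_cond_rcu(regs, rcu_exit);
}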

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 34caf3849632..72588f1a45a2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -515,6 +515,36 @@ SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+static __always_inline bool __idtentry_enter(struct pt_regs *regs,
+					     bool cond_rcu)
+{
+	if (user_mode(regs)) {
+		enter_from_user_mode();
+	} else {
+		if (!cond_rcu || !__rcu_is_watching()) {
+			/*
+			 * If RCU is not watching then the same careful
+			 * sequence vs. lockdep and tracing is required.
+			 */
+			lockdep_hardirqs_off(CALLER_ADDR0);
+			rcu_irq_enter();
+			instrumentation_begin();
+			trace_hardirqs_off_prepare();
+			instrumentation_end();
+			return true;
+		} else {
+			/*
+			 * If RCU is watching then the combo function
+			 * can be used.
+			 */
+			instrumentation_begin();
+			trace_hardirqs_off();
+			instrumentation_end();
+		}
+	}
+	return false;
+}
+
 /**
  * idtentry_enter - Handle state tracking on idtentry
  * @regs:	Pointer to pt_regs of interrupted context
@@ -532,19 +562,60 @@ SYSCALL_DEFINE0(ni_syscall)
  */
 void noinstr idtentry_enter(struct pt_regs *regs)
 {
-	if (user_mode(regs)) {
-		enter_from_user_mode();
-	} else {
-		lockdep_hardirqs_off(CALLER_ADDR0);
-		rcu_irq_enter();
-		instrumentation_begin();
-		trace_hardirqs_off_prepare();
-		instrumentation_end();
-	}
+	__idtentry_enter(regs, false);
+}
+
+/**
+ * idtentry_enter_cond_rcu - Handle state tracking on idtentry with conditional
+ *			     RCU handling
+ * @regs:	Pointer to pt_regs of interrupted context
+ *
+ * Invokes:
+ *  - lockdep irqflag state tracking as low level ASM entry disabled
+ *    interrupts.
+ *
+ *  - Context tracking if the exception hit user mode.
+ *
+ *  - The hardirq tracer to keep the state consistent as low level ASM
+ *    entry disabled interrupts.
+ *
+ * For kernel mode entries the conditional RCU handling is useful for two
+ * purposes
+ *
+ * 1) Pagefaults: Kernel code can fault and sleep, e.g. on exec. This code
+ *    is not in an RCU idle section. If rcu_irq_enter() would be invoked
+ *    then nothing would invoke rcu_irq_exit() before scheduling.
+ *
+ *   If the kernel faults in a RCU idle section then all bets are off
+ *   anyway but at least avoiding a subsequent issue vs. RCU is helpful for
+ *   debugging.
+ *
+ * 2) Scheduler IPI: To avoid the overhead of a regular idtentry vs. RCU
+ *    and irq_enter() the IPI can be made lightweight if the tracepoints
+ *    are not enabled. While the IPI functionality itself does not require
+ *    RCU (folding preempt count) it still calls out into instrumentable
+ *    functions, e.g. ack_APIC_irq(). The scheduler IPI can hit RCU idle
+ *    sections, so RCU needs to be adjusted. For the fast path case, e.g.
+ *    KVM kicking a vCPU out of guest mode this can be avoided because the
+ *    IPI is handled after KVM reestablished kernel context including RCU.
+ *
+ * For user mode entries enter_from_user_mode() must be invoked to
+ * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
+ * would not be possible.
+ *
+ * Returns: True if RCU has been adjusted on a kernel entry
+ *	    False otherwise
+ *
+ * The return value must be fed into the rcu_exit argument of
+ * idtentry_exit_cond_rcu().
+ */
+bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
+{
+	return __idtentry_enter(regs, true);
 }
 
 static __always_inline void __idtentry_exit(struct pt_regs *regs,
-					    bool preempt_hcall)
+					    bool preempt_hcall, bool rcu_exit)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -570,7 +641,12 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs,
 				if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
 					WARN_ON_ONCE(!on_thread_stack());
 				instrumentation_begin();
-				rcu_irq_exit_preempt();
+				/*
+				 * Conditional for idtentry_exit_cond_rcu(),
+				 * unconditional for all other users.
+				 */
+				if (rcu_exit)
+					rcu_irq_exit_preempt();
 				if (need_resched())
 					preempt_schedule_irq();
 				/* Covers both tracing and lockdep */
@@ -602,11 +678,22 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs,
 		trace_hardirqs_on_prepare();
 		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 		instrumentation_end();
-		rcu_irq_exit();
+		/*
+		 * Conditional for idtentry_exit_cond_rcu(), unconditional
+		 * for all other users.
+		 */
+		if (rcu_exit)
+			rcu_irq_exit();
 		lockdep_hardirqs_on(CALLER_ADDR0);
 	} else {
-		/* IRQ flags state is correct already. Just tell RCU */
-		rcu_irq_exit();
+		/*
+		 * IRQ flags state is correct already. Just tell RCU.
+		 *
+		 * Conditional for idtentry_exit_cond_rcu(), unconditional
+		 * for all other users.
+		 */
+		if (rcu_exit)
+			rcu_irq_exit();
 	}
 }
 
@@ -627,7 +714,28 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs,
  */
 void noinstr idtentry_exit(struct pt_regs *regs)
 {
-	__idtentry_exit(regs, false);
+	__idtentry_exit(regs, false, true);
+}
+
+/**
+ * idtentry_exit_cond_rcu - Handle return from exception with conditional RCU
+ *			    handling
+ * @regs:	Pointer to pt_regs (exception entry regs)
+ * @rcu_exit:	Invoke rcu_irq_exit() if true
+ *
+ * Depending on the return target (kernel/user) this runs the necessary
+ * preemption and work checks if possible and required and returns to
+ * the caller with interrupts disabled and no further work pending.
+ *
+ * This is the last action before returning to the low level ASM code which
+ * just needs to return to the appropriate context.
+ *
+ * Counterpart to idtentry_enter_cond_rcu(). The return value of the entry
+ * function must be fed into the @rcu_exit argument.
+ */
+void noinstr idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit)
+{
+	__idtentry_exit(regs, false, rcu_exit);
 }
 
 #ifdef CONFIG_XEN_PV
@@ -659,11 +767,11 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 	set_irq_regs(old_regs);
 
 	if (IS_ENABLED(CONFIG_PREEMPTION)) {
-		__idtentry_exit(regs, false);
+		__idtentry_exit(regs, false, true);
 	} else {
 		bool inhcall = __this_cpu_read(xen_in_preemptible_hcall);
 
-		__idtentry_exit(regs, inhcall && need_resched());
+		__idtentry_exit(regs, inhcall && need_resched(), true);
 	}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index fac73bb3577f..ccd572fd6583 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -10,6 +10,9 @@
 void idtentry_enter(struct pt_regs *regs);
 void idtentry_exit(struct pt_regs *regs);
 
+bool idtentry_enter_cond_rcu(struct pt_regs *regs);
+void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware



* [patch V6 13/37] x86/entry: Switch page fault exception to IDTENTRY_RAW
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (11 preceding siblings ...)
  2020-05-15 23:45 ` [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu() Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:12   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 14/37] x86/entry: Remove the transition leftovers Thomas Gleixner
                   ` (26 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert page fault exceptions to IDTENTRY_RAW:
  - Implement the C entry point with DEFINE_IDTENTRY_RAW
  - Add the CR2 read into the exception handler
  - Add the idtentry_enter/exit_cond_rcu() invocations in
    in the regular page fault handler and use the regular
    idtentry_enter/exit() for the async PF part.
  - Emit the ASM stub with DECLARE_IDTENTRY_RAW
  - Remove the ASM idtentry in 64bit
  - Remove the CR2 read from 64bit
  - Remove the open coded ASM entry code in 32bit
  - Fixup the XEN/PV code
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6ac890d5c9d8..3c3ca6fbe58e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1395,36 +1395,6 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, HYPERV_STIMER0_VECTOR,
 
 #endif /* CONFIG_HYPERV */
 
-SYM_CODE_START(page_fault)
-	ASM_CLAC
-	pushl	$do_page_fault
-	jmp	common_exception_read_cr2
-SYM_CODE_END(page_fault)
-
-SYM_CODE_START_LOCAL_NOALIGN(common_exception_read_cr2)
-	/* the function address is in %gs's slot on the stack */
-	SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
-
-	ENCODE_FRAME_POINTER
-
-	/* fixup %gs */
-	GS_TO_REG %ecx
-	movl	PT_GS(%esp), %edi
-	REG_TO_PTGS %ecx
-	SET_KERNEL_GS %ecx
-
-	GET_CR2_INTO(%ecx)			# might clobber %eax
-
-	/* fixup orig %eax */
-	movl	PT_ORIG_EAX(%esp), %edx		# get the error code
-	movl	$-1, PT_ORIG_EAX(%esp)		# no syscall to restart
-
-	TRACE_IRQS_OFF
-	movl	%esp, %eax			# pt_regs pointer
-	CALL_NOSPEC edi
-	jmp	ret_from_exception
-SYM_CODE_END(common_exception_read_cr2)
-
 SYM_CODE_START_LOCAL_NOALIGN(common_exception)
 	/* the function address is in %gs's slot on the stack */
 	SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1d700bde232b..e061c48d0ae2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -506,15 +506,6 @@ SYM_CODE_END(spurious_entries_start)
 	call	error_entry
 	UNWIND_HINT_REGS
 
-	.if \vector == X86_TRAP_PF
-		/*
-		 * Store CR2 early so subsequent faults cannot clobber it. Use R12 as
-		 * intermediate storage as RDX can be clobbered in enter_from_user_mode().
-		 * GET_CR2_INTO can clobber RAX.
-		 */
-		GET_CR2_INTO(%r12);
-	.endif
-
 	.if \sane == 0
 	TRACE_IRQS_OFF
 
@@ -533,10 +524,6 @@ SYM_CODE_END(spurious_entries_start)
 		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	.endif
 
-	.if \vector == X86_TRAP_PF
-		movq	%r12, %rdx		/* Move CR2 into 3rd argument */
-	.endif
-
 	call	\cfunc
 
 	.if \sane == 0
@@ -1060,12 +1047,6 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 #endif
 
 /*
- * Exception entry points.
- */
-
-idtentry	X86_TRAP_PF		page_fault		do_page_fault			has_error_code=1
-
-/*
  * Reload gs selector with exception handling
  * edi:  new selector
  *
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index ccd572fd6583..4e4975df7ce0 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -364,7 +364,8 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
 /* Raw exception entries which need extra work */
-DECLARE_IDTENTRY_RAW(X86_TRAP_BP,	exc_int3);
+DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
+DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF,	exc_page_fault);
 
 #ifdef CONFIG_X86_MCE
 DECLARE_IDTENTRY_MCE(X86_TRAP_MC,	exc_machine_check);
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index f5a2e438a878..d7de360eec74 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -9,17 +9,6 @@
 #include <asm/idtentry.h>
 #include <asm/siginfo.h>			/* TRAP_TRACE, ... */
 
-#define dotraplinkage __visible
-
-asmlinkage void page_fault(void);
-asmlinkage void async_page_fault(void);
-
-#if defined(CONFIG_X86_64) && defined(CONFIG_XEN_PV)
-asmlinkage void xen_page_fault(void);
-#endif
-
-dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
-
 #ifdef CONFIG_X86_64
 asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
 asmlinkage __visible notrace
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df66e9dc2087..b4bd866568ee 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -59,7 +59,7 @@ static const __initconst struct idt_data early_idts[] = {
 	INTG(X86_TRAP_DB,		asm_exc_debug),
 	SYSG(X86_TRAP_BP,		asm_exc_int3),
 #ifdef CONFIG_X86_32
-	INTG(X86_TRAP_PF,		page_fault),
+	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
 };
 
@@ -153,7 +153,7 @@ static const __initconst struct idt_data apic_idts[] = {
  * stacks work only after cpu_init().
  */
 static const __initconst struct idt_data early_pf_idts[] = {
-	INTG(X86_TRAP_PF,		page_fault),
+	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 };
 
 /*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b3d9b0d7a37d..3075127ad300 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -218,7 +218,7 @@ void kvm_async_pf_task_wake(u32 token)
 }
 EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);
 
-u32 kvm_read_and_reset_pf_reason(void)
+u32 noinstr kvm_read_and_reset_pf_reason(void)
 {
 	u32 reason = 0;
 
@@ -230,9 +230,8 @@ u32 kvm_read_and_reset_pf_reason(void)
 	return reason;
 }
 EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
-NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
 
-bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
+noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
 	u32 reason = kvm_read_and_reset_pf_reason();
 
@@ -244,6 +243,9 @@ bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 		return false;
 	}
 
+	idtentry_enter(regs);
+	instrumentation_begin();
+
 	/*
 	 * If the host managed to inject an async #PF into an interrupt
 	 * disabled region, then die hard as this is not going to end well
@@ -258,13 +260,13 @@ bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 		/* Page is swapped out by the host. */
 		kvm_async_pf_task_wait_schedule(token);
 	} else {
-		rcu_irq_enter();
 		kvm_async_pf_task_wake(token);
-		rcu_irq_exit();
 	}
+
+	instrumentation_end();
+	idtentry_exit(regs);
 	return true;
 }
-NOKPROBE_SYMBOL(__kvm_handle_async_pf);
 
 static void __init paravirt_ops_setup(void)
 {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7c3ac7f251b4..9c57fb89a461 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1521,11 +1521,38 @@ trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,
 		trace_page_fault_kernel(address, regs, error_code);
 }
 
-dotraplinkage void
-do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
-		unsigned long address)
+static __always_inline void
+handle_page_fault(struct pt_regs *regs, unsigned long error_code,
+			      unsigned long address)
+{
+	trace_page_fault_entries(regs, error_code, address);
+
+	if (unlikely(kmmio_fault(regs, address)))
+		return;
+
+	/* Was the fault on kernel-controlled part of the address space? */
+	if (unlikely(fault_in_kernel_space(address))) {
+		do_kern_addr_fault(regs, error_code, address);
+	} else {
+		do_user_addr_fault(regs, error_code, address);
+		/*
+		 * User address page fault handling might have reenabled
+		 * interrupts. Fixing up all potential exit points of
+		 * do_user_addr_fault() and its leaf functions is just not
+		 * doable w/o creating an unholy mess or turning the code
+		 * upside down.
+		 */
+		local_irq_disable();
+	}
+}
+
+DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
+	unsigned long address = read_cr2();
+	bool rcu_exit;
+
 	prefetchw(&current->mm->mmap_sem);
+
 	/*
 	 * KVM has two types of events that are, logically, interrupts, but
 	 * are unfortunately delivered using the #PF vector.  These events are
@@ -1540,28 +1567,28 @@ do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	 * getting values from real and async page faults mixed up.
 	 *
 	 * Fingers crossed.
+	 *
+	 * The async #PF handling code takes care of idtentry handling
+	 * itself.
 	 */
 	if (kvm_handle_async_pf(regs, (u32)address))
 		return;
 
-	trace_page_fault_entries(regs, hw_error_code, address);
+	/*
+	 * Entry handling for valid #PF from kernel mode is slightly
+	 * different: RCU is already watching and rcu_irq_enter() must not
+	 * be invoked because a kernel fault on a user space address might
+	 * sleep.
+	 *
+	 * In case the fault hit a RCU idle region the conditional entry
+	 * code re-enables RCU to avoid subsequent wreckage, which helps
+	 * debuggability.
+	 */
+	rcu_exit = idtentry_enter_cond_rcu(regs);
 
-	if (unlikely(kmmio_fault(regs, address)))
-		return;
+	instrumentation_begin();
+	handle_page_fault(regs, error_code, address);
+	instrumentation_end();
 
-	/* Was the fault on kernel-controlled part of the address space? */
-	if (unlikely(fault_in_kernel_space(address))) {
-		do_kern_addr_fault(regs, hw_error_code, address);
-	} else {
-		do_user_addr_fault(regs, hw_error_code, address);
-		/*
-		 * User address page fault handling might have reenabled
-		 * interrupts. Fixing up all potential exit points of
-		 * do_user_addr_fault() and its leaf functions is just not
-		 * doable w/o creating an unholy mess or turning the code
-		 * upside down.
-		 */
-		local_irq_disable();
-	}
+	idtentry_exit_cond_rcu(regs, rcu_exit);
 }
-NOKPROBE_SYMBOL(do_page_fault);
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 851ea410085b..35321f4d49f1 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -627,7 +627,7 @@ static struct trap_array_entry trap_array[] = {
 #ifdef CONFIG_IA32_EMULATION
 	{ entry_INT80_compat,          xen_entry_INT80_compat,          false },
 #endif
-	{ page_fault,                  xen_page_fault,                  false },
+	TRAP_ENTRY(exc_page_fault,			false ),
 	TRAP_ENTRY(exc_divide_error,			false ),
 	TRAP_ENTRY(exc_bounds,				false ),
 	TRAP_ENTRY(exc_invalid_op,			false ),
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 19fbbdbcbde9..5d252aaeade8 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -43,7 +43,7 @@ xen_pv_trap asm_exc_invalid_tss
 xen_pv_trap asm_exc_segment_not_present
 xen_pv_trap asm_exc_stack_segment
 xen_pv_trap asm_exc_general_protection
-xen_pv_trap page_fault
+xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check



* [patch V6 14/37] x86/entry: Remove the transition leftovers
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (12 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 13/37] x86/entry: Switch page fault exception to IDTENTRY_RAW Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:13   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 15/37] x86/entry: Change exit path of xen_failsafe_callback Thomas Gleixner
                   ` (25 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Now that all exceptions are converted over, the sane flag is no longer
needed. Also the vector argument of idtentry_body on 64 bit is pointless
now.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 3c3ca6fbe58e..dd8064ffdf12 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -734,9 +734,8 @@
  * @asmsym:		ASM symbol for the entry point
  * @cfunc:		C function to be called
  * @has_error_code:	Hardware pushed error code on stack
- * @sane:		Compatibility flag with 64bit
  */
-.macro idtentry vector asmsym cfunc has_error_code:req sane=0
+.macro idtentry vector asmsym cfunc has_error_code:req
 SYM_CODE_START(\asmsym)
 	ASM_CLAC
 	cld
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e061c48d0ae2..1157d63b3682 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -496,27 +496,14 @@ SYM_CODE_END(spurious_entries_start)
 
 /**
  * idtentry_body - Macro to emit code calling the C function
- * @vector:		Vector number
  * @cfunc:		C function to be called
  * @has_error_code:	Hardware pushed error code on stack
- * @sane:		Sane variant which handles irq tracing, context tracking in C
  */
-.macro idtentry_body vector cfunc has_error_code:req sane=0
+.macro idtentry_body cfunc has_error_code:req
 
 	call	error_entry
 	UNWIND_HINT_REGS
 
-	.if \sane == 0
-	TRACE_IRQS_OFF
-
-#ifdef CONFIG_CONTEXT_TRACKING
-	testb	$3, CS(%rsp)
-	jz	.Lfrom_kernel_no_ctxt_tracking_\@
-	CALL_enter_from_user_mode
-.Lfrom_kernel_no_ctxt_tracking_\@:
-#endif
-	.endif
-
 	movq	%rsp, %rdi			/* pt_regs pointer into 1st argument*/
 
 	.if \has_error_code == 1
@@ -526,11 +513,7 @@ SYM_CODE_END(spurious_entries_start)
 
 	call	\cfunc
 
-	.if \sane == 0
-	jmp	error_exit
-	.else
 	jmp	error_return
-	.endif
 .endm
 
 /**
@@ -539,12 +522,11 @@ SYM_CODE_END(spurious_entries_start)
  * @asmsym:		ASM symbol for the entry point
  * @cfunc:		C function to be called
  * @has_error_code:	Hardware pushed error code on stack
- * @sane:		Sane variant which handles irq tracing, context tracking in C
  *
  * The macro emits code to set up the kernel context for straight forward
  * and simple IDT entries. No IST stack, no paranoid entry checks.
  */
-.macro idtentry vector asmsym cfunc has_error_code:req sane=0
+.macro idtentry vector asmsym cfunc has_error_code:req
 SYM_CODE_START(\asmsym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 	ASM_CLAC
@@ -567,7 +549,7 @@ SYM_CODE_START(\asmsym)
 .Lfrom_usermode_no_gap_\@:
 	.endif
 
-	idtentry_body \vector \cfunc \has_error_code \sane
+	idtentry_body \cfunc \has_error_code
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -642,7 +624,7 @@ SYM_CODE_START(\asmsym)
 
 	/* Switch to the regular task stack and use the noist entry point */
 .Lfrom_usermode_switch_stack_\@:
-	idtentry_body vector noist_\cfunc, has_error_code=0 sane=1
+	idtentry_body noist_\cfunc, has_error_code=0
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4e4975df7ce0..3c239db60d7b 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -281,10 +281,10 @@ __visible noinstr void func(struct pt_regs *regs,			\
  * The ASM variants for DECLARE_IDTENTRY*() which emit the ASM entry stubs.
  */
 #define DECLARE_IDTENTRY(vector, func)					\
-	idtentry vector asm_##func func has_error_code=0 sane=1
+	idtentry vector asm_##func func has_error_code=0
 
 #define DECLARE_IDTENTRY_ERRORCODE(vector, func)			\
-	idtentry vector asm_##func func has_error_code=1 sane=1
+	idtentry vector asm_##func func has_error_code=1
 
 /* Special case for 32bit IRET 'trap'. Do not emit ASM code */
 #define DECLARE_IDTENTRY_SW(vector, func)
@@ -322,7 +322,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 
 /* XEN NMI and DB wrapper */
 #define DECLARE_IDTENTRY_XEN(vector, func)				\
-	idtentry vector asm_exc_xen##func exc_##func has_error_code=0 sane=1
+	idtentry vector asm_exc_xen##func exc_##func has_error_code=0
 
 #endif /* __ASSEMBLY__ */
 



* [patch V6 15/37] x86/entry: Change exit path of xen_failsafe_callback
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (13 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 14/37] x86/entry: Remove the transition leftovers Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:14   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 16/37] x86/entry/64: Remove error_exit Thomas Gleixner
                   ` (24 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


xen_failsafe_callback is invoked from XEN for two cases:

  1. Fault while reloading DS, ES, FS or GS
  2. Fault while executing IRET

#1 retries the IRET after XEN has fixed up the segments.
#2 injects a #GP which kills the task.

For #1 there is no reason to go through the full exception return path
because the task's TIF state is still the same. So just going straight to
the IRET path is good enough.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dd8064ffdf12..65fd41fe77d9 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1352,7 +1352,7 @@ SYM_FUNC_START(xen_failsafe_callback)
 5:	pushl	$-1				/* orig_ax = -1 => not a system call */
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
-	jmp	ret_from_exception
+	jmp	handle_exception_return
 
 .section .fixup, "ax"
 6:	xorl	%eax, %eax
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1157d63b3682..55f612032793 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1175,7 +1175,7 @@ SYM_CODE_START(xen_failsafe_callback)
 	pushq	$-1 /* orig_ax = -1 => not a system call */
 	PUSH_AND_CLEAR_REGS
 	ENCODE_FRAME_POINTER
-	jmp	error_exit
+	jmp	error_return
 SYM_CODE_END(xen_failsafe_callback)
 #endif /* CONFIG_XEN_PV */
 



* [patch V6 16/37] x86/entry/64: Remove error_exit
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (14 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 15/37] x86/entry: Change exit path of xen_failsafe_callback Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:14   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 17/37] x86/entry/32: Remove common_exception Thomas Gleixner
                   ` (23 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 55f612032793..e908f1440f36 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1356,15 +1356,6 @@ SYM_CODE_START_LOCAL(error_entry)
 	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
 
-SYM_CODE_START_LOCAL(error_exit)
-	UNWIND_HINT_REGS
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-	testb	$3, CS(%rsp)
-	jz	retint_kernel
-	jmp	.Lretint_user
-SYM_CODE_END(error_exit)
-
 SYM_CODE_START_LOCAL(error_return)
 	UNWIND_HINT_REGS
 	DEBUG_ENTRY_ASSERT_IRQS_OFF



* [patch V6 17/37] x86/entry/32: Remove common_exception
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (15 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 16/37] x86/entry/64: Remove error_exit Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:14   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 18/37] x86/irq: Use generic irq_regs implementation Thomas Gleixner
                   ` (22 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 65fd41fe77d9..c6e146b71de5 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1394,27 +1394,6 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, HYPERV_STIMER0_VECTOR,
 
 #endif /* CONFIG_HYPERV */
 
-SYM_CODE_START_LOCAL_NOALIGN(common_exception)
-	/* the function address is in %gs's slot on the stack */
-	SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
-	ENCODE_FRAME_POINTER
-
-	/* fixup %gs */
-	GS_TO_REG %ecx
-	movl	PT_GS(%esp), %edi		# get the function address
-	REG_TO_PTGS %ecx
-	SET_KERNEL_GS %ecx
-
-	/* fixup orig %eax */
-	movl	PT_ORIG_EAX(%esp), %edx		# get the error code
-	movl	$-1, PT_ORIG_EAX(%esp)		# no syscall to restart
-
-	TRACE_IRQS_OFF
-	movl	%esp, %eax			# pt_regs pointer
-	CALL_NOSPEC edi
-	jmp	ret_from_exception
-SYM_CODE_END(common_exception)
-
 SYM_CODE_START_LOCAL_NOALIGN(handle_exception)
 	/* the function address is in %gs's slot on the stack */
 	SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1



* [patch V6 18/37] x86/irq: Use generic irq_regs implementation
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (16 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 17/37] x86/entry/32: Remove common_exception Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-15 23:46 ` [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs Thomas Gleixner
                   ` (21 subsequent siblings)
  39 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


The only difference is the name of the per-CPU variable: irq_regs
vs. __irq_regs, but the accessor functions are identical.

Remove the pointless copy and use the generic variant.
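
For reference, the generic variant in include/asm-generic/irq_regs.h is
essentially the following (reproduced from memory for illustration; only
the name of the per-CPU variable differs):

DECLARE_PER_CPU(struct pt_regs *, __irq_regs);

static inline struct pt_regs *get_irq_regs(void)
{
	return __this_cpu_read(__irq_regs);
}

static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
{
	struct pt_regs *old_regs = get_irq_regs();

	__this_cpu_write(__irq_regs, new_regs);
	return old_regs;
}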

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/include/asm/irq_regs.h b/arch/x86/include/asm/irq_regs.h
deleted file mode 100644
index 187ce59aea28..000000000000
--- a/arch/x86/include/asm/irq_regs.h
+++ /dev/null
@@ -1,32 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Per-cpu current frame pointer - the location of the last exception frame on
- * the stack, stored in the per-cpu area.
- *
- * Jeremy Fitzhardinge <jeremy@goop.org>
- */
-#ifndef _ASM_X86_IRQ_REGS_H
-#define _ASM_X86_IRQ_REGS_H
-
-#include <asm/percpu.h>
-
-#define ARCH_HAS_OWN_IRQ_REGS
-
-DECLARE_PER_CPU(struct pt_regs *, irq_regs);
-
-static inline struct pt_regs *get_irq_regs(void)
-{
-	return __this_cpu_read(irq_regs);
-}
-
-static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
-{
-	struct pt_regs *old_regs;
-
-	old_regs = get_irq_regs();
-	__this_cpu_write(irq_regs, new_regs);
-
-	return old_regs;
-}
-
-#endif /* _ASM_X86_IRQ_REGS_32_H */
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c7965ff429c5..252065d32ab5 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -26,9 +26,6 @@
 DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 EXPORT_PER_CPU_SYMBOL(irq_stat);
 
-DEFINE_PER_CPU(struct pt_regs *, irq_regs);
-EXPORT_PER_CPU_SYMBOL(irq_regs);
-
 atomic_t irq_err_count;
 
 /*



* [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (17 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 18/37] x86/irq: Use generic irq_regs implementation Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:19   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 20/37] x86/irq/64: Provide handle_irq() Thomas Gleixner
                   ` (20 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Device interrupts which go through do_IRQ() or the spurious interrupt
handler have their separate entry code on 64 bit for no good reason.

Both 32 and 64 bit transport the vector number through ORIG_[RE]AX in
pt_regs. Further, the vector number is forced to fit into a u8 and is
complemented and offset by 0x80 so it's in the signed character
range. Otherwise GAS would expand the pushq to a 5 byte instruction for any
vector > 0x7F.

Treat the vector number like an error code and hand it to the C function as
an argument. This allows getting rid of the extra entry code in a later step.

Simplify the error code push magic by implementing the pushq imm8 via a
'.byte 0x6a, vector' sequence so GAS is not able to screw it up. As the
pushq imm8 is sign extending the resulting error code needs to be truncated
to 8 bits in C code.
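
A plain user space sketch of the sign extension and the truncation, for
illustration only:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* 'pushq $imm8' sign extends, so e.g. vector 0xec arrives in the
	 * error code slot as 0xffffffffffffffec ... */
	uint64_t error_code = (uint64_t)(int64_t)(int8_t)0xec;
	/* ... and masking off everything above bit 7 recovers the vector */
	uint8_t vector = error_code & 0xff;

	printf("pushed: %#llx  vector: %#x\n",
	       (unsigned long long)error_code, vector);
	return 0;
}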

Originally-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 1c7f13bb6728..98da0d3c0b1a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -341,7 +341,10 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
-#endif /* CONFIG_X86_64 */
+#else /* CONFIG_X86_64 */
+# undef		UNWIND_HINT_IRET_REGS
+# define	UNWIND_HINT_IRET_REGS
+#endif /* !CONFIG_X86_64 */
 
 .macro STACKLEAK_ERASE
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index c6e146b71de5..ce113d5613b6 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1215,40 +1215,15 @@ SYM_FUNC_END(entry_INT80_32)
 #endif
 .endm
 
-/*
- * Build the entry stubs with some assembler magic.
- * We pack 1 stub into every 8-byte block.
- */
-	.align 8
-SYM_CODE_START(irq_entries_start)
-    vector=FIRST_EXTERNAL_VECTOR
-    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
-	pushl	$(~vector+0x80)			/* Note: always in signed byte range */
-    vector=vector+1
-	jmp	common_interrupt
-	.align	8
-    .endr
-SYM_CODE_END(irq_entries_start)
-
 #ifdef CONFIG_X86_LOCAL_APIC
-	.align 8
-SYM_CODE_START(spurious_entries_start)
-    vector=FIRST_SYSTEM_VECTOR
-    .rept (NR_VECTORS - FIRST_SYSTEM_VECTOR)
-	pushl	$(~vector+0x80)			/* Note: always in signed byte range */
-    vector=vector+1
-	jmp	common_spurious
-	.align	8
-    .endr
-SYM_CODE_END(spurious_entries_start)
-
 SYM_CODE_START_LOCAL(common_spurious)
 	ASM_CLAC
-	addl	$-0x80, (%esp)			/* Adjust vector into the [-256, -1] range */
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	movl	%esp, %eax
+	movl	PT_ORIG_EAX(%esp), %edx		/* get the vector from stack */
+	movl	$-1, PT_ORIG_EAX(%esp)		/* no syscall to restart */
 	call	smp_spurious_interrupt
 	jmp	ret_from_intr
 SYM_CODE_END(common_spurious)
@@ -1261,12 +1236,12 @@ SYM_CODE_END(common_spurious)
 	.p2align CONFIG_X86_L1_CACHE_SHIFT
 SYM_CODE_START_LOCAL(common_interrupt)
 	ASM_CLAC
-	addl	$-0x80, (%esp)			/* Adjust vector into the [-256, -1] range */
-
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	movl	%esp, %eax
+	movl	PT_ORIG_EAX(%esp), %edx		/* get the vector from stack */
+	movl	$-1, PT_ORIG_EAX(%esp)		/* no syscall to restart */
 	call	do_IRQ
 	jmp	ret_from_intr
 SYM_CODE_END(common_interrupt)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e908f1440f36..9c8529035e58 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -358,34 +358,6 @@ SYM_CODE_START(ret_from_fork)
 SYM_CODE_END(ret_from_fork)
 .popsection
 
-/*
- * Build the entry stubs with some assembler magic.
- * We pack 1 stub into every 8-byte block.
- */
-	.align 8
-SYM_CODE_START(irq_entries_start)
-    vector=FIRST_EXTERNAL_VECTOR
-    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
-	UNWIND_HINT_IRET_REGS
-	pushq	$(~vector+0x80)			/* Note: always in signed byte range */
-	jmp	common_interrupt
-	.align	8
-	vector=vector+1
-    .endr
-SYM_CODE_END(irq_entries_start)
-
-	.align 8
-SYM_CODE_START(spurious_entries_start)
-    vector=FIRST_SYSTEM_VECTOR
-    .rept (NR_VECTORS - FIRST_SYSTEM_VECTOR)
-	UNWIND_HINT_IRET_REGS
-	pushq	$(~vector+0x80)			/* Note: always in signed byte range */
-	jmp	common_spurious
-	.align	8
-	vector=vector+1
-    .endr
-SYM_CODE_END(spurious_entries_start)
-
 .macro DEBUG_ENTRY_ASSERT_IRQS_OFF
 #ifdef CONFIG_DEBUG_ENTRY
 	pushq %rax
@@ -755,13 +727,14 @@ _ASM_NOKPROBE(interrupt_entry)
 /* Interrupt entry/exit. */
 
 /*
- * The interrupt stubs push (~vector+0x80) onto the stack and
+ * The interrupt stubs push vector onto the stack and
  * then jump to common_spurious/interrupt.
  */
 SYM_CODE_START_LOCAL(common_spurious)
-	addq	$-0x80, (%rsp)			/* Adjust vector to [-256, -1] range */
 	call	interrupt_entry
 	UNWIND_HINT_REGS indirect=1
+	movq	ORIG_RAX(%rdi), %rsi		/* get vector from stack */
+	movq	$-1, ORIG_RAX(%rdi)		/* no syscall to restart */
 	call	smp_spurious_interrupt		/* rdi points to pt_regs */
 	jmp	ret_from_intr
 SYM_CODE_END(common_spurious)
@@ -770,10 +743,11 @@ _ASM_NOKPROBE(common_spurious)
 /* common_interrupt is a hotpath. Align it */
 	.p2align CONFIG_X86_L1_CACHE_SHIFT
 SYM_CODE_START_LOCAL(common_interrupt)
-	addq	$-0x80, (%rsp)			/* Adjust vector to [-256, -1] range */
 	call	interrupt_entry
 	UNWIND_HINT_REGS indirect=1
-	call	do_IRQ	/* rdi points to pt_regs */
+	movq	ORIG_RAX(%rdi), %rsi		/* get vector from stack */
+	movq	$-1, ORIG_RAX(%rdi)		/* no syscall to restart */
+	call	do_IRQ				/* rdi points to pt_regs */
 	/* 0(%rsp): old RSP */
 ret_from_intr:
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -1022,7 +996,7 @@ apicinterrupt RESCHEDULE_VECTOR			reschedule_interrupt		smp_reschedule_interrupt
 #endif
 
 apicinterrupt ERROR_APIC_VECTOR			error_interrupt			smp_error_interrupt
-apicinterrupt SPURIOUS_APIC_VECTOR		spurious_interrupt		smp_spurious_interrupt
+apicinterrupt SPURIOUS_APIC_VECTOR		spurious_apic_interrupt		smp_spurious_apic_interrupt
 
 #ifdef CONFIG_IRQ_WORK
 apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 416422762845..cd57ce6134c9 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -35,7 +35,7 @@ BUILD_INTERRUPT(kvm_posted_intr_nested_ipi, POSTED_INTR_NESTED_VECTOR)
 
 BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
 BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
-BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)
+BUILD_INTERRUPT(spurious_apic_interrupt,SPURIOUS_APIC_VECTOR)
 BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
 
 #ifdef CONFIG_IRQ_WORK
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 4154bc5f6a4e..0ffe80792b2d 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -39,6 +39,7 @@ extern asmlinkage void irq_work_interrupt(void);
 extern asmlinkage void uv_bau_message_intr1(void);
 
 extern asmlinkage void spurious_interrupt(void);
+extern asmlinkage void spurious_apic_interrupt(void);
 extern asmlinkage void thermal_interrupt(void);
 extern asmlinkage void reschedule_interrupt(void);
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 3c239db60d7b..1fad257372e3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -324,6 +324,46 @@ __visible noinstr void func(struct pt_regs *regs,			\
 #define DECLARE_IDTENTRY_XEN(vector, func)				\
 	idtentry vector asm_exc_xen##func exc_##func has_error_code=0
 
+/*
+ * ASM code to emit the common vector entry stubs where each stub is
+ * packed into 8 bytes.
+ *
+ * Note, that the 'pushq imm8' is emitted via '.byte 0x6a, vector' because
+ * GCC treats the local vector variable as unsigned int and would expand
+ * all vectors above 0x7F to a 5 byte push. The original code did an
+ * adjustment of the vector number to be in the signed byte range to avoid
+ * this. While clever, it's mindbogglingly counterintuitive and requires the
+ * odd conversion back to a real vector number in the C entry points. Using
+ * .byte achieves the same thing and the only fixup needed in the C entry
+ * point is to mask off the bits above bit 7 because the push is sign
+ * extending.
+ */
+	.align 8
+SYM_CODE_START(irq_entries_start)
+    vector=FIRST_EXTERNAL_VECTOR
+    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
+	UNWIND_HINT_IRET_REGS
+	.byte	0x6a, vector
+	jmp	common_interrupt
+	.align	8
+    vector=vector+1
+    .endr
+SYM_CODE_END(irq_entries_start)
+
+#ifdef CONFIG_X86_LOCAL_APIC
+	.align 8
+SYM_CODE_START(spurious_entries_start)
+    vector=FIRST_SYSTEM_VECTOR
+    .rept (NR_VECTORS - FIRST_SYSTEM_VECTOR)
+	UNWIND_HINT_IRET_REGS
+	.byte	0x6a, vector
+	jmp	common_spurious
+	.align	8
+    vector=vector+1
+    .endr
+SYM_CODE_END(spurious_entries_start)
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 72fba0eeeb30..74690a373c58 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -36,7 +36,7 @@ extern void native_init_IRQ(void);
 
 extern void handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
-extern __visible void do_IRQ(struct pt_regs *regs);
+extern __visible void do_IRQ(struct pt_regs *regs, unsigned long vector);
 
 extern void init_ISA_irqs(void);
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index d7de360eec74..32b2becf7806 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -41,8 +41,9 @@ asmlinkage void smp_deferred_error_interrupt(struct pt_regs *regs);
 #endif
 
 void smp_apic_timer_interrupt(struct pt_regs *regs);
-void smp_spurious_interrupt(struct pt_regs *regs);
 void smp_error_interrupt(struct pt_regs *regs);
+void smp_spurious_apic_interrupt(struct pt_regs *regs);
+void smp_spurious_interrupt(struct pt_regs *regs, unsigned long vector);
 asmlinkage void smp_irq_move_cleanup_interrupt(void);
 
 #ifdef CONFIG_VMAP_STACK
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e53dda210cd7..a4218a3726f9 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2153,15 +2153,29 @@ void __init register_lapic_address(unsigned long address)
  * Local APIC interrupts
  */
 
-/*
- * This interrupt should _never_ happen with our APIC/SMP architecture
+/**
+ * smp_spurious_interrupt - Catch all for interrupts raised on unused vectors
+ * @regs:	Pointer to pt_regs on stack
+ * @vector:	The vector number is in the lower 8 bits
+ *
+ * This is invoked from ASM entry code to catch all interrupts which
+ * trigger on an entry which is routed to the common_spurious idtentry
+ * point.
+ *
+ * Also called from smp_spurious_apic_interrupt().
  */
-__visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs)
+__visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs,
+						  unsigned long vector)
 {
-	u8 vector = ~regs->orig_ax;
 	u32 v;
 
 	entering_irq();
+	/*
+	 * The push in the entry ASM code which stores the vector number on
+	 * the stack in the error code slot is sign expanding. Just use the
+	 * lower 8 bits.
+	 */
+	vector &= 0xFF;
 	trace_spurious_apic_entry(vector);
 
 	inc_irq_stat(irq_spurious_count);
@@ -2182,11 +2196,11 @@ __visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs)
 	 */
 	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 	if (v & (1 << (vector & 0x1f))) {
-		pr_info("Spurious interrupt (vector 0x%02x) on CPU#%d. Acked\n",
+		pr_info("Spurious interrupt (vector 0x%02lx) on CPU#%d. Acked\n",
 			vector, smp_processor_id());
 		ack_APIC_irq();
 	} else {
-		pr_info("Spurious interrupt (vector 0x%02x) on CPU#%d. Not pending!\n",
+		pr_info("Spurious interrupt (vector 0x%02lx) on CPU#%d. Not pending!\n",
 			vector, smp_processor_id());
 	}
 out:
@@ -2194,6 +2208,11 @@ __visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs)
 	exiting_irq();
 }
 
+__visible void smp_spurious_apic_interrupt(struct pt_regs *regs)
+{
+	smp_spurious_interrupt(regs, SPURIOUS_APIC_VECTOR);
+}
+
 /*
  * This interrupt should never happen with our APIC/SMP architecture
  */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index b4bd866568ee..3e49e1fac1c6 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -142,7 +142,7 @@ static const __initconst struct idt_data apic_idts[] = {
 #ifdef CONFIG_X86_UV
 	INTG(UV_BAU_MESSAGE,		uv_bau_message_intr1),
 #endif
-	INTG(SPURIOUS_APIC_VECTOR,	spurious_interrupt),
+	INTG(SPURIOUS_APIC_VECTOR,	spurious_apic_interrupt),
 	INTG(ERROR_APIC_VECTOR,		error_interrupt),
 #endif
 };
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 252065d32ab5..c7669363251a 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -227,14 +227,18 @@ u64 arch_irq_stat(void)
  * SMP cross-CPU interrupts have their own specific
  * handlers).
  */
-__visible void __irq_entry do_IRQ(struct pt_regs *regs)
+__visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
-	struct irq_desc * desc;
-	/* high bit used in ret_from_ code  */
-	unsigned vector = ~regs->orig_ax;
+	struct irq_desc *desc;
 
 	entering_irq();
+	/*
+	 * The push in the entry ASM code which stores the vector number on
+	 * the stack in the error code slot is sign expanding. Just use the
+	 * lower 8 bits.
+	 */
+	vector &= 0xFF;
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
@@ -249,7 +253,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs)
 		ack_APIC_irq();
 
 		if (desc == VECTOR_UNUSED) {
-			pr_emerg_ratelimited("%s: %d.%d No irq handler for vector\n",
+			pr_emerg_ratelimited("%s: %d.%lu No irq handler for vector\n",
 					     __func__, smp_processor_id(),
 					     vector);
 		} else {



* [patch V6 20/37] x86/irq/64: Provide handle_irq()
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (18 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:21   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 21/37] x86/entry: Add IRQENTRY_IRQ macro Thomas Gleixner
                   ` (19 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


To consolidate the interrupt entry/exit code vs. the other exceptions,
provide handle_irq() (similar to 32bit) to move the interrupt stack
switching to C code. That allows consolidating the entry/exit handling by
reusing the idtentry machinery both in ASM and C.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 62cff52e03c5..6087164e581c 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -79,3 +79,11 @@ void do_softirq_own_stack(void)
 	else
 		run_on_irqstack(__do_softirq, NULL);
 }
+
+void handle_irq(struct irq_desc *desc, struct pt_regs *regs)
+{
+	if (!irq_needs_irq_stack(regs))
+		generic_handle_irq_desc(desc);
+	else
+		run_on_irqstack(desc->handle_irq, desc);
+}



* [patch V6 21/37] x86/entry: Add IRQENTRY_IRQ macro
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (19 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 20/37] x86/irq/64: Provide handle_irq() Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:27   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 22/37] x86/entry: Use idtentry for interrupts Thomas Gleixner
                   ` (18 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Provide a separate IDTENTRY macro for device interrupts. Similar to
IDTENTRY_ERRORCODE with the addition of invoking irq_enter/exit_rcu() and
providing the error code as a 'u8' argument to the C function, which
truncates the sign extended vector number.
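
As a usage sketch only (the actual conversion of the device interrupt
entry points follows in a later patch; the vector constant and the body
below are placeholders), an entry point built with these macros would
look roughly like:

DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);

DEFINE_IDTENTRY_IRQ(common_interrupt)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	struct irq_desc *desc;

	/* 'vector' arrives already truncated to u8 by the wrapper */
	desc = __this_cpu_read(vector_irq[vector]);
	if (likely(!IS_ERR_OR_NULL(desc)))
		handle_irq(desc, regs);
	else
		ack_APIC_irq();

	set_irq_regs(old_regs);
}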

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ce113d5613b6..5ad4893ac31c 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -751,6 +751,20 @@ SYM_CODE_START(\asmsym)
 SYM_CODE_END(\asmsym)
 .endm
 
+.macro idtentry_irq vector cfunc
+	.p2align CONFIG_X86_L1_CACHE_SHIFT
+SYM_CODE_START_LOCAL(asm_\cfunc)
+	ASM_CLAC
+	SAVE_ALL switch_stacks=1
+	ENCODE_FRAME_POINTER
+	movl	%esp, %eax
+	movl	PT_ORIG_EAX(%esp), %edx		/* get the vector from stack */
+	movl	$-1, PT_ORIG_EAX(%esp)		/* no syscall to restart */
+	call	\cfunc
+	jmp	handle_exception_return
+SYM_CODE_END(asm_\cfunc)
+.endm
+
 /*
  * Include the defines which emit the idt entries which are shared
  * shared between 32 and 64 bit.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9c8529035e58..613f606e6dbf 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -528,6 +528,20 @@ SYM_CODE_END(\asmsym)
 .endm
 
 /*
+ * Interrupt entry/exit.
+ *
+ * The interrupt stubs push (vector) onto the stack, which is the error_code
+ * position of idtentry exceptions, and jump to one of the two idtentry points
+ * (common/spurious).
+ *
+ * common_interrupt is a hotpath, align it to a cache line
+ */
+.macro idtentry_irq vector cfunc
+	.p2align CONFIG_X86_L1_CACHE_SHIFT
+	idtentry \vector asm_\cfunc \cfunc has_error_code=1
+.endm
+
+/*
  * MCE and DB exceptions
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + (x) * 8)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1fad257372e3..5ea35a9f4562 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -163,6 +163,49 @@ __visible noinstr void func(struct pt_regs *regs)
 #define DEFINE_IDTENTRY_RAW_ERRORCODE(func)				\
 __visible noinstr void func(struct pt_regs *regs, unsigned long error_code)
 
+/**
+ * DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry
+ *			  points (common/spurious)
+ * @vector:	Vector number (ignored for C)
+ * @func:	Function name of the entry point
+ *
+ * Maps to DECLARE_IDTENTRY_ERRORCODE()
+ */
+#define DECLARE_IDTENTRY_IRQ(vector, func)				\
+	DECLARE_IDTENTRY_ERRORCODE(vector, func)
+
+/**
+ * DEFINE_IDTENTRY_IRQ - Emit code for device interrupt IDT entry points
+ * @func:	Function name of the entry point
+ *
+ * The vector number is pushed by the low level entry stub and handed
+ * to the function as the error_code argument, which needs to be truncated
+ * to a u8 because the push is sign extending.
+ *
+ * On 64bit idtentry_enter/exit() are invoked in the ASM entry code before
+ * and after switching to the interrupt stack. On 32bit this happens in C.
+ *
+ * irq_enter/exit_rcu() are invoked before the function body and the
+ * KVM L1D flush request is set.
+ */
+#define DEFINE_IDTENTRY_IRQ(func)					\
+static __always_inline void __##func(struct pt_regs *regs, u8 vector);	\
+									\
+__visible noinstr void func(struct pt_regs *regs,			\
+			    unsigned long error_code)			\
+{									\
+	idtentry_enter(regs);						\
+	instrumentation_begin();					\
+	irq_enter_rcu();						\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	__##func (regs, (u8)error_code);				\
+	irq_exit_rcu();							\
+	lockdep_hardirq_exit();						\
+	instrumentation_end();						\
+	idtentry_exit(regs);						\
+}									\
+									\
+static __always_inline void __##func(struct pt_regs *regs, u8 vector)
 
 #ifdef CONFIG_X86_64
 /**
@@ -295,6 +338,10 @@ __visible noinstr void func(struct pt_regs *regs,			\
 #define DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func)			\
 	DECLARE_IDTENTRY_ERRORCODE(vector, func)
 
+/* Entries for common/spurious (device) interrupts */
+#define DECLARE_IDTENTRY_IRQ(vector, func)				\
+	idtentry_irq vector func
+
 #ifdef CONFIG_X86_64
 # define DECLARE_IDTENTRY_MCE(vector, func)				\
 	idtentry_mce_db vector asm_##func func



* [patch V6 22/37] x86/entry: Use idtentry for interrupts
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (20 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 21/37] x86/entry: Add IRQENTRY_IRQ macro Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 20:28   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC Thomas Gleixner
                   ` (17 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Replace the extra interrupt handling code and reuse the existing idtentry
machinery. This moves the irq stack switching on 64 bit from ASM to C code;
32bit already does the stack switching in C.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 5ad4893ac31c..9723b7803b17 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1229,37 +1229,6 @@ SYM_FUNC_END(entry_INT80_32)
 #endif
 .endm
 
-#ifdef CONFIG_X86_LOCAL_APIC
-SYM_CODE_START_LOCAL(common_spurious)
-	ASM_CLAC
-	SAVE_ALL switch_stacks=1
-	ENCODE_FRAME_POINTER
-	TRACE_IRQS_OFF
-	movl	%esp, %eax
-	movl	PT_ORIG_EAX(%esp), %edx		/* get the vector from stack */
-	movl	$-1, PT_ORIG_EAX(%esp)		/* no syscall to restart */
-	call	smp_spurious_interrupt
-	jmp	ret_from_intr
-SYM_CODE_END(common_spurious)
-#endif
-
-/*
- * the CPU automatically disables interrupts when executing an IRQ vector,
- * so IRQ-flags tracing has to follow that:
- */
-	.p2align CONFIG_X86_L1_CACHE_SHIFT
-SYM_CODE_START_LOCAL(common_interrupt)
-	ASM_CLAC
-	SAVE_ALL switch_stacks=1
-	ENCODE_FRAME_POINTER
-	TRACE_IRQS_OFF
-	movl	%esp, %eax
-	movl	PT_ORIG_EAX(%esp), %edx		/* get the vector from stack */
-	movl	$-1, PT_ORIG_EAX(%esp)		/* no syscall to restart */
-	call	do_IRQ
-	jmp	ret_from_intr
-SYM_CODE_END(common_interrupt)
-
 #define BUILD_INTERRUPT3(name, nr, fn)			\
 SYM_FUNC_START(name)					\
 	ASM_CLAC;					\
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 613f606e6dbf..8faf54389cd1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -737,32 +737,7 @@ SYM_CODE_START(interrupt_entry)
 SYM_CODE_END(interrupt_entry)
 _ASM_NOKPROBE(interrupt_entry)
 
-
-/* Interrupt entry/exit. */
-
-/*
- * The interrupt stubs push vector onto the stack and
- * then jump to common_spurious/interrupt.
- */
-SYM_CODE_START_LOCAL(common_spurious)
-	call	interrupt_entry
-	UNWIND_HINT_REGS indirect=1
-	movq	ORIG_RAX(%rdi), %rsi		/* get vector from stack */
-	movq	$-1, ORIG_RAX(%rdi)		/* no syscall to restart */
-	call	smp_spurious_interrupt		/* rdi points to pt_regs */
-	jmp	ret_from_intr
-SYM_CODE_END(common_spurious)
-_ASM_NOKPROBE(common_spurious)
-
-/* common_interrupt is a hotpath. Align it */
-	.p2align CONFIG_X86_L1_CACHE_SHIFT
-SYM_CODE_START_LOCAL(common_interrupt)
-	call	interrupt_entry
-	UNWIND_HINT_REGS indirect=1
-	movq	ORIG_RAX(%rdi), %rsi		/* get vector from stack */
-	movq	$-1, ORIG_RAX(%rdi)		/* no syscall to restart */
-	call	do_IRQ				/* rdi points to pt_regs */
-	/* 0(%rsp): old RSP */
+SYM_CODE_START_LOCAL(common_interrupt_return)
 ret_from_intr:
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
@@ -945,8 +920,8 @@ native_irq_return_ldt:
 	 */
 	jmp	native_irq_return_iret
 #endif
-SYM_CODE_END(common_interrupt)
-_ASM_NOKPROBE(common_interrupt)
+SYM_CODE_END(common_interrupt_return)
+_ASM_NOKPROBE(common_interrupt_return)
 
 /*
  * APIC interrupts.
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 0ffe80792b2d..3213d36b92d3 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -38,7 +38,6 @@ extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
 extern asmlinkage void uv_bau_message_intr1(void);
 
-extern asmlinkage void spurious_interrupt(void);
 extern asmlinkage void spurious_apic_interrupt(void);
 extern asmlinkage void thermal_interrupt(void);
 extern asmlinkage void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5ea35a9f4562..29bcb6315ab9 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -391,7 +391,7 @@ SYM_CODE_START(irq_entries_start)
     .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
 	UNWIND_HINT_IRET_REGS
 	.byte	0x6a, vector
-	jmp	common_interrupt
+	jmp	asm_common_interrupt
 	.align	8
     vector=vector+1
     .endr
@@ -404,7 +404,7 @@ SYM_CODE_START(spurious_entries_start)
     .rept (NR_VECTORS - FIRST_SYSTEM_VECTOR)
 	UNWIND_HINT_IRET_REGS
 	.byte	0x6a, vector
-	jmp	common_spurious
+	jmp	asm_spurious_interrupt
 	.align	8
     vector=vector+1
     .endr
@@ -475,6 +475,12 @@ DECLARE_IDTENTRY_DF(X86_TRAP_DF,	exc_double_fault);
 DECLARE_IDTENTRY(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 #endif
 
+/* Device interrupts common/spurious */
+DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
+#ifdef CONFIG_X86_LOCAL_APIC
+DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	spurious_interrupt);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 74690a373c58..ae10083b7631 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -36,8 +36,6 @@ extern void native_init_IRQ(void);
 
 extern void handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
-extern __visible void do_IRQ(struct pt_regs *regs, unsigned long vector);
-
 extern void init_ISA_irqs(void);
 
 extern void __init init_IRQ(void);
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 32b2becf7806..97e6945bfce8 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -43,7 +43,6 @@ asmlinkage void smp_deferred_error_interrupt(struct pt_regs *regs);
 void smp_apic_timer_interrupt(struct pt_regs *regs);
 void smp_error_interrupt(struct pt_regs *regs);
 void smp_spurious_apic_interrupt(struct pt_regs *regs);
-void smp_spurious_interrupt(struct pt_regs *regs, unsigned long vector);
 asmlinkage void smp_irq_move_cleanup_interrupt(void);
 
 #ifdef CONFIG_VMAP_STACK
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a4218a3726f9..66c3cfcd6d47 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2154,9 +2154,9 @@ void __init register_lapic_address(unsigned long address)
  */
 
 /**
- * smp_spurious_interrupt - Catch all for interrupts raised on unused vectors
+ * spurious_interrupt - Catch all for interrupts raised on unused vectors
  * @regs:	Pointer to pt_regs on stack
- * @error_code:	The vector number is in the lower 8 bits
+ * @vector:	The vector number
  *
  * This is invoked from ASM entry code to catch all interrupts which
  * trigger on an entry which is routed to the common_spurious idtentry
@@ -2164,18 +2164,10 @@ void __init register_lapic_address(unsigned long address)
  *
  * Also called from smp_spurious_apic_interrupt().
  */
-__visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs,
-						  unsigned long vector)
+DEFINE_IDTENTRY_IRQ(spurious_interrupt)
 {
 	u32 v;
 
-	entering_irq();
-	/*
-	 * The push in the entry ASM code which stores the vector number on
-	 * the stack in the error code slot is sign expanding. Just use the
-	 * lower 8 bits.
-	 */
-	vector &= 0xFF;
 	trace_spurious_apic_entry(vector);
 
 	inc_irq_stat(irq_spurious_count);
@@ -2196,21 +2188,22 @@ __visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs,
 	 */
 	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
 	if (v & (1 << (vector & 0x1f))) {
-		pr_info("Spurious interrupt (vector 0x%02lx) on CPU#%d. Acked\n",
+		pr_info("Spurious interrupt (vector 0x%02x) on CPU#%d. Acked\n",
 			vector, smp_processor_id());
 		ack_APIC_irq();
 	} else {
-		pr_info("Spurious interrupt (vector 0x%02lx) on CPU#%d. Not pending!\n",
+		pr_info("Spurious interrupt (vector 0x%02x) on CPU#%d. Not pending!\n",
 			vector, smp_processor_id());
 	}
 out:
 	trace_spurious_apic_exit(vector);
-	exiting_irq();
 }
 
 __visible void smp_spurious_apic_interrupt(struct pt_regs *regs)
 {
-	smp_spurious_interrupt(regs, SPURIOUS_APIC_VECTOR);
+	entering_irq();
+	__spurious_interrupt(regs, SPURIOUS_APIC_VECTOR);
+	exiting_irq();
 }
 
 /*
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 159bd0cb8548..5cbaca58af95 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -115,7 +115,8 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	 * denote it as spurious which is no harm as this is a rare event
 	 * and interrupt handlers have to cope with spurious interrupts
 	 * anyway. If the vector is unused, then it is marked so it won't
-	 * trigger the 'No irq handler for vector' warning in do_IRQ().
+	 * trigger the 'No irq handler for vector' warning in
+	 * common_interrupt().
 	 *
 	 * This requires to hold vector lock to prevent concurrent updates to
 	 * the affected vector.
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c7669363251a..63b848e1fe68 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -19,6 +19,7 @@
 #include <asm/mce.h>
 #include <asm/hw_irq.h>
 #include <asm/desc.h>
+#include <asm/traps.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/irq_vectors.h>
@@ -223,37 +224,25 @@ u64 arch_irq_stat(void)
 
 
 /*
- * do_IRQ handles all normal device IRQ's (the special
- * SMP cross-CPU interrupts have their own specific
- * handlers).
+ * common_interrupt() handles all normal device IRQ's (the special SMP
+ * cross-CPU interrupts have their own entry points).
  */
-__visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
+DEFINE_IDTENTRY_IRQ(common_interrupt)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
 
-	entering_irq();
-	/*
-	 * The push in the entry ASM code which stores the vector number on
-	 * the stack in the error code slot is sign expanding. Just use the
-	 * lower 8 bits.
-	 */
-	vector &= 0xFF;
-
-	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
+	/* entry code tells RCU that we're not quiescent.  Check it. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
-		if (IS_ENABLED(CONFIG_X86_32))
-			handle_irq(desc, regs);
-		else
-			generic_handle_irq_desc(desc);
+		handle_irq(desc, regs);
 	} else {
 		ack_APIC_irq();
 
 		if (desc == VECTOR_UNUSED) {
-			pr_emerg_ratelimited("%s: %d.%lu No irq handler for vector\n",
+			pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
 					     __func__, smp_processor_id(),
 					     vector);
 		} else {
@@ -261,8 +250,6 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
 		}
 	}
 
-	exiting_irq();
-
 	set_irq_regs(old_regs);
 }
 



* [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (21 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 22/37] x86/entry: Use idtentry for interrupts Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:29   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 24/37] x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC Thomas Gleixner
                   ` (16 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Provide an IDTENTRY variant for system vectors to consolidate the different
mechanisms for emitting the ASM stubs for 32 and 64 bit.

On 64bit this also moves the stack switching from ASM to C code. 32bit will
execute the system vectors without stack switching, as before. As some of the
system vector handlers require access to pt_regs, this requires a new stack
switching macro which can handle an argument.

The alternative solution would be to implement the set_irq_regs() dance
right in the entry macro, but most system vector handlers do not require
it, so avoid the overhead.

Provide the entry/exit handling as inline functions so the scheduler IPI
can use it to implement lightweight entry handling depending on trace point
enablement. This ensures that the code is consistent.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
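
Usage-wise the two variants differ as sketched below. The sysvec_example_* names
are made up for illustration only; the real conversions follow in the subsequent
patches.

  /* Full variant: runs irq_enter/exit_rcu() and executes the body on the
   * interrupt stack if the entry came from kernel mode.
   */
  DEFINE_IDTENTRY_SYSVEC(sysvec_example_full)
  {
	ack_APIC_irq();
	/* real work which may need stack space */
  }

  /* Simple variant: stays on the interrupted stack and only does the raw
   * __irq_enter/exit_raw() accounting. For (almost) empty handlers.
   */
  DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_example_trivial)
  {
	ack_APIC_irq();
	/* e.g. just bump a statistics counter */
  }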

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 9723b7803b17..1db655409dbf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -765,6 +765,10 @@ SYM_CODE_START_LOCAL(asm_\cfunc)
 SYM_CODE_END(asm_\cfunc)
 .endm
 
+.macro idtentry_sysvec vector cfunc
+	idtentry \vector asm_\cfunc \cfunc has_error_code=0
+.endm
+
 /*
  * Include the defines which emit the idt entries which are shared
  * shared between 32 and 64 bit.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 8faf54389cd1..2c0eb5c2100a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -542,6 +542,14 @@ SYM_CODE_END(\asmsym)
 .endm
 
 /*
+ * System vectors which invoke their handlers directly and are not
+ * going through the regular common device interrupt handling code.
+ */
+.macro idtentry_sysvec vector cfunc
+	idtentry \vector asm_\cfunc \cfunc has_error_code=0
+.endm
+
+/*
  * MCE and DB exceptions
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + (x) * 8)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 29bcb6315ab9..43658fcedae8 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -6,6 +6,9 @@
 #include <asm/trapnr.h>
 
 #ifndef __ASSEMBLY__
+#include <linux/hardirq.h>
+
+#include <asm/irq_stack.h>
 
 void idtentry_enter(struct pt_regs *regs);
 void idtentry_exit(struct pt_regs *regs);
@@ -207,6 +210,85 @@ __visible noinstr void func(struct pt_regs *regs,			\
 									\
 static __always_inline void __##func(struct pt_regs *regs, u8 vector)
 
+/**
+ * DECLARE_IDTENTRY_SYSVEC - Declare functions for system vector entry points
+ * @vector:	Vector number (ignored for C)
+ * @func:	Function name of the entry point
+ *
+ * Declares three functions:
+ * - The ASM entry point: asm_##func
+ * - The XEN PV trap entry point: xen_##func (maybe unused)
+ * - The C handler called from the ASM entry point
+ *
+ * Maps to DECLARE_IDTENTRY().
+ */
+#define DECLARE_IDTENTRY_SYSVEC(vector, func)				\
+	DECLARE_IDTENTRY(vector, func)
+
+
+/**
+ * DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
+ * @func:	Function name of the entry point
+ *
+ * idtentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
+ * function body. KVM L1D flush request is set.
+ *
+ * Runs the function on the interrupt stack if the entry hit kernel mode
+ */
+#define DEFINE_IDTENTRY_SYSVEC(func)					\
+static void __##func(struct pt_regs *regs);				\
+									\
+__visible noinstr void func(struct pt_regs *regs)			\
+{									\
+	idtentry_enter(regs);						\
+	instrumentation_begin();					\
+	irq_enter_rcu();						\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	if (!irq_needs_irq_stack(regs))					\
+		__##func (regs);					\
+	else								\
+		run_on_irqstack(__##func, regs);			\
+	irq_exit_rcu();							\
+	lockdep_hardirq_exit();						\
+	instrumentation_end();						\
+	idtentry_exit(regs);						\
+}									\
+									\
+static noinline void __##func(struct pt_regs *regs)
+
+/**
+ * DEFINE_IDTENTRY_SYSVEC_SIMPLE - Emit code for simple system vector IDT
+ *				   entry points
+ * @func:	Function name of the entry point
+ *
+ * Runs the function on the interrupted stack. No switch to IRQ stack.
+ * Used for 'empty' vectors like reschedule IPI and KVM posted interrupt
+ * vectors.
+ *
+ * Uses conditional RCU and does not invoke irq_enter/exit_rcu() to avoid
+ * the overhead. This is correct vs. NOHZ. If this hits in dynticks idle
+ * then the exit path from the inner idle loop will restart the tick.  If
+ * it hits user mode with ticks off then the scheduler will take care of
+ * restarting it.
+ */
+#define DEFINE_IDTENTRY_SYSVEC_SIMPLE(func)				\
+static void __##func(struct pt_regs *regs);				\
+									\
+__visible noinstr void func(struct pt_regs *regs)			\
+{									\
+	bool rcu_exit = idtentry_enter_cond_rcu(regs);			\
+									\
+	instrumentation_begin();					\
+	__irq_enter_raw();						\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	__##func (regs);						\
+	__irq_exit_raw();						\
+	instrumentation_end();						\
+	idtentry_exit_cond_rcu(regs, rcu_exit);				\
+}									\
+									\
+static void __##func(struct pt_regs *regs)
+
 #ifdef CONFIG_X86_64
 /**
  * DECLARE_IDTENTRY_IST - Declare functions for IST handling IDT entry points
@@ -342,6 +424,10 @@ __visible noinstr void func(struct pt_regs *regs,			\
 #define DECLARE_IDTENTRY_IRQ(vector, func)				\
 	idtentry_irq vector func
 
+/* System vector entries */
+#define DECLARE_IDTENTRY_SYSVEC(vector, func)				\
+	idtentry_sysvec vector func
+
 #ifdef CONFIG_X86_64
 # define DECLARE_IDTENTRY_MCE(vector, func)				\
 	idtentry_mce_db vector asm_##func func



* [patch V6 24/37] x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (22 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:27   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 25/37] x86/entry: Convert SMP system vectors " Thomas Gleixner
                   ` (15 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert APIC interrupts to IDTENTRY_SYSVEC
  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64bit
  - Remove the BUILD_INTERRUPT entries in 32bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
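
The pieces of each conversion fit together as sketched below, using the local
APIC timer as the example; the names are taken from the hunks that follow and
the same pattern repeats for the other vectors.

  /* idtentry.h: declares the C handler and emits the ASM stub
   * asm_sysvec_apic_timer_interrupt.
   */
  DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR, sysvec_apic_timer_interrupt);

  /* apic.c: the C handler, formerly smp_apic_timer_interrupt() */
  DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt)
  {
	/* handler body */
  }

  /* idt.c: the IDT gate points to the ASM stub, not the C handler */
  INTG(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),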

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2c0eb5c2100a..6e629e5d5047 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -965,9 +965,6 @@ apicinterrupt3 REBOOT_VECTOR			reboot_interrupt		smp_reboot_interrupt
 apicinterrupt3 UV_BAU_MESSAGE			uv_bau_message_intr1		uv_bau_message_interrupt
 #endif
 
-apicinterrupt LOCAL_TIMER_VECTOR		apic_timer_interrupt		smp_apic_timer_interrupt
-apicinterrupt X86_PLATFORM_IPI_VECTOR		x86_platform_ipi		smp_x86_platform_ipi
-
 #ifdef CONFIG_HAVE_KVM
 apicinterrupt3 POSTED_INTR_VECTOR		kvm_posted_intr_ipi		smp_kvm_posted_intr_ipi
 apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR	kvm_posted_intr_wakeup_ipi	smp_kvm_posted_intr_wakeup_ipi
@@ -992,9 +989,6 @@ apicinterrupt CALL_FUNCTION_VECTOR		call_function_interrupt		smp_call_function_i
 apicinterrupt RESCHEDULE_VECTOR			reschedule_interrupt		smp_reschedule_interrupt
 #endif
 
-apicinterrupt ERROR_APIC_VECTOR			error_interrupt			smp_error_interrupt
-apicinterrupt SPURIOUS_APIC_VECTOR		spurious_apic_interrupt		smp_spurious_apic_interrupt
-
 #ifdef CONFIG_IRQ_WORK
 apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 #endif
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index cd57ce6134c9..d10d6d807e73 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -33,11 +33,6 @@ BUILD_INTERRUPT(kvm_posted_intr_nested_ipi, POSTED_INTR_NESTED_VECTOR)
  */
 #ifdef CONFIG_X86_LOCAL_APIC
 
-BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
-BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
-BUILD_INTERRUPT(spurious_apic_interrupt,SPURIOUS_APIC_VECTOR)
-BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
-
 #ifdef CONFIG_IRQ_WORK
 BUILD_INTERRUPT(irq_work_interrupt, IRQ_WORK_VECTOR)
 #endif
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 3213d36b92d3..1765993360e7 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,16 +29,12 @@
 #include <asm/sections.h>
 
 /* Interrupt handlers registered during init_IRQ */
-extern asmlinkage void apic_timer_interrupt(void);
-extern asmlinkage void x86_platform_ipi(void);
 extern asmlinkage void kvm_posted_intr_ipi(void);
 extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void kvm_posted_intr_nested_ipi(void);
-extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
 extern asmlinkage void uv_bau_message_intr1(void);
 
-extern asmlinkage void spurious_apic_interrupt(void);
 extern asmlinkage void thermal_interrupt(void);
 extern asmlinkage void reschedule_interrupt(void);
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 43658fcedae8..6154d1e75fce 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -567,6 +567,14 @@ DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	spurious_interrupt);
 #endif
 
+/* System vector entry points */
+#ifdef CONFIG_X86_LOCAL_APIC
+DECLARE_IDTENTRY_SYSVEC(ERROR_APIC_VECTOR,		sysvec_error_interrupt);
+DECLARE_IDTENTRY_SYSVEC(SPURIOUS_APIC_VECTOR,		sysvec_spurious_apic_interrupt);
+DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt);
+DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index ae10083b7631..112c85673179 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -44,7 +44,6 @@ extern void __init init_IRQ(void);
 void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
 				    bool exclude_self);
 
-extern __visible void smp_x86_platform_ipi(struct pt_regs *regs);
 #define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 97e6945bfce8..933934c3e173 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -40,9 +40,6 @@ asmlinkage void smp_threshold_interrupt(struct pt_regs *regs);
 asmlinkage void smp_deferred_error_interrupt(struct pt_regs *regs);
 #endif
 
-void smp_apic_timer_interrupt(struct pt_regs *regs);
-void smp_error_interrupt(struct pt_regs *regs);
-void smp_spurious_apic_interrupt(struct pt_regs *regs);
 asmlinkage void smp_irq_move_cleanup_interrupt(void);
 
 #ifdef CONFIG_VMAP_STACK
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 66c3cfcd6d47..6a8e9f343a29 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1121,23 +1121,14 @@ static void local_apic_timer_interrupt(void)
  * [ if a single-CPU system runs an SMP kernel then we call the local
  *   interrupt as well. Thus we cannot inline the local irq ... ]
  */
-__visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	/*
-	 * NOTE! We'd better ACK the irq immediately,
-	 * because timer handling can be slow.
-	 *
-	 * update_process_times() expects us to have done irq_enter().
-	 * Besides, if we don't timer interrupts ignore the global
-	 * interrupt lock, which is the WrongThing (tm) to do.
-	 */
-	entering_ack_irq();
+	ack_APIC_irq();
 	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
-	exiting_irq();
 
 	set_irq_regs(old_regs);
 }
@@ -2162,7 +2153,7 @@ void __init register_lapic_address(unsigned long address)
  * trigger on an entry which is routed to the common_spurious idtentry
  * point.
  *
- * Also called from smp_spurious_apic_interrupt().
+ * Also called from sysvec_spurious_apic_interrupt().
  */
 DEFINE_IDTENTRY_IRQ(spurious_interrupt)
 {
@@ -2199,17 +2190,15 @@ DEFINE_IDTENTRY_IRQ(spurious_interrupt)
 	trace_spurious_apic_exit(vector);
 }
 
-__visible void smp_spurious_apic_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_spurious_apic_interrupt)
 {
-	entering_irq();
 	__spurious_interrupt(regs, SPURIOUS_APIC_VECTOR);
-	exiting_irq();
 }
 
 /*
  * This interrupt should never happen with our APIC/SMP architecture
  */
-__visible void __irq_entry smp_error_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_error_interrupt)
 {
 	static const char * const error_interrupt_reason[] = {
 		"Send CS error",		/* APIC Error Bit 0 */
@@ -2223,7 +2212,6 @@ __visible void __irq_entry smp_error_interrupt(struct pt_regs *regs)
 	};
 	u32 v, i = 0;
 
-	entering_irq();
 	trace_error_apic_entry(ERROR_APIC_VECTOR);
 
 	/* First tickle the hardware, only then report what went on. -- REW */
@@ -2247,7 +2235,6 @@ __visible void __irq_entry smp_error_interrupt(struct pt_regs *regs)
 	apic_printk(APIC_DEBUG, KERN_CONT "\n");
 
 	trace_error_apic_exit(ERROR_APIC_VECTOR);
-	exiting_irq();
 }
 
 /**
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 3e49e1fac1c6..7e38af51480d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -129,8 +129,8 @@ static const __initconst struct idt_data apic_idts[] = {
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
-	INTG(LOCAL_TIMER_VECTOR,	apic_timer_interrupt),
-	INTG(X86_PLATFORM_IPI_VECTOR,	x86_platform_ipi),
+	INTG(LOCAL_TIMER_VECTOR,	asm_sysvec_apic_timer_interrupt),
+	INTG(X86_PLATFORM_IPI_VECTOR,	asm_sysvec_x86_platform_ipi),
 # ifdef CONFIG_HAVE_KVM
 	INTG(POSTED_INTR_VECTOR,	kvm_posted_intr_ipi),
 	INTG(POSTED_INTR_WAKEUP_VECTOR, kvm_posted_intr_wakeup_ipi),
@@ -142,8 +142,8 @@ static const __initconst struct idt_data apic_idts[] = {
 #ifdef CONFIG_X86_UV
 	INTG(UV_BAU_MESSAGE,		uv_bau_message_intr1),
 #endif
-	INTG(SPURIOUS_APIC_VECTOR,	spurious_apic_interrupt),
-	INTG(ERROR_APIC_VECTOR,		error_interrupt),
+	INTG(SPURIOUS_APIC_VECTOR,	asm_sysvec_spurious_apic_interrupt),
+	INTG(ERROR_APIC_VECTOR,		asm_sysvec_error_interrupt),
 #endif
 };
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 63b848e1fe68..43f95ec5f131 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -14,6 +14,7 @@
 #include <linux/irq.h>
 
 #include <asm/apic.h>
+#include <asm/traps.h>
 #include <asm/io_apic.h>
 #include <asm/irq.h>
 #include <asm/mce.h>
@@ -259,17 +260,16 @@ void (*x86_platform_ipi_callback)(void) = NULL;
 /*
  * Handler for X86_PLATFORM_IPI_VECTOR.
  */
-__visible void __irq_entry smp_x86_platform_ipi(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	ack_APIC_irq();
 	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
 	inc_irq_stat(x86_platform_ipis);
 	if (x86_platform_ipi_callback)
 		x86_platform_ipi_callback();
 	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
-	exiting_irq();
 	set_irq_regs(old_regs);
 }
 #endif



* [patch V6 25/37] x86/entry: Convert SMP system vectors to IDTENTRY_SYSVEC
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (23 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 24/37] x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:28   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 26/37] x86/entry: Convert various system vectors Thomas Gleixner
                   ` (14 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert SMP system vectors to IDTENTRY_SYSVEC
  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64bit
  - Remove the BUILD_INTERRUPT entries in 32bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
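
A recurring detail in these conversions: the open-coded entry accounting in the
handler bodies collapses because the IDTENTRY_SYSVEC macro already provides it.
Roughly (sketch only, see the hunks below):

  /* Before: the handler did the accounting itself */
  ipi_entering_ack_irq();	/* irq_enter() + ack_APIC_irq() + L1D flush request */
  /* ... handler work ... */
  exiting_irq();		/* irq_exit() */

  /* After: DEFINE_IDTENTRY_SYSVEC() already wraps the body with
   * irq_enter/exit_rcu() and kvm_set_cpu_l1tf_flush_l1d(), so only the
   * APIC ack remains in the handler body.
   */
  ack_APIC_irq();
  /* ... handler work ... */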


diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6e629e5d5047..c1a47d37f29e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -956,11 +956,6 @@ apicinterrupt3 \num \sym \do_sym
 POP_SECTION_IRQENTRY
 .endm
 
-#ifdef CONFIG_SMP
-apicinterrupt3 IRQ_MOVE_CLEANUP_VECTOR		irq_move_cleanup_interrupt	smp_irq_move_cleanup_interrupt
-apicinterrupt3 REBOOT_VECTOR			reboot_interrupt		smp_reboot_interrupt
-#endif
-
 #ifdef CONFIG_X86_UV
 apicinterrupt3 UV_BAU_MESSAGE			uv_bau_message_intr1		uv_bau_message_interrupt
 #endif
@@ -984,8 +979,6 @@ apicinterrupt THERMAL_APIC_VECTOR		thermal_interrupt		smp_thermal_interrupt
 #endif
 
 #ifdef CONFIG_SMP
-apicinterrupt CALL_FUNCTION_SINGLE_VECTOR	call_function_single_interrupt	smp_call_function_single_interrupt
-apicinterrupt CALL_FUNCTION_VECTOR		call_function_interrupt		smp_call_function_interrupt
 apicinterrupt RESCHEDULE_VECTOR			reschedule_interrupt		smp_reschedule_interrupt
 #endif
 
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index d10d6d807e73..2e2055bcfeb2 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -12,10 +12,6 @@
  */
 #ifdef CONFIG_SMP
 BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
-BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
-BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
-BUILD_INTERRUPT(irq_move_cleanup_interrupt, IRQ_MOVE_CLEANUP_VECTOR)
-BUILD_INTERRUPT(reboot_interrupt, REBOOT_VECTOR)
 #endif
 
 #ifdef CONFIG_HAVE_KVM
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 1765993360e7..36a38695f27f 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -38,14 +38,9 @@ extern asmlinkage void uv_bau_message_intr1(void);
 extern asmlinkage void thermal_interrupt(void);
 extern asmlinkage void reschedule_interrupt(void);
 
-extern asmlinkage void irq_move_cleanup_interrupt(void);
-extern asmlinkage void reboot_interrupt(void);
 extern asmlinkage void threshold_interrupt(void);
 extern asmlinkage void deferred_error_interrupt(void);
 
-extern asmlinkage void call_function_interrupt(void);
-extern asmlinkage void call_function_single_interrupt(void);
-
 #ifdef	CONFIG_X86_LOCAL_APIC
 struct irq_data;
 struct pci_dev;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 6154d1e75fce..20e47b8d4024 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -575,6 +575,13 @@ DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt);
 DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
 #endif
 
+#ifdef CONFIG_SMP
+DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,	sysvec_irq_move_cleanup);
+DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
+DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
+DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 933934c3e173..0c40f37f8cb7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -40,8 +40,6 @@ asmlinkage void smp_threshold_interrupt(struct pt_regs *regs);
 asmlinkage void smp_deferred_error_interrupt(struct pt_regs *regs);
 #endif
 
-asmlinkage void smp_irq_move_cleanup_interrupt(void);
-
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(const char *message,
 				      struct pt_regs *regs,
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 67768e54438b..c48be6e1f676 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -861,13 +861,13 @@ static void free_moved_vector(struct apic_chip_data *apicd)
 	apicd->move_in_progress = 0;
 }
 
-asmlinkage __visible void __irq_entry smp_irq_move_cleanup_interrupt(void)
+DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
 {
 	struct hlist_head *clhead = this_cpu_ptr(&cleanup_list);
 	struct apic_chip_data *apicd;
 	struct hlist_node *tmp;
 
-	entering_ack_irq();
+	ack_APIC_irq();
 	/* Prevent vectors vanishing under us */
 	raw_spin_lock(&vector_lock);
 
@@ -892,7 +892,6 @@ asmlinkage __visible void __irq_entry smp_irq_move_cleanup_interrupt(void)
 	}
 
 	raw_spin_unlock(&vector_lock);
-	exiting_irq();
 }
 
 static void __send_cleanup_vector(struct apic_chip_data *apicd)
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 7e38af51480d..95abcd52e5be 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -109,11 +109,11 @@ static const __initconst struct idt_data def_idts[] = {
  */
 static const __initconst struct idt_data apic_idts[] = {
 #ifdef CONFIG_SMP
-	INTG(RESCHEDULE_VECTOR,		reschedule_interrupt),
-	INTG(CALL_FUNCTION_VECTOR,	call_function_interrupt),
-	INTG(CALL_FUNCTION_SINGLE_VECTOR, call_function_single_interrupt),
-	INTG(IRQ_MOVE_CLEANUP_VECTOR,	irq_move_cleanup_interrupt),
-	INTG(REBOOT_VECTOR,		reboot_interrupt),
+	INTG(RESCHEDULE_VECTOR,			reschedule_interrupt),
+	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
+	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
+	INTG(IRQ_MOVE_CLEANUP_VECTOR,		asm_sysvec_irq_move_cleanup),
+	INTG(REBOOT_VECTOR,			asm_sysvec_reboot),
 #endif
 
 #ifdef CONFIG_X86_THERMAL_VECTOR
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index b8d4e9c3c070..e5647daa7e96 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
 #include <asm/trace/irq_vectors.h>
@@ -130,13 +131,11 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 /*
  * this function calls the 'stop' function on all other CPUs in the system.
  */
-
-asmlinkage __visible void smp_reboot_interrupt(void)
+DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
-	ipi_entering_ack_irq();
+	ack_APIC_irq();
 	cpu_emergency_vmxoff();
 	stop_this_cpu(NULL);
-	irq_exit();
 }
 
 static int register_stop_handler(void)
@@ -227,7 +226,6 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
 	inc_irq_stat(irq_resched_count);
-	kvm_set_cpu_l1tf_flush_l1d();
 
 	if (trace_resched_ipi_enabled()) {
 		/*
@@ -244,24 +242,22 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
 	scheduler_ipi();
 }
 
-__visible void __irq_entry smp_call_function_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_call_function)
 {
-	ipi_entering_ack_irq();
+	ack_APIC_irq();
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	inc_irq_stat(irq_call_count);
 	generic_smp_call_function_interrupt();
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
-	exiting_irq();
 }
 
-__visible void __irq_entry smp_call_function_single_interrupt(struct pt_regs *r)
+DEFINE_IDTENTRY_SYSVEC(sysvec_call_function_single)
 {
-	ipi_entering_ack_irq();
+	ack_APIC_irq();
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	inc_irq_stat(irq_call_count);
 	generic_smp_call_function_single_interrupt();
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
-	exiting_irq();
 }
 
 static int __init nonmi_ipi_setup(char *str)



* [patch V6 26/37] x86/entry: Convert various system vectors
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (24 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 25/37] x86/entry: Convert SMP system vectors " Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:30   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 27/37] x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC Thomas Gleixner
                   ` (13 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert various system vectors to IDTENTRY_SYSVEC
  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64bit
  - Remove the BUILD_INTERRUPT entries in 32bit
  - Remove the old prototypes

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c1a47d37f29e..a8eb5c262190 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -956,9 +956,6 @@ apicinterrupt3 \num \sym \do_sym
 POP_SECTION_IRQENTRY
 .endm
 
-#ifdef CONFIG_X86_UV
-apicinterrupt3 UV_BAU_MESSAGE			uv_bau_message_intr1		uv_bau_message_interrupt
-#endif
 
 #ifdef CONFIG_HAVE_KVM
 apicinterrupt3 POSTED_INTR_VECTOR		kvm_posted_intr_ipi		smp_kvm_posted_intr_ipi
@@ -966,26 +963,10 @@ apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR	kvm_posted_intr_wakeup_ipi	smp_kvm_post
 apicinterrupt3 POSTED_INTR_NESTED_VECTOR	kvm_posted_intr_nested_ipi	smp_kvm_posted_intr_nested_ipi
 #endif
 
-#ifdef CONFIG_X86_MCE_THRESHOLD
-apicinterrupt THRESHOLD_APIC_VECTOR		threshold_interrupt		smp_threshold_interrupt
-#endif
-
-#ifdef CONFIG_X86_MCE_AMD
-apicinterrupt DEFERRED_ERROR_VECTOR		deferred_error_interrupt	smp_deferred_error_interrupt
-#endif
-
-#ifdef CONFIG_X86_THERMAL_VECTOR
-apicinterrupt THERMAL_APIC_VECTOR		thermal_interrupt		smp_thermal_interrupt
-#endif
-
 #ifdef CONFIG_SMP
 apicinterrupt RESCHEDULE_VECTOR			reschedule_interrupt		smp_reschedule_interrupt
 #endif
 
-#ifdef CONFIG_IRQ_WORK
-apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
-#endif
-
 /*
  * Reload gs selector with exception handling
  * edi:  new selector
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 19e94af9cc5d..a5416865b6fa 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -534,24 +534,11 @@ static inline void entering_ack_irq(void)
 	ack_APIC_irq();
 }
 
-static inline void ipi_entering_ack_irq(void)
-{
-	irq_enter();
-	ack_APIC_irq();
-	kvm_set_cpu_l1tf_flush_l1d();
-}
-
 static inline void exiting_irq(void)
 {
 	irq_exit();
 }
 
-static inline void exiting_ack_irq(void)
-{
-	ack_APIC_irq();
-	irq_exit();
-}
-
 extern void ioapic_zap_locks(void);
 
 #endif /* _ASM_X86_APIC_H */
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 2e2055bcfeb2..69a5320a4673 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -20,28 +20,3 @@ BUILD_INTERRUPT(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR)
 BUILD_INTERRUPT(kvm_posted_intr_nested_ipi, POSTED_INTR_NESTED_VECTOR)
 #endif
 
-/*
- * every pentium local APIC has two 'local interrupts', with a
- * soft-definable vector attached to both interrupts, one of
- * which is a timer interrupt, the other one is error counter
- * overflow. Linux uses the local APIC timer interrupt to get
- * a much simpler SMP time architecture:
- */
-#ifdef CONFIG_X86_LOCAL_APIC
-
-#ifdef CONFIG_IRQ_WORK
-BUILD_INTERRUPT(irq_work_interrupt, IRQ_WORK_VECTOR)
-#endif
-
-#ifdef CONFIG_X86_THERMAL_VECTOR
-BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
-#endif
-
-#ifdef CONFIG_X86_MCE_THRESHOLD
-BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
-#endif
-
-#ifdef CONFIG_X86_MCE_AMD
-BUILD_INTERRUPT(deferred_error_interrupt, DEFERRED_ERROR_VECTOR)
-#endif
-#endif
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 36a38695f27f..7281c7e3a0f6 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -32,15 +32,9 @@
 extern asmlinkage void kvm_posted_intr_ipi(void);
 extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void kvm_posted_intr_nested_ipi(void);
-extern asmlinkage void irq_work_interrupt(void);
-extern asmlinkage void uv_bau_message_intr1(void);
 
-extern asmlinkage void thermal_interrupt(void);
 extern asmlinkage void reschedule_interrupt(void);
 
-extern asmlinkage void threshold_interrupt(void);
-extern asmlinkage void deferred_error_interrupt(void);
-
 #ifdef	CONFIG_X86_LOCAL_APIC
 struct irq_data;
 struct pci_dev;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 20e47b8d4024..1d9d90c85218 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -582,6 +582,28 @@ DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
 #endif
 
+#ifdef CONFIG_X86_LOCAL_APIC
+# ifdef CONFIG_X86_UV
+DECLARE_IDTENTRY_SYSVEC(UV_BAU_MESSAGE,			sysvec_uv_bau_message);
+# endif
+
+# ifdef CONFIG_X86_MCE_THRESHOLD
+DECLARE_IDTENTRY_SYSVEC(THRESHOLD_APIC_VECTOR,		sysvec_threshold);
+# endif
+
+# ifdef CONFIG_X86_MCE_AMD
+DECLARE_IDTENTRY_SYSVEC(DEFERRED_ERROR_VECTOR,		sysvec_deferred_error);
+# endif
+
+# ifdef CONFIG_X86_THERMAL_VECTOR
+DECLARE_IDTENTRY_SYSVEC(THERMAL_APIC_VECTOR,		sysvec_thermal);
+# endif
+
+# ifdef CONFIG_IRQ_WORK
+DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
+# endif
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/irq_work.h b/arch/x86/include/asm/irq_work.h
index 80b35e3adf03..800ffce0db29 100644
--- a/arch/x86/include/asm/irq_work.h
+++ b/arch/x86/include/asm/irq_work.h
@@ -10,7 +10,6 @@ static inline bool arch_irq_work_has_interrupt(void)
 	return boot_cpu_has(X86_FEATURE_APIC);
 }
 extern void arch_irq_work_raise(void);
-extern __visible void smp_irq_work_interrupt(struct pt_regs *regs);
 #else
 static inline bool arch_irq_work_has_interrupt(void)
 {
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 0c40f37f8cb7..714b1a30e7b0 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -34,11 +34,6 @@ static inline int get_si_code(unsigned long condition)
 extern int panic_on_unrecovered_nmi;
 
 void math_emulate(struct math_emu_info *);
-#ifndef CONFIG_X86_32
-asmlinkage void smp_thermal_interrupt(struct pt_regs *regs);
-asmlinkage void smp_threshold_interrupt(struct pt_regs *regs);
-asmlinkage void smp_deferred_error_interrupt(struct pt_regs *regs);
-#endif
 
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(const char *message,
diff --git a/arch/x86/include/asm/uv/uv_bau.h b/arch/x86/include/asm/uv/uv_bau.h
index 13687bf0e0a9..f1188bd47658 100644
--- a/arch/x86/include/asm/uv/uv_bau.h
+++ b/arch/x86/include/asm/uv/uv_bau.h
@@ -12,6 +12,8 @@
 #define _ASM_X86_UV_UV_BAU_H
 
 #include <linux/bitmap.h>
+#include <asm/idtentry.h>
+
 #define BITSPERBYTE 8
 
 /*
@@ -799,12 +801,6 @@ static inline void bau_cpubits_clear(struct bau_local_cpumask *dstp, int nbits)
 	bitmap_zero(&dstp->bits, nbits);
 }
 
-extern void uv_bau_message_intr1(void);
-#ifdef CONFIG_TRACING
-#define trace_uv_bau_message_intr1 uv_bau_message_intr1
-#endif
-extern void uv_bau_timeout_intr1(void);
-
 struct atomic_short {
 	short counter;
 };
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 52de616a8065..a906d68a18a2 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -907,14 +907,13 @@ static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
 	mce_log(&m);
 }
 
-asmlinkage __visible void __irq_entry smp_deferred_error_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_deferred_error)
 {
-	entering_irq();
 	trace_deferred_error_apic_entry(DEFERRED_ERROR_VECTOR);
 	inc_irq_stat(irq_deferred_error_count);
 	deferred_error_int_vector();
 	trace_deferred_error_apic_exit(DEFERRED_ERROR_VECTOR);
-	exiting_ack_irq();
+	ack_APIC_irq();
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index f36dc0742085..a7cd2d203ced 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -614,14 +614,13 @@ static void unexpected_thermal_interrupt(void)
 
 static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;
 
-asmlinkage __visible void __irq_entry smp_thermal_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_thermal)
 {
-	entering_irq();
 	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
 	inc_irq_stat(irq_thermal_count);
 	smp_thermal_vector();
 	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
-	exiting_ack_irq();
+	ack_APIC_irq();
 }
 
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index 28812cc15300..6a059a035021 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -21,12 +21,11 @@ static void default_threshold_interrupt(void)
 
 void (*mce_threshold_vector)(void) = default_threshold_interrupt;
 
-asmlinkage __visible void __irq_entry smp_threshold_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
 {
-	entering_irq();
 	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
 	inc_irq_stat(irq_threshold_count);
 	mce_threshold_vector();
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
-	exiting_ack_irq();
+	ack_APIC_irq();
 }
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 95abcd52e5be..a986e15becc0 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -117,33 +117,33 @@ static const __initconst struct idt_data apic_idts[] = {
 #endif
 
 #ifdef CONFIG_X86_THERMAL_VECTOR
-	INTG(THERMAL_APIC_VECTOR,	thermal_interrupt),
+	INTG(THERMAL_APIC_VECTOR,		asm_sysvec_thermal),
 #endif
 
 #ifdef CONFIG_X86_MCE_THRESHOLD
-	INTG(THRESHOLD_APIC_VECTOR,	threshold_interrupt),
+	INTG(THRESHOLD_APIC_VECTOR,		asm_sysvec_threshold),
 #endif
 
 #ifdef CONFIG_X86_MCE_AMD
-	INTG(DEFERRED_ERROR_VECTOR,	deferred_error_interrupt),
+	INTG(DEFERRED_ERROR_VECTOR,		asm_sysvec_deferred_error),
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
-	INTG(LOCAL_TIMER_VECTOR,	asm_sysvec_apic_timer_interrupt),
-	INTG(X86_PLATFORM_IPI_VECTOR,	asm_sysvec_x86_platform_ipi),
+	INTG(LOCAL_TIMER_VECTOR,		asm_sysvec_apic_timer_interrupt),
+	INTG(X86_PLATFORM_IPI_VECTOR,		asm_sysvec_x86_platform_ipi),
 # ifdef CONFIG_HAVE_KVM
-	INTG(POSTED_INTR_VECTOR,	kvm_posted_intr_ipi),
-	INTG(POSTED_INTR_WAKEUP_VECTOR, kvm_posted_intr_wakeup_ipi),
-	INTG(POSTED_INTR_NESTED_VECTOR, kvm_posted_intr_nested_ipi),
+	INTG(POSTED_INTR_VECTOR,		kvm_posted_intr_ipi),
+	INTG(POSTED_INTR_WAKEUP_VECTOR,		kvm_posted_intr_wakeup_ipi),
+	INTG(POSTED_INTR_NESTED_VECTOR,		kvm_posted_intr_nested_ipi),
 # endif
 # ifdef CONFIG_IRQ_WORK
-	INTG(IRQ_WORK_VECTOR,		irq_work_interrupt),
+	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
 # endif
-#ifdef CONFIG_X86_UV
-	INTG(UV_BAU_MESSAGE,		uv_bau_message_intr1),
-#endif
-	INTG(SPURIOUS_APIC_VECTOR,	asm_sysvec_spurious_apic_interrupt),
-	INTG(ERROR_APIC_VECTOR,		asm_sysvec_error_interrupt),
+# ifdef CONFIG_X86_UV
+	INTG(UV_BAU_MESSAGE,			asm_sysvec_uv_bau_message),
+# endif
+	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
+	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
 #endif
 };
 
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 80bee7695a20..890d4778cd35 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -9,18 +9,18 @@
 #include <linux/irq_work.h>
 #include <linux/hardirq.h>
 #include <asm/apic.h>
+#include <asm/idtentry.h>
 #include <asm/trace/irq_vectors.h>
 #include <linux/interrupt.h>
 
 #ifdef CONFIG_X86_LOCAL_APIC
-__visible void __irq_entry smp_irq_work_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_irq_work)
 {
-	ipi_entering_ack_irq();
+	ack_APIC_irq();
 	trace_irq_work_entry(IRQ_WORK_VECTOR);
 	inc_irq_stat(apic_irq_work_irqs);
 	irq_work_run();
 	trace_irq_work_exit(IRQ_WORK_VECTOR);
-	exiting_irq();
 }
 
 void arch_irq_work_raise(void)
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index 1fd321f37f1b..b02e406a496d 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1272,7 +1272,7 @@ static void process_uv2_message(struct msg_desc *mdp, struct bau_control *bcp)
  * (the resource will not be freed until noninterruptable cpus see this
  *  interrupt; hardware may timeout the s/w ack and reply ERROR)
  */
-void uv_bau_message_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_uv_bau_message)
 {
 	int count = 0;
 	cycles_t time_start;



* [patch V6 27/37] x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (25 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 26/37] x86/entry: Convert various system vectors Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:30   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 28/37] x86/entry: Convert various hypervisor " Thomas Gleixner
                   ` (12 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert KVM specific system vectors to IDTENTRY_SYSVEC*:

The two empty stub handlers which only increment the stats counter do not
need to run on the interrupt stack. Use IDTENTRY_SYSVEC_SIMPLE for them.

The wakeup handler does more work and runs on the interrupt stack.

None of these handlers need to save and restore the irq_regs pointer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>


diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a8eb5c262190..b032d32f3657 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -956,13 +956,6 @@ apicinterrupt3 \num \sym \do_sym
 POP_SECTION_IRQENTRY
 .endm
 
-
-#ifdef CONFIG_HAVE_KVM
-apicinterrupt3 POSTED_INTR_VECTOR		kvm_posted_intr_ipi		smp_kvm_posted_intr_ipi
-apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR	kvm_posted_intr_wakeup_ipi	smp_kvm_posted_intr_wakeup_ipi
-apicinterrupt3 POSTED_INTR_NESTED_VECTOR	kvm_posted_intr_nested_ipi	smp_kvm_posted_intr_nested_ipi
-#endif
-
 #ifdef CONFIG_SMP
 apicinterrupt RESCHEDULE_VECTOR			reschedule_interrupt		smp_reschedule_interrupt
 #endif
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 69a5320a4673..a01bb74244ac 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -13,10 +13,3 @@
 #ifdef CONFIG_SMP
 BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
 #endif
-
-#ifdef CONFIG_HAVE_KVM
-BUILD_INTERRUPT(kvm_posted_intr_ipi, POSTED_INTR_VECTOR)
-BUILD_INTERRUPT(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR)
-BUILD_INTERRUPT(kvm_posted_intr_nested_ipi, POSTED_INTR_NESTED_VECTOR)
-#endif
-
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 7281c7e3a0f6..fd5e7c8825e1 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,10 +29,6 @@
 #include <asm/sections.h>
 
 /* Interrupt handlers registered during init_IRQ */
-extern asmlinkage void kvm_posted_intr_ipi(void);
-extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
-extern asmlinkage void kvm_posted_intr_nested_ipi(void);
-
 extern asmlinkage void reschedule_interrupt(void);
 
 #ifdef	CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1d9d90c85218..98b343ea675b 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -604,6 +604,12 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
 # endif
 #endif
 
+#ifdef CONFIG_HAVE_KVM
+DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
+DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
+DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 112c85673179..f5746cf0c3bc 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -26,9 +26,6 @@ extern void fixup_irqs(void);
 
 #ifdef CONFIG_HAVE_KVM
 extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
-extern __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs);
-extern __visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs);
-extern __visible void smp_kvm_posted_intr_nested_ipi(struct pt_regs *regs);
 #endif
 
 extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a986e15becc0..4ae0dd2773e3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -132,9 +132,9 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(LOCAL_TIMER_VECTOR,		asm_sysvec_apic_timer_interrupt),
 	INTG(X86_PLATFORM_IPI_VECTOR,		asm_sysvec_x86_platform_ipi),
 # ifdef CONFIG_HAVE_KVM
-	INTG(POSTED_INTR_VECTOR,		kvm_posted_intr_ipi),
-	INTG(POSTED_INTR_WAKEUP_VECTOR,		kvm_posted_intr_wakeup_ipi),
-	INTG(POSTED_INTR_NESTED_VECTOR,		kvm_posted_intr_nested_ipi),
+	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
+	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
+	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
 # endif
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 43f95ec5f131..c92c0d227342 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -290,41 +290,29 @@ EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
  */
-__visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_ipi)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
-
-	entering_ack_irq();
+	ack_APIC_irq();
 	inc_irq_stat(kvm_posted_intr_ipis);
-	exiting_irq();
-	set_irq_regs(old_regs);
 }
 
 /*
  * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
  */
-__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_posted_intr_wakeup_ipi)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
-
-	entering_ack_irq();
+	ack_APIC_irq();
 	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
 	kvm_posted_intr_wakeup_handler();
-	exiting_irq();
-	set_irq_regs(old_regs);
 }
 
 /*
  * Handler for POSTED_INTERRUPT_NESTED_VECTOR.
  */
-__visible void smp_kvm_posted_intr_nested_ipi(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
-
-	entering_ack_irq();
+	ack_APIC_irq();
 	inc_irq_stat(kvm_posted_intr_nested_ipis);
-	exiting_irq();
-	set_irq_regs(old_regs);
 }
 #endif
 



* [patch V6 28/37] x86/entry: Convert various hypervisor vectors to IDTENTRY_SYSVEC
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (26 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 27/37] x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:31   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 29/37] x86/entry: Convert XEN hypercall vector " Thomas Gleixner
                   ` (11 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert various hypervisor vectors to IDTENTRY_SYSVEC
  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64bit
  - Remove the BUILD_INTERRUPT entries in 32bit
  - Remove the old prototypes

No functional change.
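
For reference, the conversion pattern used throughout this patch boils down
to the sketch below. EXAMPLE_VECTOR, sysvec_example and example_handler are
placeholder names for illustration only, not code from this series, and
details such as irq_regs handling are elided:

    /* asm/idtentry.h: also emits the asm_sysvec_example ASM stub */
    DECLARE_IDTENTRY_SYSVEC(EXAMPLE_VECTOR, sysvec_example);

    static void (*example_handler)(void);   /* placeholder callback */

    /* C entry point; the irq entry/exit handling is done by the macro */
    DEFINE_IDTENTRY_SYSVEC(sysvec_example)
    {
            ack_APIC_irq();
            inc_irq_stat(irq_hv_callback_count);
            if (example_handler)
                    example_handler();
    }

    /* In the platform init code: register the emitted ASM stub, not the C function */
    alloc_intr_gate(EXAMPLE_VECTOR, asm_sysvec_example);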

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Jason Chen CJ <jason.cj.chen@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>


diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 1db655409dbf..9f3e4e82708f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1342,20 +1342,6 @@ BUILD_INTERRUPT3(xen_hvm_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
 		 xen_evtchn_do_upcall)
 #endif
 
-
-#if IS_ENABLED(CONFIG_HYPERV)
-
-BUILD_INTERRUPT3(hyperv_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
-		 hyperv_vector_handler)
-
-BUILD_INTERRUPT3(hyperv_reenlightenment_vector, HYPERV_REENLIGHTENMENT_VECTOR,
-		 hyperv_reenlightenment_intr)
-
-BUILD_INTERRUPT3(hv_stimer0_callback_vector, HYPERV_STIMER0_VECTOR,
-		 hv_stimer0_vector_handler)
-
-#endif /* CONFIG_HYPERV */
-
 SYM_CODE_START_LOCAL_NOALIGN(handle_exception)
 	/* the function address is in %gs's slot on the stack */
 	SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index b032d32f3657..ad35c6e298a6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1116,23 +1116,6 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 	xen_hvm_callback_vector xen_evtchn_do_upcall
 #endif
 
-
-#if IS_ENABLED(CONFIG_HYPERV)
-apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
-	hyperv_callback_vector hyperv_vector_handler
-
-apicinterrupt3 HYPERV_REENLIGHTENMENT_VECTOR \
-	hyperv_reenlightenment_vector hyperv_reenlightenment_intr
-
-apicinterrupt3 HYPERV_STIMER0_VECTOR \
-	hv_stimer0_callback_vector hv_stimer0_vector_handler
-#endif /* CONFIG_HYPERV */
-
-#if IS_ENABLED(CONFIG_ACRN_GUEST)
-apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
-	acrn_hv_callback_vector acrn_hv_vector_handler
-#endif
-
 /*
  * Save all registers in pt_regs, and switch gs if needed.
  * Use slow, but surefire "are we in kernel?" check.
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index fd51bac11b46..75025a2b06e9 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -15,6 +15,7 @@
 #include <asm/hypervisor.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
+#include <asm/idtentry.h>
 #include <linux/version.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
@@ -153,15 +154,11 @@ static inline bool hv_reenlightenment_available(void)
 		ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT;
 }
 
-__visible void __irq_entry hyperv_reenlightenment_intr(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_reenlightenment)
 {
-	entering_ack_irq();
-
+	ack_APIC_irq();
 	inc_irq_stat(irq_hv_reenlightenment_count);
-
 	schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
-
-	exiting_irq();
 }
 
 void set_hv_tscchange_cb(void (*cb)(void))
diff --git a/arch/x86/include/asm/acrn.h b/arch/x86/include/asm/acrn.h
deleted file mode 100644
index 4adb13f08af7..000000000000
--- a/arch/x86/include/asm/acrn.h
+++ /dev/null
@@ -1,11 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_ACRN_H
-#define _ASM_X86_ACRN_H
-
-extern void acrn_hv_callback_vector(void);
-#ifdef CONFIG_TRACING
-#define trace_acrn_hv_callback_vector acrn_hv_callback_vector
-#endif
-
-extern void acrn_hv_vector_handler(struct pt_regs *regs);
-#endif /* _ASM_X86_ACRN_H */
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index a5416865b6fa..2cc44e957c31 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -519,26 +519,6 @@ static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
 static inline void apic_smt_update(void) { }
 #endif
 
-extern void irq_enter(void);
-extern void irq_exit(void);
-
-static inline void entering_irq(void)
-{
-	irq_enter();
-	kvm_set_cpu_l1tf_flush_l1d();
-}
-
-static inline void entering_ack_irq(void)
-{
-	entering_irq();
-	ack_APIC_irq();
-}
-
-static inline void exiting_irq(void)
-{
-	irq_exit();
-}
-
 extern void ioapic_zap_locks(void);
 
 #endif /* _ASM_X86_APIC_H */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 98b343ea675b..b58d629b4948 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -610,6 +610,16 @@ DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
 #endif
 
+#if IS_ENABLED(CONFIG_HYPERV)
+DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_hyperv_callback);
+DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_REENLIGHTENMENT_VECTOR,	sysvec_hyperv_reenlightenment);
+DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_STIMER0_VECTOR,	sysvec_hyperv_stimer0);
+#endif
+
+#if IS_ENABLED(CONFIG_ACRN_GUEST)
+DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_acrn_hv_callback);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index d30805ed323e..60b944dd2df1 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -54,20 +54,8 @@ typedef int (*hyperv_fill_flush_list_func)(
 	vclocks_set_used(VDSO_CLOCKMODE_HVCLOCK);
 #define hv_get_raw_timer() rdtsc_ordered()
 
-void hyperv_callback_vector(void);
-void hyperv_reenlightenment_vector(void);
-#ifdef CONFIG_TRACING
-#define trace_hyperv_callback_vector hyperv_callback_vector
-#endif
 void hyperv_vector_handler(struct pt_regs *regs);
 
-/*
- * Routines for stimer0 Direct Mode handling.
- * On x86/x64, there are no percpu actions to take.
- */
-void hv_stimer0_vector_handler(struct pt_regs *regs);
-void hv_stimer0_callback_vector(void);
-
 static inline void hv_enable_stimer0_percpu_irq(int irq) {}
 static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 
@@ -226,7 +214,6 @@ void hyperv_setup_mmu_ops(void);
 void *hv_alloc_hyperv_page(void);
 void *hv_alloc_hyperv_zeroed_page(void);
 void hv_free_hyperv_page(unsigned long addr);
-void hyperv_reenlightenment_intr(struct pt_regs *regs);
 void set_hv_tscchange_cb(void (*cb)(void));
 void clear_hv_tscchange_cb(void);
 void hyperv_stop_tsc_emulation(void);
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 676022e71791..1da9b1c9a2db 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -10,10 +10,10 @@
  */
 
 #include <linux/interrupt.h>
-#include <asm/acrn.h>
 #include <asm/apic.h>
 #include <asm/desc.h>
 #include <asm/hypervisor.h>
+#include <asm/idtentry.h>
 #include <asm/irq_regs.h>
 
 static uint32_t __init acrn_detect(void)
@@ -24,7 +24,7 @@ static uint32_t __init acrn_detect(void)
 static void __init acrn_init_platform(void)
 {
 	/* Setup the IDT for ACRN hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, acrn_hv_callback_vector);
+	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_acrn_hv_callback);
 }
 
 static bool acrn_x2apic_available(void)
@@ -39,7 +39,7 @@ static bool acrn_x2apic_available(void)
 
 static void (*acrn_intr_handler)(void);
 
-__visible void __irq_entry acrn_hv_vector_handler(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_acrn_hv_callback)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
@@ -50,13 +50,12 @@ __visible void __irq_entry acrn_hv_vector_handler(struct pt_regs *regs)
 	 * will block the interrupt whose vector is lower than
 	 * HYPERVISOR_CALLBACK_VECTOR.
 	 */
-	entering_ack_irq();
+	ack_APIC_irq();
 	inc_irq_stat(irq_hv_callback_count);
 
 	if (acrn_intr_handler)
 		acrn_intr_handler();
 
-	exiting_irq();
 	set_irq_regs(old_regs);
 }
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index ebf34c7bc8bc..a103e1c0b90e 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -23,6 +23,7 @@
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
 #include <asm/desc.h>
+#include <asm/idtentry.h>
 #include <asm/irq_regs.h>
 #include <asm/i8259.h>
 #include <asm/apic.h>
@@ -40,11 +41,10 @@ static void (*hv_stimer0_handler)(void);
 static void (*hv_kexec_handler)(void);
 static void (*hv_crash_handler)(struct pt_regs *regs);
 
-__visible void __irq_entry hyperv_vector_handler(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_irq();
 	inc_irq_stat(irq_hv_callback_count);
 	if (vmbus_handler)
 		vmbus_handler();
@@ -52,7 +52,6 @@ __visible void __irq_entry hyperv_vector_handler(struct pt_regs *regs)
 	if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
 		ack_APIC_irq();
 
-	exiting_irq();
 	set_irq_regs(old_regs);
 }
 
@@ -73,19 +72,16 @@ EXPORT_SYMBOL_GPL(hv_remove_vmbus_irq);
  * Routines to do per-architecture handling of stimer0
  * interrupts when in Direct Mode
  */
-
-__visible void __irq_entry hv_stimer0_vector_handler(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_stimer0)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_irq();
 	inc_irq_stat(hyperv_stimer0_count);
 	if (hv_stimer0_handler)
 		hv_stimer0_handler();
 	add_interrupt_randomness(HYPERV_STIMER0_VECTOR, 0);
 	ack_APIC_irq();
 
-	exiting_irq();
 	set_irq_regs(old_regs);
 }
 
@@ -331,17 +327,19 @@ static void __init ms_hyperv_init_platform(void)
 	x86_platform.apic_post_init = hyperv_init;
 	hyperv_setup_mmu_ops();
 	/* Setup the IDT for hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, hyperv_callback_vector);
+	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_hyperv_callback);
 
 	/* Setup the IDT for reenlightenment notifications */
-	if (ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT)
+	if (ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT) {
 		alloc_intr_gate(HYPERV_REENLIGHTENMENT_VECTOR,
-				hyperv_reenlightenment_vector);
+				asm_sysvec_hyperv_reenlightenment);
+	}
 
 	/* Setup the IDT for stimer0 */
-	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
+	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE) {
 		alloc_intr_gate(HYPERV_STIMER0_VECTOR,
-				hv_stimer0_callback_vector);
+				asm_sysvec_hyperv_stimer0);
+	}
 
 # ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = hv_smp_prepare_boot_cpu;



* [patch V6 29/37] x86/entry: Convert XEN hypercall vector to IDTENTRY_SYSVEC
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (27 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 28/37] x86/entry: Convert various hypervisor " Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:31   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW Thomas Gleixner
                   ` (10 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Convert the last old-style defined vector to IDTENTRY_SYSVEC
  - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
  - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
  - Remove the ASM idtentries in 64bit
  - Remove the BUILD_INTERRUPT entries in 32bit
  - Remove the old prototypes

Fixup the related XEN code by providing the primary C entry point in x86 to
avoid cluttering the generic code with X86'isms.

No functional change.
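
The resulting delivery path is roughly the following, using only names that
appear in this patch (the ASM stub is emitted by the DECLARE_IDTENTRY_SYSVEC
declaration):

    /*
     * HYPERVISOR_CALLBACK_VECTOR
     *   -> asm_sysvec_xen_hvm_callback      emitted ASM stub
     *   -> sysvec_xen_hvm_callback()        x86 wrapper: irq stats, irq_regs
     *   -> xen_hvm_evtchn_do_upcall()       generic Xen event channel code
     */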

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>

---
 arch/x86/entry/entry_32.S        |    5 -----
 arch/x86/entry/entry_64.S        |    5 -----
 arch/x86/include/asm/idtentry.h  |    4 ++++
 arch/x86/xen/enlighten_hvm.c     |   12 ++++++++++++
 drivers/xen/events/events_base.c |    6 ++----
 include/xen/events.h             |    7 -------
 6 files changed, 18 insertions(+), 21 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1337,11 +1337,6 @@ SYM_FUNC_START(xen_failsafe_callback)
 SYM_FUNC_END(xen_failsafe_callback)
 #endif /* CONFIG_XEN_PV */
 
-#ifdef CONFIG_XEN_PVHVM
-BUILD_INTERRUPT3(xen_hvm_callback_vector, HYPERVISOR_CALLBACK_VECTOR,
-		 xen_evtchn_do_upcall)
-#endif
-
 SYM_CODE_START_LOCAL_NOALIGN(handle_exception)
 	/* the function address is in %gs's slot on the stack */
 	SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1111,11 +1111,6 @@ SYM_CODE_START(xen_failsafe_callback)
 SYM_CODE_END(xen_failsafe_callback)
 #endif /* CONFIG_XEN_PV */
 
-#ifdef CONFIG_XEN_PVHVM
-apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
-	xen_hvm_callback_vector xen_evtchn_do_upcall
-#endif
-
 /*
  * Save all registers in pt_regs, and switch gs if needed.
  * Use slow, but surefire "are we in kernel?" check.
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -620,6 +620,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_STIME
 DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_acrn_hv_callback);
 #endif
 
+#ifdef CONFIG_XEN_PVHVM
+DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_xen_hvm_callback);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -13,6 +13,7 @@
 #include <asm/smp.h>
 #include <asm/reboot.h>
 #include <asm/setup.h>
+#include <asm/idtentry.h>
 #include <asm/hypervisor.h>
 #include <asm/e820/api.h>
 #include <asm/early_ioremap.h>
@@ -118,6 +119,17 @@ static void __init init_hvm_pv_info(void
 		this_cpu_write(xen_vcpu_id, smp_processor_id());
 }
 
+DEFINE_IDTENTRY_SYSVEC(sysvec_xen_hvm_callback)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	inc_irq_stat(irq_hv_callback_count);
+
+	xen_hvm_evtchn_do_upcall();
+
+	set_irq_regs(old_regs);
+}
+
 #ifdef CONFIG_KEXEC_CORE
 static void xen_hvm_shutdown(void)
 {
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -37,6 +37,7 @@
 #ifdef CONFIG_X86
 #include <asm/desc.h>
 #include <asm/ptrace.h>
+#include <asm/idtentry.h>
 #include <asm/irq.h>
 #include <asm/io_apic.h>
 #include <asm/i8259.h>
@@ -1236,9 +1237,6 @@ void xen_evtchn_do_upcall(struct pt_regs
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	irq_enter();
-#ifdef CONFIG_X86
-	inc_irq_stat(irq_hv_callback_count);
-#endif
 
 	__xen_evtchn_do_upcall();
 
@@ -1658,7 +1656,7 @@ static __init void xen_alloc_callback_ve
 		return;
 
 	pr_info("Xen HVM callback vector for event delivery is enabled\n");
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, xen_hvm_callback_vector);
+	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_xen_hvm_callback);
 }
 #else
 void xen_setup_callback_vector(void) {}
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -90,13 +90,6 @@ unsigned int irq_from_evtchn(evtchn_port
 int irq_from_virq(unsigned int cpu, unsigned int virq);
 evtchn_port_t evtchn_from_irq(unsigned irq);
 
-#ifdef CONFIG_XEN_PVHVM
-/* Xen HVM evtchn vector callback */
-void xen_hvm_callback_vector(void);
-#ifdef CONFIG_TRACING
-#define trace_xen_hvm_callback_vector xen_hvm_callback_vector
-#endif
-#endif
 int xen_set_callback_via(uint64_t via);
 void xen_evtchn_do_upcall(struct pt_regs *regs);
 void xen_hvm_evtchn_do_upcall(void);



* [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (28 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 29/37] x86/entry: Convert XEN hypercall vector " Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-19 23:57   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 31/37] x86/entry: Remove the apic/BUILD interrupt leftovers Thomas Gleixner
                   ` (9 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


The scheduler IPI does not need the full interrupt entry handling logic
when the entry is from kernel mode.

Even if tracing is enabled, the only requirement is that RCU is watching and
preempt_count has the hardirq bit set.

The NOHZ tick state does not have to be adjusted. If the tick is not
running then the CPU is idle and the idle exit will restore the
tick. Soft interrupts are not raised here, so handling them on return is not
required either.

User mode entry must go through the regular entry path as it will invoke
the scheduler on return, so context tracking needs to be in the correct
state.

Use IDTENTRY_RAW and the RCU conditional variants of idtentry_enter/exit()
to guarantee that RCU is watching even if the IPI hits an RCU idle section.

Remove the tracepoint static key conditional, which is incomplete
vs. tracing anyway because e.g. ack_APIC_irq() calls out into
instrumentable code.

Avoid the overhead of irq time accounting and introduce variants of
__irq_enter/exit() so instrumentation observes the correct preempt count
state.

Spare the switch to the interrupt stack as the IPI is only going to use
a minimal amount of stack space.
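
Put together, the kernel mode path of the resulting entry code is roughly the
sketch below. This is illustrative only, not the real macro expansion, and
the helper names merely stand in for the RCU conditional idtentry_enter/exit()
variants and the minimal __irq_enter/exit() variants mentioned above:

    /* Sketch of the kernel mode entry sequence for the reschedule IPI */
    {
            bool rcu_exit = idtentry_enter_cond_rcu(regs);  /* RCU is watching */

            instrumentation_begin();
            __irq_enter_raw();              /* hardirq bit only, no irq time accounting */
            __sysvec_reschedule_ipi(regs);  /* ack, count, scheduler_ipi(), tracepoints */
            __irq_exit_raw();               /* no softirq processing, no tick work */
            instrumentation_end();

            idtentry_exit_cond_rcu(regs, rcu_exit);
    }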

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 943ffd64363a..38dc4d1f7a7b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -956,10 +956,6 @@ apicinterrupt3 \num \sym \do_sym
 POP_SECTION_IRQENTRY
 .endm
 
-#ifdef CONFIG_SMP
-apicinterrupt RESCHEDULE_VECTOR			reschedule_interrupt		smp_reschedule_interrupt
-#endif
-
 /*
  * Reload gs selector with exception handling
  * edi:  new selector
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index a01bb74244ac..3e841ed5c17a 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -10,6 +10,3 @@
  * is no hardware IRQ pin equivalent for them, they are triggered
  * through the ICC by us (IPIs)
  */
-#ifdef CONFIG_SMP
-BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
-#endif
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index fd5e7c8825e1..74c12437401e 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -28,9 +28,6 @@
 #include <asm/irq.h>
 #include <asm/sections.h>
 
-/* Interrupt handlers registered during init_IRQ */
-extern asmlinkage void reschedule_interrupt(void);
-
 #ifdef	CONFIG_X86_LOCAL_APIC
 struct irq_data;
 struct pci_dev;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1bedae4f297a..a380303089cd 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -576,6 +576,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
 #endif
 
 #ifdef CONFIG_SMP
+DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
 DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,	sysvec_irq_move_cleanup);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
diff --git a/arch/x86/include/asm/trace/common.h b/arch/x86/include/asm/trace/common.h
index 57c8da027d99..f0f9bcdb74d9 100644
--- a/arch/x86/include/asm/trace/common.h
+++ b/arch/x86/include/asm/trace/common.h
@@ -5,12 +5,8 @@
 DECLARE_STATIC_KEY_FALSE(trace_pagefault_key);
 #define trace_pagefault_enabled()			\
 	static_branch_unlikely(&trace_pagefault_key)
-DECLARE_STATIC_KEY_FALSE(trace_resched_ipi_key);
-#define trace_resched_ipi_enabled()			\
-	static_branch_unlikely(&trace_resched_ipi_key)
 #else
 static inline bool trace_pagefault_enabled(void) { return false; }
-static inline bool trace_resched_ipi_enabled(void) { return false; }
 #endif
 
 #endif
diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h
index 33b9d0f0aafe..88e7f0f3bf62 100644
--- a/arch/x86/include/asm/trace/irq_vectors.h
+++ b/arch/x86/include/asm/trace/irq_vectors.h
@@ -10,9 +10,6 @@
 
 #ifdef CONFIG_X86_LOCAL_APIC
 
-extern int trace_resched_ipi_reg(void);
-extern void trace_resched_ipi_unreg(void);
-
 DECLARE_EVENT_CLASS(x86_irq_vector,
 
 	TP_PROTO(int vector),
@@ -37,18 +34,6 @@ DEFINE_EVENT_FN(x86_irq_vector, name##_exit,	\
 	TP_PROTO(int vector),			\
 	TP_ARGS(vector), NULL, NULL);
 
-#define DEFINE_RESCHED_IPI_EVENT(name)		\
-DEFINE_EVENT_FN(x86_irq_vector, name##_entry,	\
-	TP_PROTO(int vector),			\
-	TP_ARGS(vector),			\
-	trace_resched_ipi_reg,			\
-	trace_resched_ipi_unreg);		\
-DEFINE_EVENT_FN(x86_irq_vector, name##_exit,	\
-	TP_PROTO(int vector),			\
-	TP_ARGS(vector),			\
-	trace_resched_ipi_reg,			\
-	trace_resched_ipi_unreg);
-
 /*
  * local_timer - called when entering/exiting a local timer interrupt
  * vector handler
@@ -99,7 +84,7 @@ TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? -EPERM : 0);
 /*
  * reschedule - called when entering/exiting a reschedule vector handler
  */
-DEFINE_RESCHED_IPI_EVENT(reschedule);
+DEFINE_IRQ_VECTOR_EVENT(reschedule);
 
 /*
  * call_function - called when entering/exiting a call function interrupt
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 4ae0dd2773e3..eab476979697 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -109,7 +109,7 @@ static const __initconst struct idt_data def_idts[] = {
  */
 static const __initconst struct idt_data apic_idts[] = {
 #ifdef CONFIG_SMP
-	INTG(RESCHEDULE_VECTOR,			reschedule_interrupt),
+	INTG(RESCHEDULE_VECTOR,			asm_sysvec_reschedule_ipi),
 	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
 	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
 	INTG(IRQ_MOVE_CLEANUP_VECTOR,		asm_sysvec_irq_move_cleanup),
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index e5647daa7e96..eff4ce3b10da 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -220,26 +220,15 @@ static void native_stop_other_cpus(int wait)
 
 /*
  * Reschedule call back. KVM uses this interrupt to force a cpu out of
- * guest mode
+ * guest mode.
  */
-__visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_reschedule_ipi)
 {
 	ack_APIC_irq();
+	trace_reschedule_entry(RESCHEDULE_VECTOR);
 	inc_irq_stat(irq_resched_count);
-
-	if (trace_resched_ipi_enabled()) {
-		/*
-		 * scheduler_ipi() might call irq_enter() as well, but
-		 * nested calls are fine.
-		 */
-		irq_enter();
-		trace_reschedule_entry(RESCHEDULE_VECTOR);
-		scheduler_ipi();
-		trace_reschedule_exit(RESCHEDULE_VECTOR);
-		irq_exit();
-		return;
-	}
 	scheduler_ipi();
+	trace_reschedule_exit(RESCHEDULE_VECTOR);
 }
 
 DEFINE_IDTENTRY_SYSVEC(sysvec_call_function)
diff --git a/arch/x86/kernel/tracepoint.c b/arch/x86/kernel/tracepoint.c
index 496748ed266a..fcfc077afe2d 100644
--- a/arch/x86/kernel/tracepoint.c
+++ b/arch/x86/kernel/tracepoint.c
@@ -25,20 +25,3 @@ void trace_pagefault_unreg(void)
 {
 	static_branch_dec(&trace_pagefault_key);
 }
-
-#ifdef CONFIG_SMP
-
-DEFINE_STATIC_KEY_FALSE(trace_resched_ipi_key);
-
-int trace_resched_ipi_reg(void)
-{
-	static_branch_inc(&trace_resched_ipi_key);
-	return 0;
-}
-
-void trace_resched_ipi_unreg(void)
-{
-	static_branch_dec(&trace_resched_ipi_key);
-}
-
-#endif



* [patch V6 31/37] x86/entry: Remove the apic/BUILD interrupt leftovers
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (29 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:32   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 32/37] x86/entry/64: Remove IRQ stack switching ASM Thomas Gleixner
                   ` (8 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Remove all the code which was there to emit the system vector stubs. All
users are gone.

Move the now unused GET_CR2_INTO macro muck to head_64.S where its last
user is. Fix up the eye-hurting comment there while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 98da0d3c0b1a..4208c1e3f601 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -351,23 +351,3 @@ For 32-bit we have the following conventions - kernel is built with
 	call stackleak_erase
 #endif
 .endm
-
-/*
- * This does 'call enter_from_user_mode' unless we can avoid it based on
- * kernel config or using the static jump infrastructure.
- */
-.macro CALL_enter_from_user_mode
-#ifdef CONFIG_CONTEXT_TRACKING
-#ifdef CONFIG_JUMP_LABEL
-	STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_key, def=0
-#endif
-	call enter_from_user_mode
-.Lafter_call_\@:
-#endif
-.endm
-
-#ifdef CONFIG_PARAVIRT_XXL
-#define GET_CR2_INTO(reg) GET_CR2_INTO_AX ; _ASM_MOV %_ASM_AX, reg
-#else
-#define GET_CR2_INTO(reg) _ASM_MOV %cr2, reg
-#endif
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index b84c1c0a2fd8..97b02a776db0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1233,24 +1233,6 @@ SYM_FUNC_END(entry_INT80_32)
 #endif
 .endm
 
-#define BUILD_INTERRUPT3(name, nr, fn)			\
-SYM_FUNC_START(name)					\
-	ASM_CLAC;					\
-	pushl	$~(nr);					\
-	SAVE_ALL switch_stacks=1;			\
-	ENCODE_FRAME_POINTER;				\
-	TRACE_IRQS_OFF					\
-	movl	%esp, %eax;				\
-	call	fn;					\
-	jmp	ret_from_intr;				\
-SYM_FUNC_END(name)
-
-#define BUILD_INTERRUPT(name, nr)		\
-	BUILD_INTERRUPT3(name, nr, smp_##name);	\
-
-/* The include is where all of the SMP etc. interrupts come from */
-#include <asm/entry_arch.h>
-
 #ifdef CONFIG_PARAVIRT
 SYM_CODE_START(native_iret)
 	iret
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 38dc4d1f7a7b..7292525e2557 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -658,108 +658,7 @@ SYM_CODE_END(\asmsym)
  */
 #include <asm/idtentry.h>
 
-/*
- * Interrupt entry helper function.
- *
- * Entry runs with interrupts off. Stack layout at entry:
- * +----------------------------------------------------+
- * | regs->ss						|
- * | regs->rsp						|
- * | regs->eflags					|
- * | regs->cs						|
- * | regs->ip						|
- * +----------------------------------------------------+
- * | regs->orig_ax = ~(interrupt number)		|
- * +----------------------------------------------------+
- * | return address					|
- * +----------------------------------------------------+
- */
-SYM_CODE_START(interrupt_entry)
-	UNWIND_HINT_IRET_REGS offset=16
-	ASM_CLAC
-	cld
-
-	testb	$3, CS-ORIG_RAX+8(%rsp)
-	jz	1f
-	SWAPGS
-	FENCE_SWAPGS_USER_ENTRY
-	/*
-	 * Switch to the thread stack. The IRET frame and orig_ax are
-	 * on the stack, as well as the return address. RDI..R12 are
-	 * not (yet) on the stack and space has not (yet) been
-	 * allocated for them.
-	 */
-	pushq	%rdi
-
-	/* Need to switch before accessing the thread stack. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
-	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
-
-	 /*
-	  * We have RDI, return address, and orig_ax on the stack on
-	  * top of the IRET frame. That means offset=24
-	  */
-	UNWIND_HINT_IRET_REGS base=%rdi offset=24
-
-	pushq	7*8(%rdi)		/* regs->ss */
-	pushq	6*8(%rdi)		/* regs->rsp */
-	pushq	5*8(%rdi)		/* regs->eflags */
-	pushq	4*8(%rdi)		/* regs->cs */
-	pushq	3*8(%rdi)		/* regs->ip */
-	UNWIND_HINT_IRET_REGS
-	pushq	2*8(%rdi)		/* regs->orig_ax */
-	pushq	8(%rdi)			/* return address */
-
-	movq	(%rdi), %rdi
-	jmp	2f
-1:
-	FENCE_SWAPGS_KERNEL_ENTRY
-2:
-	PUSH_AND_CLEAR_REGS save_ret=1
-	ENCODE_FRAME_POINTER 8
-
-	testb	$3, CS+8(%rsp)
-	jz	1f
-
-	/*
-	 * IRQ from user mode.
-	 *
-	 * We need to tell lockdep that IRQs are off.  We can't do this until
-	 * we fix gsbase, and we should do it before enter_from_user_mode
-	 * (which can take locks).  Since TRACE_IRQS_OFF is idempotent,
-	 * the simplest way to handle it is to just call it twice if
-	 * we enter from user mode.  There's no reason to optimize this since
-	 * TRACE_IRQS_OFF is a no-op if lockdep is off.
-	 */
-	TRACE_IRQS_OFF
-
-	CALL_enter_from_user_mode
-
-1:
-	ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
-	/* We entered an interrupt context - irqs are off: */
-	TRACE_IRQS_OFF
-
-	ret
-SYM_CODE_END(interrupt_entry)
-_ASM_NOKPROBE(interrupt_entry)
-
 SYM_CODE_START_LOCAL(common_interrupt_return)
-ret_from_intr:
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-
-	LEAVE_IRQ_STACK
-
-	testb	$3, CS(%rsp)
-	jz	retint_kernel
-
-	/* Interrupt came from user space */
-.Lretint_user:
-	mov	%rsp,%rdi
-	call	prepare_exit_to_usermode
-
 SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 #ifdef CONFIG_DEBUG_ENTRY
 	/* Assert that pt_regs indicates user mode. */
@@ -802,23 +701,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	INTERRUPT_RETURN
 
 
-/* Returning to kernel space */
-retint_kernel:
-#ifdef CONFIG_PREEMPTION
-	/* Interrupts are off */
-	/* Check if we need preemption */
-	btl	$9, EFLAGS(%rsp)		/* were interrupts off? */
-	jnc	1f
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	1f
-	call	preempt_schedule_irq
-1:
-#endif
-	/*
-	 * The iretq could re-enable interrupts:
-	 */
-	TRACE_IRQS_IRETQ
-
 SYM_INNER_LABEL(restore_regs_and_return_to_kernel, SYM_L_GLOBAL)
 #ifdef CONFIG_DEBUG_ENTRY
 	/* Assert that pt_regs indicates kernel mode. */
@@ -932,31 +814,6 @@ SYM_CODE_END(common_interrupt_return)
 _ASM_NOKPROBE(common_interrupt_return)
 
 /*
- * APIC interrupts.
- */
-.macro apicinterrupt3 num sym do_sym
-SYM_CODE_START(\sym)
-	UNWIND_HINT_IRET_REGS
-	pushq	$~(\num)
-	call	interrupt_entry
-	UNWIND_HINT_REGS indirect=1
-	call	\do_sym	/* rdi points to pt_regs */
-	jmp	ret_from_intr
-SYM_CODE_END(\sym)
-_ASM_NOKPROBE(\sym)
-.endm
-
-/* Make sure APIC interrupt handlers end up in the irqentry section: */
-#define PUSH_SECTION_IRQENTRY	.pushsection .irqentry.text, "ax"
-#define POP_SECTION_IRQENTRY	.popsection
-
-.macro apicinterrupt num sym do_sym
-PUSH_SECTION_IRQENTRY
-apicinterrupt3 \num \sym \do_sym
-POP_SECTION_IRQENTRY
-.endm
-
-/*
  * Reload gs selector with exception handling
  * edi:  new selector
  *
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
deleted file mode 100644
index 3e841ed5c17a..000000000000
--- a/arch/x86/include/asm/entry_arch.h
+++ /dev/null
@@ -1,12 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * This file is designed to contain the BUILD_INTERRUPT specifications for
- * all of the extra named interrupt vectors used by the architecture.
- * Usually this is the Inter Process Interrupts (IPIs)
- */
-
-/*
- * The following vectors are part of the Linux architecture, there
- * is no hardware IRQ pin equivalent for them, they are triggered
- * through the ICC by us (IPIs)
- */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 4bbc770af632..5ad021708849 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -29,15 +29,16 @@
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/asm-offsets.h>
 #include <asm/paravirt.h>
+#define GET_CR2_INTO(reg) GET_CR2_INTO_AX ; _ASM_MOV %_ASM_AX, reg
 #else
 #define INTERRUPT_RETURN iretq
+#define GET_CR2_INTO(reg) _ASM_MOV %cr2, reg
 #endif
 
-/* we are not able to switch in one step to the final KERNEL ADDRESS SPACE
+/*
+ * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
  * because we need identity-mapped pages.
- *
  */
-
 #define l4_index(x)	(((x) >> 39) & 511)
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 



* [patch V6 32/37] x86/entry/64: Remove IRQ stack switching ASM
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (30 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 31/37] x86/entry: Remove the apic/BUILD interrupt leftovers Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:33   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 33/37] x86/entry: Make enter_from_user_mode() static Thomas Gleixner
                   ` (7 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7292525e2557..f213d573038e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -370,102 +370,6 @@ SYM_CODE_END(ret_from_fork)
 #endif
 .endm
 
-/*
- * Enters the IRQ stack if we're not already using it.  NMI-safe.  Clobbers
- * flags and puts old RSP into old_rsp, and leaves all other GPRs alone.
- * Requires kernel GSBASE.
- *
- * The invariant is that, if irq_count != -1, then the IRQ stack is in use.
- */
-.macro ENTER_IRQ_STACK regs=1 old_rsp save_ret=0
-	DEBUG_ENTRY_ASSERT_IRQS_OFF
-
-	.if \save_ret
-	/*
-	 * If save_ret is set, the original stack contains one additional
-	 * entry -- the return address. Therefore, move the address one
-	 * entry below %rsp to \old_rsp.
-	 */
-	leaq	8(%rsp), \old_rsp
-	.else
-	movq	%rsp, \old_rsp
-	.endif
-
-	.if \regs
-	UNWIND_HINT_REGS base=\old_rsp
-	.endif
-
-	incl	PER_CPU_VAR(irq_count)
-	jnz	.Lirq_stack_push_old_rsp_\@
-
-	/*
-	 * Right now, if we just incremented irq_count to zero, we've
-	 * claimed the IRQ stack but we haven't switched to it yet.
-	 *
-	 * If anything is added that can interrupt us here without using IST,
-	 * it must be *extremely* careful to limit its stack usage.  This
-	 * could include kprobes and a hypothetical future IST-less #DB
-	 * handler.
-	 *
-	 * The OOPS unwinder relies on the word at the top of the IRQ
-	 * stack linking back to the previous RSP for the entire time we're
-	 * on the IRQ stack.  For this to work reliably, we need to write
-	 * it before we actually move ourselves to the IRQ stack.
-	 */
-
-	movq	\old_rsp, PER_CPU_VAR(irq_stack_backing_store + IRQ_STACK_SIZE - 8)
-	movq	PER_CPU_VAR(hardirq_stack_ptr), %rsp
-
-#ifdef CONFIG_DEBUG_ENTRY
-	/*
-	 * If the first movq above becomes wrong due to IRQ stack layout
-	 * changes, the only way we'll notice is if we try to unwind right
-	 * here.  Assert that we set up the stack right to catch this type
-	 * of bug quickly.
-	 */
-	cmpq	-8(%rsp), \old_rsp
-	je	.Lirq_stack_okay\@
-	ud2
-	.Lirq_stack_okay\@:
-#endif
-
-.Lirq_stack_push_old_rsp_\@:
-	pushq	\old_rsp
-
-	.if \regs
-	UNWIND_HINT_REGS indirect=1
-	.endif
-
-	.if \save_ret
-	/*
-	 * Push the return address to the stack. This return address can
-	 * be found at the "real" original RSP, which was offset by 8 at
-	 * the beginning of this macro.
-	 */
-	pushq	-8(\old_rsp)
-	.endif
-.endm
-
-/*
- * Undoes ENTER_IRQ_STACK.
- */
-.macro LEAVE_IRQ_STACK regs=1
-	DEBUG_ENTRY_ASSERT_IRQS_OFF
-	/* We need to be off the IRQ stack before decrementing irq_count. */
-	popq	%rsp
-
-	.if \regs
-	UNWIND_HINT_REGS
-	.endif
-
-	/*
-	 * As in ENTER_IRQ_STACK, irq_count == 0, we are still claiming
-	 * the irq stack but we're not on it.
-	 */
-
-	decl	PER_CPU_VAR(irq_count)
-.endm
-
 /**
  * idtentry_body - Macro to emit code calling the C function
  * @cfunc:		C function to be called



* [patch V6 33/37] x86/entry: Make enter_from_user_mode() static
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (31 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 32/37] x86/entry/64: Remove IRQ stack switching ASM Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:34   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 34/37] x86/entry/32: Remove redundant irq disable code Thomas Gleixner
                   ` (6 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


The ASM users are gone. All callers are local.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 72588f1a45a2..a0772b0d6bc2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -54,7 +54,7 @@
  * 2) Invoke context tracking if enabled to reactivate RCU
  * 3) Trace interrupts off state
  */
-__visible noinstr void enter_from_user_mode(void)
+static noinstr void enter_from_user_mode(void)
 {
 	enum ctx_state state = ct_state();
 



* [patch V6 34/37] x86/entry/32: Remove redundant irq disable code
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (32 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 33/37] x86/entry: Make enter_from_user_mode() static Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:35   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 35/37] x86/entry/64: Remove TRACE_IRQS_*_DEBUG Thomas Gleixner
                   ` (5 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


All exceptions/interrupts return with interrupts disabled now. No point in
doing this in ASM again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


---
 arch/x86/entry/entry_32.S |   76 ----------------------------------------------
 1 file changed, 76 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -51,34 +51,6 @@
 
 	.section .entry.text, "ax"
 
-/*
- * We use macros for low-level operations which need to be overridden
- * for paravirtualization.  The following will never clobber any registers:
- *   INTERRUPT_RETURN (aka. "iret")
- *   GET_CR0_INTO_EAX (aka. "movl %cr0, %eax")
- *   ENABLE_INTERRUPTS_SYSEXIT (aka "sti; sysexit").
- *
- * For DISABLE_INTERRUPTS/ENABLE_INTERRUPTS (aka "cli"/"sti"), you must
- * specify what registers can be overwritten (CLBR_NONE, CLBR_EAX/EDX/ECX/ANY).
- * Allowing a register to be clobbered can shrink the paravirt replacement
- * enough to patch inline, increasing performance.
- */
-
-#ifdef CONFIG_PREEMPTION
-# define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
-#else
-# define preempt_stop(clobbers)
-#endif
-
-.macro TRACE_IRQS_IRET
-#ifdef CONFIG_TRACE_IRQFLAGS
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
-	jz	1f
-	TRACE_IRQS_ON
-1:
-#endif
-.endm
-
 #define PTI_SWITCH_MASK         (1 << PAGE_SHIFT)
 
 /*
@@ -881,38 +853,6 @@ SYM_CODE_START(ret_from_fork)
 SYM_CODE_END(ret_from_fork)
 .popsection
 
-/*
- * Return to user mode is not as complex as all this looks,
- * but we want the default path for a system call return to
- * go as quickly as possible which is why some of this is
- * less clear than it otherwise should be.
- */
-
-	# userspace resumption stub bypassing syscall exit tracing
-SYM_CODE_START_LOCAL(ret_from_exception)
-	preempt_stop(CLBR_ANY)
-ret_from_intr:
-#ifdef CONFIG_VM86
-	movl	PT_EFLAGS(%esp), %eax		# mix EFLAGS and CS
-	movb	PT_CS(%esp), %al
-	andl	$(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
-#else
-	/*
-	 * We can be coming here from child spawned by kernel_thread().
-	 */
-	movl	PT_CS(%esp), %eax
-	andl	$SEGMENT_RPL_MASK, %eax
-#endif
-	cmpl	$USER_RPL, %eax
-	jb	restore_all_kernel		# not returning to v8086 or userspace
-
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-	movl	%esp, %eax
-	call	prepare_exit_to_usermode
-	jmp	restore_all_switch_stack
-SYM_CODE_END(ret_from_exception)
-
 SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1147,22 +1087,6 @@ SYM_FUNC_START(entry_INT80_32)
 	 */
 	INTERRUPT_RETURN
 
-restore_all_kernel:
-#ifdef CONFIG_PREEMPTION
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	.Lno_preempt
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	.Lno_preempt
-	call	preempt_schedule_irq
-.Lno_preempt:
-#endif
-	TRACE_IRQS_IRET
-	PARANOID_EXIT_TO_KERNEL_MODE
-	BUG_IF_WRONG_CR3
-	RESTORE_REGS 4
-	jmp	.Lirq_return
-
 .section .fixup, "ax"
 SYM_CODE_START(asm_iret_error)
 	pushl	$0				# no error code



* [patch V6 35/37] x86/entry/64: Remove TRACE_IRQS_*_DEBUG
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (33 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 34/37] x86/entry/32: Remove redundant irq disable code Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:46   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code Thomas Gleixner
                   ` (4 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Since INT3/#BP no longer runs on an IST, this workaround is no longer
required.

Tested by running lockdep+ftrace as described in the initial commit:

  5963e317b1e9 ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f213d573038e..be6285f1c6f7 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -68,44 +68,6 @@ SYM_CODE_END(native_usergs_sysret64)
 .endm
 
 /*
- * When dynamic function tracer is enabled it will add a breakpoint
- * to all locations that it is about to modify, sync CPUs, update
- * all the code, sync CPUs, then remove the breakpoints. In this time
- * if lockdep is enabled, it might jump back into the debug handler
- * outside the updating of the IST protection. (TRACE_IRQS_ON/OFF).
- *
- * We need to change the IDT table before calling TRACE_IRQS_ON/OFF to
- * make sure the stack pointer does not get reset back to the top
- * of the debug stack, and instead just reuses the current stack.
- */
-#if defined(CONFIG_DYNAMIC_FTRACE) && defined(CONFIG_TRACE_IRQFLAGS)
-
-.macro TRACE_IRQS_OFF_DEBUG
-	call	debug_stack_set_zero
-	TRACE_IRQS_OFF
-	call	debug_stack_reset
-.endm
-
-.macro TRACE_IRQS_ON_DEBUG
-	call	debug_stack_set_zero
-	TRACE_IRQS_ON
-	call	debug_stack_reset
-.endm
-
-.macro TRACE_IRQS_IRETQ_DEBUG
-	btl	$9, EFLAGS(%rsp)		/* interrupts off? */
-	jnc	1f
-	TRACE_IRQS_ON_DEBUG
-1:
-.endm
-
-#else
-# define TRACE_IRQS_OFF_DEBUG			TRACE_IRQS_OFF
-# define TRACE_IRQS_ON_DEBUG			TRACE_IRQS_ON
-# define TRACE_IRQS_IRETQ_DEBUG			TRACE_IRQS_IRETQ
-#endif
-
-/*
  * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
  *
  * This is the only entry point used for 64-bit system calls.  The
@@ -500,11 +462,7 @@ SYM_CODE_START(\asmsym)
 
 	UNWIND_HINT_REGS
 
-	.if \vector == X86_TRAP_DB
-		TRACE_IRQS_OFF_DEBUG
-	.else
-		TRACE_IRQS_OFF
-	.endif
+	TRACE_IRQS_OFF
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
@@ -924,7 +882,7 @@ SYM_CODE_END(paranoid_entry)
 SYM_CODE_START_LOCAL(paranoid_exit)
 	UNWIND_HINT_REGS
 	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF_DEBUG
+	TRACE_IRQS_OFF
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
@@ -933,7 +891,7 @@ SYM_CODE_START_LOCAL(paranoid_exit)
 	SWAPGS_UNSAFE_STACK
 	jmp	restore_regs_and_return_to_kernel
 .Lparanoid_exit_no_swapgs:
-	TRACE_IRQS_IRETQ_DEBUG
+	TRACE_IRQS_IRETQ
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 	jmp restore_regs_and_return_to_kernel



* [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (34 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 35/37] x86/entry/64: Remove TRACE_IRQS_*_DEBUG Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-20  0:53   ` Andy Lutomirski
  2020-05-15 23:46 ` [patch V6 37/37] x86/entry: Remove the TRACE_IRQS cruft Thomas Gleixner
                   ` (3 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/entry/entry_64.S |   13 -------------
 arch/x86/kernel/nmi.c     |    3 +++
 2 files changed, 3 insertions(+), 13 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -16,7 +16,6 @@
  *
  * Some macro usage:
  * - SYM_FUNC_START/END:Define functions in the symbol table.
- * - TRACE_IRQ_*:	Trace hardirq state for lock debugging.
  * - idtentry:		Define exception entry points.
  */
 #include <linux/linkage.h>
@@ -107,11 +106,6 @@ SYM_CODE_END(native_usergs_sysret64)
 
 SYM_CODE_START(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
-	/*
-	 * Interrupts are off on entry.
-	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
-	 * it is too small to ever cause noticeable irq latency.
-	 */
 
 	swapgs
 	/* tss.sp2 is scratch space. */
@@ -462,8 +456,6 @@ SYM_CODE_START(\asmsym)
 
 	UNWIND_HINT_REGS
 
-	TRACE_IRQS_OFF
-
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
 	.if \vector == X86_TRAP_DB
@@ -881,17 +873,13 @@ SYM_CODE_END(paranoid_entry)
  */
 SYM_CODE_START_LOCAL(paranoid_exit)
 	UNWIND_HINT_REGS
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
-	TRACE_IRQS_IRETQ
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	restore_regs_and_return_to_kernel
 .Lparanoid_exit_no_swapgs:
-	TRACE_IRQS_IRETQ
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 	jmp restore_regs_and_return_to_kernel
@@ -1292,7 +1280,6 @@ SYM_CODE_START(asm_exc_nmi)
 	call	paranoid_entry
 	UNWIND_HINT_REGS
 
-	/* paranoidentry exc_nmi(), 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
 	call	exc_nmi
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
 	__this_cpu_write(last_nmi_rip, regs->ip);
 
 	instrumentation_begin();
+	trace_hardirqs_off_prepare();
 	ftrace_nmi_handler_enter();
 
 	handled = nmi_handle(NMI_LOCAL, regs);
@@ -422,6 +423,8 @@ static noinstr void default_do_nmi(struc
 
 out:
 	ftrace_nmi_handler_exit();
+	if (regs->flags & X86_EFLAGS_IF)
+		trace_hardirqs_on_prepare();
 	instrumentation_end();
 }
 



* [patch V6 37/37] x86/entry: Remove the TRACE_IRQS cruft
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (35 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code Thomas Gleixner
@ 2020-05-15 23:46 ` Thomas Gleixner
  2020-05-18 23:07   ` Andy Lutomirski
  2020-05-16 17:18 ` [patch V6 00/37] x86/entry: Rework leftovers and merge plan Paul E. McKenney
                   ` (2 subsequent siblings)
  39 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-15 23:46 UTC (permalink / raw)
  To: LKML
  Cc: x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)


No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index dce3ede7207e..fcdb41ac5c2e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -53,19 +53,6 @@ SYM_CODE_START(native_usergs_sysret64)
 SYM_CODE_END(native_usergs_sysret64)
 #endif /* CONFIG_PARAVIRT */
 
-.macro TRACE_IRQS_FLAGS flags:req
-#ifdef CONFIG_TRACE_IRQFLAGS
-	btl	$9, \flags		/* interrupts off? */
-	jnc	1f
-	TRACE_IRQS_ON
-1:
-#endif
-.endm
-
-.macro TRACE_IRQS_IRETQ
-	TRACE_IRQS_FLAGS EFLAGS(%rsp)
-.endm
-
 /*
  * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
  *
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index 34f980c9b766..ccd32877a3c4 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -3,7 +3,6 @@
  * Save registers before calling assembly functions. This avoids
  * disturbance of register allocation in some inline assembly constructs.
  * Copyright 2001,2002 by Andi Kleen, SuSE Labs.
- * Added trace_hardirqs callers - Copyright 2007 Steven Rostedt, Red Hat, Inc.
  */
 #include <linux/linkage.h>
 #include "calling.h"
@@ -37,11 +36,6 @@ SYM_FUNC_END(\name)
 	_ASM_NOKPROBE(\name)
 	.endm
 
-#ifdef CONFIG_TRACE_IRQFLAGS
-	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
-	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
-#endif
-
 #ifdef CONFIG_PREEMPTION
 	THUNK preempt_schedule_thunk, preempt_schedule
 	THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
@@ -49,8 +43,7 @@ SYM_FUNC_END(\name)
 	EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
 #endif
 
-#if defined(CONFIG_TRACE_IRQFLAGS) \
- || defined(CONFIG_PREEMPTION)
+#ifdef CONFIG_PREEMPTION
 SYM_CODE_START_LOCAL_NOALIGN(.L_restore)
 	popq %r11
 	popq %r10
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index e00f064b009e..8ddff8dbaed5 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -172,14 +172,4 @@ static inline int arch_irqs_disabled(void)
 }
 #endif /* !__ASSEMBLY__ */
 
-#ifdef __ASSEMBLY__
-#ifdef CONFIG_TRACE_IRQFLAGS
-#  define TRACE_IRQS_ON		call trace_hardirqs_on_thunk;
-#  define TRACE_IRQS_OFF	call trace_hardirqs_off_thunk;
-#else
-#  define TRACE_IRQS_ON
-#  define TRACE_IRQS_OFF
-#endif
-#endif /* __ASSEMBLY__ */
-
 #endif



* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (36 preceding siblings ...)
  2020-05-15 23:46 ` [patch V6 37/37] x86/entry: Remove the TRACE_IRQS cruft Thomas Gleixner
@ 2020-05-16 17:18 ` Paul E. McKenney
  2020-05-19 12:28   ` Joel Fernandes
  2020-05-18 16:07 ` Peter Zijlstra
  2020-05-19 18:37 ` Steven Rostedt
  39 siblings, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-16 17:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Sat, May 16, 2020 at 01:45:47AM +0200, Thomas Gleixner wrote:

[ . . . ]

>   - noinstr-rcu-nmi-2020-05-15
> 
>     Based on the core/rcu branch in the tip tree. It has merged in
>     noinstr-lds-2020-05-15 and contains the nmi_enter/exit() changes along
>     with the noinstr section changes on top.
> 
>     This tag is intended to be pulled by Paul into his rcu/next branch so
>     he can sort the conflicts and base further work on top.

And this sorting process is now allegedly complete and available on the
-rcu tree's "dev" branch.  As you might have guessed, the major source
of conflicts were with Joel's patches, including one conflict that was
invisible to "git rebase":

1b2530e7d0c3 ("Revert b8c17e6664c4 ("rcu: Maintain special bits at bottom of ->dynticks counter")
03f31532d0ce ("rcu/tree: Add better tracing for dyntick-idle")
a309d5ce2335 ("rcu/tree: Clean up dynticks counter usage")
5c6e734fbaeb ("rcu/tree: Remove dynticks_nmi_nesting counter")

This passes modest rcutorture testing.  So far, so good!  ;-)

							Thanx, Paul


* Re: [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace()
  2020-05-15 23:45 ` [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace() Thomas Gleixner
@ 2020-05-17  5:12   ` Andy Lutomirski
  2020-05-19 22:24   ` Steven Rostedt
  1 sibling, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-17  5:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> To fully isolate #DB and #BP from instrumentable code it's necessary to
> avoid invoking the hardware latency tracer on nmi_enter/exit().
>
> Provide nmi_enter/exit() variants which do not invoke the hardware
> latency tracer. That allows putting calls explicitly into the call sites
> outside of the kprobe handling.
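
Conceptually, the split described above is roughly the following. This is a
sketch only: the real nmi_enter_notrace()/nmi_exit_notrace() carry all of the
existing state handling (arch hooks, lockdep, preempt count, RCU), and the
regular variants merely add the hardware latency tracer hooks on top.

/* Layering sketch, not the actual definitions: */
#define nmi_enter()					\
	do {						\
		nmi_enter_notrace();			\
		ftrace_nmi_handler_enter();		\
	} while (0)

#define nmi_exit()					\
	do {						\
		ftrace_nmi_handler_exit();		\
		nmi_exit_notrace();			\
	} while (0)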


Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-15 23:45 ` [patch V6 04/37] x86: Make hardware latency tracing explicit Thomas Gleixner
@ 2020-05-17  5:36   ` Andy Lutomirski
  2020-05-17  8:48     ` Thomas Gleixner
  2020-05-18  8:01   ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-17  5:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> The hardware latency tracer calls into trace_sched_clock and ends up in
> various instrumentable functions which is problematic vs. the kprobe
> handling, especially the text poke machinery. It's invoked from
> nmi_enter/exit(), i.e. non-instrumentable code.
>
> Use nmi_enter/exit_notrace() instead. These variants do not invoke the
> hardware latency tracer which avoids chasing down complex callchains to
> make them non-instrumentable.
>
> The really interesting measurement is the actual NMI handler. Add an explicit
> invocation for the hardware latency tracer to it.
>
> #DB and #BP are uninteresting as they really should not be in use when
> analyzing hardware-induced latencies.
>

> @@ -849,7 +851,7 @@ static void noinstr handle_debug(struct
>  static __always_inline void exc_debug_kernel(struct pt_regs *regs,
>                                              unsigned long dr6)
>  {
> -       nmi_enter();
> +       nmi_enter_notrace();

Why can't exc_debug_kernel() handle instrumentation?  We shouldn't
recurse into #DB since we've already cleared DR7, right?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-17  5:36   ` Andy Lutomirski
@ 2020-05-17  8:48     ` Thomas Gleixner
  2020-05-18  5:50       ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-17  8:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>>
>> The hardware latency tracer calls into trace_sched_clock and ends up in
>> various instrumentable functions which is problematic vs. the kprobe
>> handling, especially the text poke machinery. It's invoked from
>> nmi_enter/exit(), i.e. non-instrumentable code.
>>
>> Use nmi_enter/exit_notrace() instead. These variants do not invoke the
>> hardware latency tracer which avoids chasing down complex callchains to
>> make them non-instrumentable.
>>
>> The really interesting measurement is the actual NMI handler. Add an explicit
>> invocation for the hardware latency tracer to it.
>>
>> #DB and #BP are uninteresting as they really should not be in use when
>> analyzing hardware-induced latencies.
>>
>
>> @@ -849,7 +851,7 @@ static void noinstr handle_debug(struct
>>  static __always_inline void exc_debug_kernel(struct pt_regs *regs,
>>                                              unsigned long dr6)
>>  {
>> -       nmi_enter();
>> +       nmi_enter_notrace();
>
> Why can't exc_debug_kernel() handle instrumentation?  We shouldn't
> recurse into #DB since we've already cleared DR7, right?

It can later on. The point is that the trace stuff calls into the world
and some more before the entry handling is complete.

Remember this is about ensuring that all the state is properly
established before any of this instrumentation muck can happen.

DR7 handling is specific to #DB and done even before nmi_enter to
prevent recursion.
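
For reference, the ordering being described looks roughly like the sketch
below (simplified, not the actual function body; names as in the quoted
hunk):

static __always_inline void exc_debug_kernel(struct pt_regs *regs,
					     unsigned long dr6)
{
	/*
	 * DR7 has already been cleared earlier in the #DB entry path,
	 * so breakpoints cannot re-trigger #DB from here on.
	 */
	nmi_enter_notrace();		/* establish entry state, no hwlat hook */
	instrumentation_begin();	/* instrumentation is safe only now */
	handle_debug(regs, dr6);	/* the actual #DB work */
	instrumentation_end();
	nmi_exit_notrace();
}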

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-17  8:48     ` Thomas Gleixner
@ 2020-05-18  5:50       ` Andy Lutomirski
  2020-05-18  8:03         ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18  5:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Sun, May 17, 2020 at 1:48 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >>
> >> The hardware latency tracer calls into trace_sched_clock and ends up in
> >> various instrumentable functions which is problematic vs. the kprobe
> >> handling, especially the text poke machinery. It's invoked from
> >> nmi_enter/exit(), i.e. non-instrumentable code.
> >>
> >> Use nmi_enter/exit_notrace() instead. These variants do not invoke the
> >> hardware latency tracer which avoids chasing down complex callchains to
> >> make them non-instrumentable.
> >>
> >> The really interesting measurement is the actual NMI handler. Add an explicit
> >> invocation for the hardware latency tracer to it.
> >>
> >> #DB and #BP are uninteresting as they really should not be in use when
> >> analyzing hardware-induced latencies.
> >>
> >
> >> @@ -849,7 +851,7 @@ static void noinstr handle_debug(struct
> >>  static __always_inline void exc_debug_kernel(struct pt_regs *regs,
> >>                                              unsigned long dr6)
> >>  {
> >> -       nmi_enter();
> >> +       nmi_enter_notrace();
> >
> > Why can't exc_debug_kernel() handle instrumentation?  We shouldn't
> > recurse into #DB since we've already cleared DR7, right?
>
> It can later on. The point is that the trace stuff calls into the world
> and some more before the entry handling is complete.
>
> Remember this is about ensuring that all the state is properly
> established before any of this instrumentation muck can happen.
>
> DR7 handling is specific to #DB and done even before nmi_enter to
> prevent recursion.

So why is this change needed?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-15 23:45 ` [patch V6 04/37] x86: Make hardware latency tracing explicit Thomas Gleixner
  2020-05-17  5:36   ` Andy Lutomirski
@ 2020-05-18  8:01   ` Peter Zijlstra
  2020-05-18  8:05     ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-18  8:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

On Sat, May 16, 2020 at 01:45:51AM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
>  	__this_cpu_write(last_nmi_rip, regs->ip);
>  
>  	instrumentation_begin();
> +	ftrace_nmi_handler_enter();
>  
>  	handled = nmi_handle(NMI_LOCAL, regs);
>  	__this_cpu_add(nmi_stats.normal, handled);
> @@ -420,6 +421,7 @@ static noinstr void default_do_nmi(struc
>  		unknown_nmi_error(reason, regs);
>  
>  out:
> +	ftrace_nmi_handler_exit();
>  	instrumentation_end();
>  }

Yeah, so I'm confused about this and the previous patch too. Why not
do just this? Remove that ftrace_nmi_handler.* crud from
nmi_{enter,exit}() and stick it here? Why do we need the
nmi_{enter,exit}_notrace() thing?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-18  5:50       ` Andy Lutomirski
@ 2020-05-18  8:03         ` Thomas Gleixner
  2020-05-18 20:42           ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-18  8:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Sun, May 17, 2020 at 1:48 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Remember this is about ensuring that all the state is properly
>> established before any of this instrumentation muck can happen.
>>
>> DR7 handling is specific to #DB and done even before nmi_enter to
>> prevent recursion.
>
> So why is this change needed?

We really want nmi_enter() to be the carefully crafted mechanism which
establishes correct state in whatever strange context the exception
hits. Not more, not less.

Random instrumentation has absolutely no business there and I went a
long way to make sure that this is enforceable by objtool.

Aside of that the tracing which is contained in nmi_enter() is about
taking timestamps for hardware latency detection. If someone runs
hardware latency detection with active break/watchpoints then I really
can't help it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-18  8:01   ` Peter Zijlstra
@ 2020-05-18  8:05     ` Thomas Gleixner
  2020-05-18  8:08       ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-18  8:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

Peter Zijlstra <peterz@infradead.org> writes:
> On Sat, May 16, 2020 at 01:45:51AM +0200, Thomas Gleixner wrote:
>> --- a/arch/x86/kernel/nmi.c
>> +++ b/arch/x86/kernel/nmi.c
>> @@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
>>  	__this_cpu_write(last_nmi_rip, regs->ip);
>>  
>>  	instrumentation_begin();
>> +	ftrace_nmi_handler_enter();
>>  
>>  	handled = nmi_handle(NMI_LOCAL, regs);
>>  	__this_cpu_add(nmi_stats.normal, handled);
>> @@ -420,6 +421,7 @@ static noinstr void default_do_nmi(struc
>>  		unknown_nmi_error(reason, regs);
>>  
>>  out:
>> +	ftrace_nmi_handler_exit();
>>  	instrumentation_end();
>>  }
>
> Yeah, so I'm confused about this and the previous patch too. Why not
> do just this? Remove that ftrace_nmi_handler.* crud from
> nmi_{enter,exit}() and stick it here? Why do we need the
> nmi_{enter,exit}_notrace() thing?

Because you then have to fixup _all_ architectures which use
nmi_enter/exit().

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-18  8:05     ` Thomas Gleixner
@ 2020-05-18  8:08       ` Peter Zijlstra
  2020-05-20 20:09         ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-18  8:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

On Mon, May 18, 2020 at 10:05:56AM +0200, Thomas Gleixner wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > On Sat, May 16, 2020 at 01:45:51AM +0200, Thomas Gleixner wrote:
> >> --- a/arch/x86/kernel/nmi.c
> >> +++ b/arch/x86/kernel/nmi.c
> >> @@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
> >>  	__this_cpu_write(last_nmi_rip, regs->ip);
> >>  
> >>  	instrumentation_begin();
> >> +	ftrace_nmi_handler_enter();
> >>  
> >>  	handled = nmi_handle(NMI_LOCAL, regs);
> >>  	__this_cpu_add(nmi_stats.normal, handled);
> >> @@ -420,6 +421,7 @@ static noinstr void default_do_nmi(struc
> >>  		unknown_nmi_error(reason, regs);
> >>  
> >>  out:
> >> +	ftrace_nmi_handler_exit();
> >>  	instrumentation_end();
> >>  }
> >
> > Yeah, so I'm confused about this and the previous patch too. Why not
> > do just this? Remove that ftrace_nmi_handler.* crud from
> > nmi_{enter,exit}() and stick it here? Why do we need the
> > nmi_{enter,exit}_notrace() thing?
> 
> Because you then have to fixup _all_ architectures which use
> nmi_enter/exit().

We probably have to anyway. But I can do that later I suppose.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (37 preceding siblings ...)
  2020-05-16 17:18 ` [patch V6 00/37] x86/entry: Rework leftovers and merge plan Paul E. McKenney
@ 2020-05-18 16:07 ` Peter Zijlstra
  2020-05-18 18:53   ` Thomas Gleixner
  2020-05-18 20:24   ` Thomas Gleixner
  2020-05-19 18:37 ` Steven Rostedt
  39 siblings, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-18 16:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui


So on top of your entry-v8-full, I had to chase one of those cases where
instrumentation_end() escapes an (extended) basic block (again!).

How about we do something like the below; that fixes the current case
(rcu_eqs_enter) but also kills the entire class.
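
The class in question is the pattern sketched below: an empty-label
instrumentation_end() at the end of a conditional block can land on the
instruction after the block, so objtool's begin/end accounting drops to zero
too early. A fuller write-up appears as a comment in the second version of
this patch further down the thread.

noinstr void foo(void)
{
	instrumentation_begin();
	if (cond) {
		instrumentation_begin();
		/* ... */
		instrumentation_end();	/* empty label may escape the block */
	}
	bar();		/* appears to run outside an instrumentable region */
	instrumentation_end();
}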

---
 arch/x86/include/asm/bug.h     |  2 +-
 include/linux/compiler.h       | 16 +++++++++++++---
 include/linux/compiler_types.h |  4 ----
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index f128e5c2ed42..fb34ff641e0a 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -79,8 +79,8 @@ do {								\
 do {								\
 	instrumentation_begin();				\
 	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
-	instrumentation_end();					\
 	annotate_reachable();					\
+	instrumentation_end();					\
 } while (0)
 
 #include <asm-generic/bug.h>
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 7db5902f8f6e..b4c248e6b76a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -120,25 +120,35 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 /* Annotate a C jump table to allow objtool to follow the code flow */
 #define __annotate_jump_table __section(.rodata..c_jump_table)
 
+#ifdef CONFIG_DEBUG_ENTRY
+/* Section for code which can't be instrumented at all */
+#define noinstr								\
+	noinline notrace __attribute((__section__(".noinstr.text")))
+
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({						\
+#define instrumentation_begin() ({					\
 	asm volatile("%c0:\n\t"						\
 		     ".pushsection .discard.instr_begin\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
 
-#define instrumentation_end() ({							\
-	asm volatile("%c0:\n\t"						\
+#define instrumentation_end() ({					\
+	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_end\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
+#endif /* CONFIG_DEBUG_ENTRY */
 
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
 #define __annotate_jump_table
+#endif
+
+#ifndef noinstr
+#define noinstr noinline notrace
 #define instrumentation_begin()		do { } while(0)
 #define instrumentation_end()		do { } while(0)
 #endif
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index ea15ea99efb4..6ed0612bc143 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -118,10 +118,6 @@ struct ftrace_likely_data {
 #define notrace			__attribute__((__no_instrument_function__))
 #endif
 
-/* Section for code which can't be instrumented at all */
-#define noinstr								\
-	noinline notrace __attribute((__section__(".noinstr.text")))
-
 /*
  * it doesn't make sense on ARM (currently the only user of __naked)
  * to trace naked functions because then mcount is called without

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-18 16:07 ` Peter Zijlstra
@ 2020-05-18 18:53   ` Thomas Gleixner
  2020-05-19  8:29     ` Peter Zijlstra
  2020-05-18 20:24   ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-18 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

Peter Zijlstra <peterz@infradead.org> writes:
> So on top of your entry-v8-full, I had to chase one of those cases where
> instrumentation_end() escapes an (extended) basic block (again!).
>  
> +#ifdef CONFIG_DEBUG_ENTRY

Why this? We lose the kprobes runtime protection that way.

> +/* Section for code which can't be instrumented at all */
> +#define noinstr								\
> +	noinline notrace __attribute((__section__(".noinstr.text")))
> +
>  /* Begin/end of an instrumentation safe region */
> -#define instrumentation_begin() ({						\
> +#define instrumentation_begin() ({					\
>  	asm volatile("%c0:\n\t"						\
>  		     ".pushsection .discard.instr_begin\n\t"		\
>  		     ".long %c0b - .\n\t"				\
>  		     ".popsection\n\t" : : "i" (__COUNTER__));

Nifty.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-18 16:07 ` Peter Zijlstra
  2020-05-18 18:53   ` Thomas Gleixner
@ 2020-05-18 20:24   ` Thomas Gleixner
  2020-05-19  8:38     ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-18 20:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

Peter Zijlstra <peterz@infradead.org> writes:
> So on top of your entry-v8-full, I had to chase one of those cases where
> instrumentation_end() escapes an (extended) basic block (again!).
>
> --- a/arch/x86/include/asm/bug.h
> +++ b/arch/x86/include/asm/bug.h
> @@ -79,8 +79,8 @@ do {								\
>  do {								\
>  	instrumentation_begin();				\
>  	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
> -	instrumentation_end();					\
>  	annotate_reachable();					\
> +	instrumentation_end();					\
>  } while (0)

I just applied this part and rebuilt:

 vmlinux.o: warning: objtool: rcu_eqs_enter.constprop.77()+0xa9: call to
 rcu_preempt_deferred_qs() leaves .noinstr.text section

Did it go away after you disabled DEBUG_ENTRY perhaps?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-18  8:03         ` Thomas Gleixner
@ 2020-05-18 20:42           ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 20:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Mon, May 18, 2020 at 1:03 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Sun, May 17, 2020 at 1:48 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Remember this is about ensuring that all the state is properly
> >> established before any of this instrumentation muck can happen.
> >>
> >> DR7 handling is specific to #DB and done even before nmi_enter to
> >> prevent recursion.
> >
> > So why is this change needed?
>
> We really want nmi_enter() to be the carefully crafted mechanism which
> establishes correct state in whatever strange context the exception
> hits. Not more, not less.
>
> Random instrumentation has absolutely no business there and I went a
> long way to make sure that this is enforcible by objtool.
>
> Aside of that the tracing which is contained in nmi_enter() is about
> taking timestamps for hardware latency detection. If someone runs
> hardware latency detection with active break/watchpoints then I really
> can't help it.
>

Okay.  I'll stop looking for the bug you're fixing, then.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 05/37] genirq: Provide irq_enter/exit_rcu()
  2020-05-15 23:45 ` [patch V6 05/37] genirq: Provide irq_enter/exit_rcu() Thomas Gleixner
@ 2020-05-18 23:06   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> irq_enter()/exit() include the RCU handling. To properly separate the RCU
> handling provide variants which contain only the non-RCU related
> functionality.

Reviewed-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 06/37] genirq: Provde __irq_enter/exit_raw()
  2020-05-15 23:45 ` [patch V6 06/37] genirq: Provde __irq_enter/exit_raw() Thomas Gleixner
@ 2020-05-18 23:07   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>

Reviewed-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 37/37] x86/entry: Remove the TRACE_IRQS cruft
  2020-05-15 23:46 ` [patch V6 37/37] x86/entry: Remove the TRACE_IRQS cruft Thomas Gleixner
@ 2020-05-18 23:07   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> No more users.

Hallelujah!

Acked-by: Andy Lutomirski <luto@kernel.org>

>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-15 23:45 ` [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack Thomas Gleixner
@ 2020-05-18 23:11   ` Andy Lutomirski
  2020-05-18 23:46     ` Andy Lutomirski
  2020-05-18 23:51     ` Thomas Gleixner
  0 siblings, 2 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Device interrupt handlers and system vector handlers are executed on the
> interrupt stack. The stack switch happens in the low level assembly entry
> code. This conflicts with the efforts to consolidate the exit code in C to
> ensure correctness vs. RCU and tracing.
>
> As there is no way to move #DB away from IST due to the MOV SS issue, the
> requirements vs. #DB and NMI for switching to the interrupt stack do not
> exist anymore. The only requirement is that interrupts are disabled.
>
> That allows moving the stack switching to C code which simplifies the
> entry/exit handling further because it allows switching stacks after
> handling the entry and on exit before handling RCU, return to usermode and
> kernel preemption in the same way as for regular exceptions.
>
> The initial attempt of having the stack switching in inline ASM caused too
> much headache vs. objtool and the unwinder. After analysing the use cases
> it was agreed that having the stack switch in ASM for the price of an
> indirect call is acceptable as the main users are indirect call heavy
> anyway and the few system vectors which are empty shells (scheduler IPI and
> KVM posted interrupt vectors) can run from the regular stack.
>
> Provide helper functions to check whether the interrupt stack is already
> active and whether stack switching is required.
>
> 64 bit only for now. 32 bit has a variant of that already. Once this is
> cleaned up the two implementations might be consolidated as a cleanup on
> top.
>
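
For orientation, the two helpers the changelog refers to have roughly the
shape sketched below. The helper names appear in the patches themselves; the
per-CPU irq_count variable and its -1 "nothing on the IRQ stack" convention
are taken from the follow-up discussion further down the thread, so treat the
details as illustrative.

/* Sketch only -- the real definitions live in <asm/irq_stack.h>. */
DECLARE_PER_CPU(int, irq_count);

static __always_inline bool irqstack_active(void)
{
	/* irq_count is -1 while nothing runs on the IRQ stack */
	return __this_cpu_read(irq_count) != -1;
}

static __always_inline bool irq_needs_irq_stack(struct pt_regs *regs)
{
	/* Switch only for kernel-mode entry, and only if the IRQ stack
	 * is not already in use. */
	return !user_mode(regs) && !irqstack_active();
}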

Acked-by: Andy Lutomirski <luto@kernel.org>

Have you tested by forcing a stack trace from the IRQ stack and making
sure it unwinds all the way out?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-18 23:11   ` Andy Lutomirski
@ 2020-05-18 23:46     ` Andy Lutomirski
  2020-05-18 23:53       ` Thomas Gleixner
  2020-05-18 23:51     ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Mon, May 18, 2020 at 4:11 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >
> > Device interrupt handlers and system vector handlers are executed on the
> > interrupt stack. The stack switch happens in the low level assembly entry
> > code. This conflicts with the efforts to consolidate the exit code in C to
> > ensure correctness vs. RCU and tracing.
> >
> > As there is no way to move #DB away from IST due to the MOV SS issue, the
> > requirements vs. #DB and NMI for switching to the interrupt stack do not
> > exist anymore. The only requirement is that interrupts are disabled.
> >
> > That allows moving the stack switching to C code which simplifies the
> > entry/exit handling further because it allows switching stacks after
> > handling the entry and on exit before handling RCU, return to usermode and
> > kernel preemption in the same way as for regular exceptions.
> >
> > The initial attempt of having the stack switching in inline ASM caused too
> > much headache vs. objtool and the unwinder. After analysing the use cases
> > it was agreed that having the stack switch in ASM for the price of an
> > indirect call is acceptable as the main users are indirect call heavy
> > anyway and the few system vectors which are empty shells (scheduler IPI and
> > KVM posted interrupt vectors) can run from the regular stack.
> >
> > Provide helper functions to check whether the interrupt stack is already
> > active and whether stack switching is required.
> >
> > 64 bit only for now. 32 bit has a variant of that already. Once this is
> > cleaned up the two implementations might be consolidated as a cleanup on
> > top.
> >
>
> Acked-by: Andy Lutomirski <luto@kernel.org>
>
> Have you tested by forcing a stack trace from the IRQ stack and making
> sure it unwinds all the way out?

Actually, I revoke my ack.  Can you make one of two changes:

Option A: Add an assertion to run_on_irqstack to verify that irq_count
was -1 at the beginning?  I suppose this also means you could just
explicitly write 0 instead of adding and subtracting.

Option B: Make run_on_irqstack() just call the function on the current
stack if we're already on the irq stack.

Right now, it's too easy to mess up and not verify the right
precondition before calling run_on_irqstack().

If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
dance so that users can just do:

run_on_irqstack_if_needed(...);

instead of checking everything themselves.
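
A hypothetical helper along those lines could look like the sketch below. The
name run_on_irqstack_if_needed() is taken from the suggestion above, the
callback signature is simplified to a plain void(void), and irqstack_active()
/ run_on_irqstack() are the helpers from the patch under discussion.

static __always_inline void run_on_irqstack_if_needed(void (*func)(void))
{
	if (irqstack_active())
		func();				/* already on the IRQ stack */
	else
		run_on_irqstack(func, NULL);	/* switch stacks, then call */
}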

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 08/37] x86/entry/64: Move do_softirq_own_stack() to C
  2020-05-15 23:45 ` [patch V6 08/37] x86/entry/64: Move do_softirq_own_stack() to C Thomas Gleixner
@ 2020-05-18 23:48   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> The first step to get rid of the ENTER/LEAVE_IRQ_STACK ASM macro maze.  Use
> the new C code helpers to move do_softirq_own_stack() out of ASM code.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 3b8da9f09297..bdf8391b2f95 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1145,19 +1145,6 @@ SYM_FUNC_START(asm_call_on_stack)
>         ret
>  SYM_FUNC_END(asm_call_on_stack)
>
> -/* Call softirq on interrupt stack. Interrupts are off. */
> -.pushsection .text, "ax"
> -SYM_FUNC_START(do_softirq_own_stack)
> -       pushq   %rbp
> -       mov     %rsp, %rbp
> -       ENTER_IRQ_STACK regs=0 old_rsp=%r11
> -       call    __do_softirq
> -       LEAVE_IRQ_STACK regs=0
> -       leaveq
> -       ret
> -SYM_FUNC_END(do_softirq_own_stack)
> -.popsection
> -
>  #ifdef CONFIG_XEN_PV
>  /*
>   * A note on the "critical region" in our callback handler.
> diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
> index 12df3a4abfdd..62cff52e03c5 100644
> --- a/arch/x86/kernel/irq_64.c
> +++ b/arch/x86/kernel/irq_64.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/task_stack.h>
>
>  #include <asm/cpu_entry_area.h>
> +#include <asm/irq_stack.h>
>  #include <asm/io_apic.h>
>  #include <asm/apic.h>
>
> @@ -70,3 +71,11 @@ int irq_init_percpu_irqstack(unsigned int cpu)
>                 return 0;
>         return map_irq_stack(cpu);
>  }
> +
> +void do_softirq_own_stack(void)
> +{
> +       if (irqstack_active())
> +               __do_softirq();
> +       else
> +               run_on_irqstack(__do_softirq, NULL);
> +}

See my comment in patch 8.  I see no great reason that this should
open-code the conditional, except maybe that we don't have pt_regs
here so the condition needs to be a bit different.
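
With a helper along the lines sketched earlier in the thread (the hypothetical
run_on_irqstack_if_needed(), adjusted here for the lack of pt_regs), the
open-coded conditional would collapse to something like:

void do_softirq_own_stack(void)
{
	/* falls back to a direct call when the IRQ stack is already active */
	run_on_irqstack_if_needed(__do_softirq);
}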

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 09/37] x86/entry: Split idtentry_enter/exit()
  2020-05-15 23:45 ` [patch V6 09/37] x86/entry: Split idtentry_enter/exit() Thomas Gleixner
@ 2020-05-18 23:49   ` Andy Lutomirski
  2020-05-19  8:25     ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Split the implementation of idtentry_enter/exit() out into inline functions
> so that variants of idtentry_enter/exit() can be implemented without
> duplicating code.
>

After reading just this patch,  I don't see how it helps anything.
Maybe it'll make more sense after I read more of the series.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-18 23:11   ` Andy Lutomirski
  2020-05-18 23:46     ` Andy Lutomirski
@ 2020-05-18 23:51     ` Thomas Gleixner
  1 sibling, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-18 23:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Have you tested by forcing a stack trace from the IRQ stack and making
> sure it unwinds all the way out?

Yes.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-18 23:46     ` Andy Lutomirski
@ 2020-05-18 23:53       ` Thomas Gleixner
  2020-05-18 23:56         ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-18 23:53 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> Actually, I revoke my ack.  Can you make one of two changes:
>
> Option A: Add an assertion to run_on_irqstack to verify that irq_count
> was -1 at the beginning?  I suppose this also means you could just
> explicitly write 0 instead of adding and subtracting.
>
> Option B: Make run_on_irqstack() just call the function on the current
> stack if we're already on the irq stack.
>
> Right now, it's too easy to mess up and not verify the right
> precondition before calling run_on_irqstack().
>
> If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
> dance so that users can just do:
>
> run_on_irqstack_if_needed(...);
>
> instead of checking everything themselves.

I'll have a look tomorrow morning with brain awake.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-18 23:53       ` Thomas Gleixner
@ 2020-05-18 23:56         ` Andy Lutomirski
  2020-05-20 12:35           ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-18 23:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Mon, May 18, 2020 at 4:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > Actually, I revoke my ack.  Can you make one of two changes:
> >
> > Option A: Add an assertion to run_on_irqstack to verify that irq_count
> > was -1 at the beginning?  I suppose this also means you could just
> > explicitly write 0 instead of adding and subtracting.
> >
> > Option B: Make run_on_irqstack() just call the function on the current
> > stack if we're already on the irq stack.
> >
> > Right now, it's too easy to mess up and not verify the right
> > precondition before calling run_on_irqstack().
> >
> > If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
> > dance so that users can just do:
> >
> > run_on_irqstack_if_needed(...);
> >
> > instead of checking everything themselves.
>
> I'll have a look tomorrow morning with brain awake.

Also, reading more of the series, I suspect that asm_call_on_stack is
logically in the wrong section or that the noinstr stuff is otherwise
not quite right.  I think that objtool should not accept
run_on_irqstack() from noinstr code.  See followups on patch 10.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 09/37] x86/entry: Split idtentry_enter/exit()
  2020-05-18 23:49   ` Andy Lutomirski
@ 2020-05-19  8:25     ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19  8:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>>
>> Split the implementation of idtentry_enter/exit() out into inline functions
>> so that variants of idtentry_enter/exit() can be implemented without
>> duplicating code.
>>
> After reading just this patch,  I don't see how it helps anything.
> Maybe it'll make more sense after I read more of the series.

I hope so :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-18 18:53   ` Thomas Gleixner
@ 2020-05-19  8:29     ` Peter Zijlstra
  0 siblings, 0 replies; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-19  8:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

On Mon, May 18, 2020 at 08:53:49PM +0200, Thomas Gleixner wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > So on top of your entry-v8-full, I had to chase one of those cases where
> > instrumentation_end() escapes an (extended) basic block (again!).
> >  
> > +#ifdef CONFIG_DEBUG_ENTRY
> 
> Why this? We lose the kprobes runtime protection that way.

Oh bugger indeed. I forgot about that :-(

I added the CONFIG_DEBUG_ENTRY dependency to
instrumentation_{begin,end}() because they now emit actual code, and I
figured we shouldn't bother 'production' kernels with all them extra
NOPs.

And then I figured (wrongly!) that since I have that, I might as well
add noinstr to it.

> > +/* Section for code which can't be instrumented at all */
> > +#define noinstr								\
> > +	noinline notrace __attribute((__section__(".noinstr.text")))
> > +
> >  /* Begin/end of an instrumentation safe region */
> > -#define instrumentation_begin() ({						\
> > +#define instrumentation_begin() ({					\
> >  	asm volatile("%c0:\n\t"						\
> >  		     ".pushsection .discard.instr_begin\n\t"		\
> >  		     ".long %c0b - .\n\t"				\
> >  		     ".popsection\n\t" : : "i" (__COUNTER__));
> 
> Nifty.

Yeah, took a bit of fiddling because objtool is a bit weird vs UD2, but
if you order it just right in the WARN thing it works :-)

You want a new delta without the noinstr thing on?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-18 20:24   ` Thomas Gleixner
@ 2020-05-19  8:38     ` Peter Zijlstra
  2020-05-19  9:02       ` Peter Zijlstra
  2020-05-19  9:06       ` Thomas Gleixner
  0 siblings, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-19  8:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

On Mon, May 18, 2020 at 10:24:53PM +0200, Thomas Gleixner wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > So on top of your entry-v8-full, I had to chase one of those cases where
> > instrumentation_end() escapes an (extended) basic block (again!).
> >
> > --- a/arch/x86/include/asm/bug.h
> > +++ b/arch/x86/include/asm/bug.h
> > @@ -79,8 +79,8 @@ do {								\
> >  do {								\
> >  	instrumentation_begin();				\
> >  	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
> > -	instrumentation_end();					\
> >  	annotate_reachable();					\
> > +	instrumentation_end();					\
> >  } while (0)
> 
> I just applied this part and rebuilt:
> 
>  vmlinux.o: warning: objtool: rcu_eqs_enter.constprop.77()+0xa9: call to
>  rcu_preempt_deferred_qs() leaves .noinstr.text section
> 
> Did it go away after you disabled DEBUG_ENTRY perhaps?

Hehe, then all complaints would be gone :-)

So tglx/entry-v8-full + below patch:

$ make O=defconfig-build clean
...
$ make CC=gcc-9 O=defconfig-build/ vmlinux -j40 -s
vmlinux.o: warning: objtool: exc_debug()+0x158: call to trace_hwlat_timestamp() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_nmi()+0x190: call to trace_hwlat_timestamp() leaves .noinstr.text section
vmlinux.o: warning: objtool: do_machine_check()+0x46: call to mce_rdmsrl() leaves .noinstr.text section
$

(it really isn't defconfig, but your config-fail + DEBUG_ENTRY)

---
 arch/x86/include/asm/bug.h |  2 +-
 include/linux/compiler.h   | 11 ++++++++---
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index f128e5c2ed42..fb34ff641e0a 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -79,8 +79,8 @@ do {								\
 do {								\
 	instrumentation_begin();				\
 	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
-	instrumentation_end();					\
 	annotate_reachable();					\
+	instrumentation_end();					\
 } while (0)
 
 #include <asm-generic/bug.h>
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 7db5902f8f6e..f6f54e8e0797 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -120,25 +120,30 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 /* Annotate a C jump table to allow objtool to follow the code flow */
 #define __annotate_jump_table __section(.rodata..c_jump_table)
 
+#ifdef CONFIG_DEBUG_ENTRY
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({						\
+#define instrumentation_begin() ({					\
 	asm volatile("%c0:\n\t"						\
 		     ".pushsection .discard.instr_begin\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
 
-#define instrumentation_end() ({							\
-	asm volatile("%c0:\n\t"						\
+#define instrumentation_end() ({					\
+	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_end\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
+#endif /* CONFIG_DEBUG_ENTRY */
 
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
 #define __annotate_jump_table
+#endif
+
+#ifndef instrumentation_begin
 #define instrumentation_begin()		do { } while(0)
 #define instrumentation_end()		do { } while(0)
 #endif

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-19  8:38     ` Peter Zijlstra
@ 2020-05-19  9:02       ` Peter Zijlstra
  2020-05-23  2:52         ` Lai Jiangshan
  2020-05-19  9:06       ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-19  9:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

On Tue, May 19, 2020 at 10:38:26AM +0200, Peter Zijlstra wrote:
> $ make CC=gcc-9 O=defconfig-build/ vmlinux -j40 -s
> vmlinux.o: warning: objtool: exc_debug()+0x158: call to trace_hwlat_timestamp() leaves .noinstr.text section
> vmlinux.o: warning: objtool: exc_nmi()+0x190: call to trace_hwlat_timestamp() leaves .noinstr.text section
> vmlinux.o: warning: objtool: do_machine_check()+0x46: call to mce_rdmsrl() leaves .noinstr.text section
> $
> 
> (it really isn't defconfig, but your config-fail + DEBUG_ENTRY)
> 

With comment on, as requested.

---
diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index f128e5c2ed42..fb34ff641e0a 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -79,8 +79,8 @@ do {								\
 do {								\
 	instrumentation_begin();				\
 	_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags));		\
-	instrumentation_end();					\
 	annotate_reachable();					\
+	instrumentation_end();					\
 } while (0)
 
 #include <asm-generic/bug.h>
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 7db5902f8f6e..4b8fabed46ae 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -120,25 +120,61 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 /* Annotate a C jump table to allow objtool to follow the code flow */
 #define __annotate_jump_table __section(.rodata..c_jump_table)
 
+#ifdef CONFIG_DEBUG_ENTRY
 /* Begin/end of an instrumentation safe region */
-#define instrumentation_begin() ({						\
+#define instrumentation_begin() ({					\
 	asm volatile("%c0:\n\t"						\
 		     ".pushsection .discard.instr_begin\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
 
-#define instrumentation_end() ({							\
-	asm volatile("%c0:\n\t"						\
+/*
+ * Because instrumentation_{begin,end}() can nest, objtool validation considers
+ * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
+ * When the value is greater than 0, we consider instrumentation allowed.
+ *
+ * There is a problem with code like:
+ *
+ * noinstr void foo()
+ * {
+ *	instrumentation_begin();
+ *	...
+ *	if (cond) {
+ *		instrumentation_begin();
+ *		...
+ *		instrumentation_end();
+ *	}
+ *	bar();
+ *	instrumentation_end();
+ * }
+ *
+ * If instrumentation_end() would be an empty label, like all the other
+ * annotations, the inner _end(), which is at the end of a conditional block,
+ * would land on the instruction after the block.
+ *
+ * If we then consider the sum of the !cond path, we'll see that the call to
+ * bar() is with a 0-value, even though, we meant it to happen with a positive
+ * value.
+ *
+ * To avoid this, have _end() be a NOP instruction, this ensures it will be
+ * part of the condition block and does not escape.
+ */
+#define instrumentation_end() ({					\
+	asm volatile("%c0: nop\n\t"					\
 		     ".pushsection .discard.instr_end\n\t"		\
 		     ".long %c0b - .\n\t"				\
 		     ".popsection\n\t" : : "i" (__COUNTER__));		\
 })
+#endif /* CONFIG_DEBUG_ENTRY */
 
 #else
 #define annotate_reachable()
 #define annotate_unreachable()
 #define __annotate_jump_table
+#endif
+
+#ifndef instrumentation_begin
 #define instrumentation_begin()		do { } while(0)
 #define instrumentation_end()		do { } while(0)
 #endif

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-19  8:38     ` Peter Zijlstra
  2020-05-19  9:02       ` Peter Zijlstra
@ 2020-05-19  9:06       ` Thomas Gleixner
  1 sibling, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19  9:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 18, 2020 at 10:24:53PM +0200, Thomas Gleixner wrote:
> So tglx/entry-v8-full + below patch:
>
> $ make O=defconfig-build clean
> ...
> $ make CC=gcc-9 O=defconfig-build/ vmlinux -j40 -s
> vmlinux.o: warning: objtool: exc_debug()+0x158: call to trace_hwlat_timestamp() leaves .noinstr.text section
> vmlinux.o: warning: objtool: exc_nmi()+0x190: call to trace_hwlat_timestamp() leaves .noinstr.text section
> vmlinux.o: warning: objtool: do_machine_check()+0x46: call to mce_rdmsrl() leaves .noinstr.text section
> $
>
> (it really isn't defconfig, but your config-fail + DEBUG_ENTRY)
>  
> +#ifdef CONFIG_DEBUG_ENTRY
>  /* Begin/end of an instrumentation safe region */
> -#define instrumentation_begin() ({						\
> +#define instrumentation_begin() ({					\
>  	asm volatile("%c0:\n\t"						\
>  		     ".pushsection .discard.instr_begin\n\t"		\
>  		     ".long %c0b - .\n\t"				\
>  		     ".popsection\n\t" : : "i" (__COUNTER__));		\
>  })
>  
> -#define instrumentation_end() ({							\
> -	asm volatile("%c0:\n\t"						\
> +#define instrumentation_end() ({					\
> +	asm volatile("%c0: nop\n\t"					\

Bah. I fatfingered that nop out when I fixed up that noinstr wreckage.
With that added back it does what it claims to do.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-16 17:18 ` [patch V6 00/37] x86/entry: Rework leftovers and merge plan Paul E. McKenney
@ 2020-05-19 12:28   ` Joel Fernandes
  0 siblings, 0 replies; 159+ messages in thread
From: Joel Fernandes @ 2020-05-19 12:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, LKML, x86, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Sat, May 16, 2020 at 10:18:45AM -0700, Paul E. McKenney wrote:
> On Sat, May 16, 2020 at 01:45:47AM +0200, Thomas Gleixner wrote:
> 
> [ . . . ]
> 
> >   - noinstr-rcu-nmi-2020-05-15
> > 
> >     Based on the core/rcu branch in the tip tree. It has merged in
> >     noinstr-lds-2020-05-15 and contains the nmi_enter/exit() changes along
> >     with the noinstr section changes on top.
> > 
> >     This tag is intended to be pulled by Paul into his rcu/next branch so
> >     he can sort the conflicts and base further work on top.
> 
> And this sorting process is now allegedly complete and available on the
> -rcu tree's "dev" branch.  As you might have guessed, the major source
> of conflicts was with Joel's patches, including one conflict that was
> invisible to "git rebase":
> 
> 1b2530e7d0c3 ("Revert b8c17e6664c4 ("rcu: Maintain special bits at bottom of ->dynticks counter")
> 03f31532d0ce ("rcu/tree: Add better tracing for dyntick-idle")
> a309d5ce2335 ("rcu/tree: Clean up dynticks counter usage")
> 5c6e734fbaeb ("rcu/tree: Remove dynticks_nmi_nesting counter")
> 
> This passes modest rcutorture testing.  So far, so good!  ;-)

Also I double checked these patches in the rcu/dev branch. It looks good to
me, thanks for resolving the conflicts!!

 - Joel


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-15 23:45 ` [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY Thomas Gleixner
@ 2020-05-19 17:06   ` Andy Lutomirski
  2020-05-19 18:57     ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 17:06 UTC (permalink / raw)
  To: Thomas Gleixner, Andrew Cooper
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert the XEN/PV hypercall to IDTENTRY:
>
>   - Emit the ASM stub with DECLARE_IDTENTRY
>   - Remove the ASM idtentry in 64bit
>   - Remove the open coded ASM entry code in 32bit
>   - Remove the old prototypes
>
> The handler stubs need to stay in ASM code as they need corner case handling
> and adjustment of the stack pointer.
>
> Provide a new C function which invokes the entry/exit handling and calls
> into the XEN handler on the interrupt stack.
>
> The exit code is slightly different from the regular idtentry_exit() on
> non-preemptible kernels. If the hypercall is preemptible and need_resched()
> is set then XEN provides a preempt hypercall scheduling function. Add it as a
> conditional path to __idtentry_exit() so the function can be reused.
>
> __idtentry_exit() is forced inlined so on the regular idtentry_exit() path
> the extra condition is optimized out by the compiler.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 882ada245bd5..34caf3849632 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -27,6 +27,9 @@
>  #include <linux/syscalls.h>
>  #include <linux/uaccess.h>
>
> +#include <xen/xen-ops.h>
> +#include <xen/events.h>
> +
>  #include <asm/desc.h>
>  #include <asm/traps.h>
>  #include <asm/vdso.h>
> @@ -35,6 +38,7 @@
>  #include <asm/nospec-branch.h>
>  #include <asm/io_bitmap.h>
>  #include <asm/syscall.h>
> +#include <asm/irq_stack.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/syscalls.h>
> @@ -539,7 +543,8 @@ void noinstr idtentry_enter(struct pt_regs *regs)
>         }
>  }
>
> -static __always_inline void __idtentry_exit(struct pt_regs *regs)
> +static __always_inline void __idtentry_exit(struct pt_regs *regs,
> +                                           bool preempt_hcall)
>  {
>         lockdep_assert_irqs_disabled();
>
> @@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
>                                 instrumentation_end();
>                                 return;
>                         }
> +               } else if (IS_ENABLED(CONFIG_XEN_PV)) {
> +                       if (preempt_hcall) {
> +                               /* See CONFIG_PREEMPTION above */
> +                               instrumentation_begin();
> +                               rcu_irq_exit_preempt();
> +                               xen_maybe_preempt_hcall();
> +                               trace_hardirqs_on();
> +                               instrumentation_end();
> +                               return;
> +                       }

Ewwwww!  This shouldn't be taken as a NAK -- it's just an expression of disgust.

>                 }
>                 /*
>                  * If preemption is disabled then this needs to be done
> @@ -612,5 +627,43 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
>   */
>  void noinstr idtentry_exit(struct pt_regs *regs)
>  {
> -       __idtentry_exit(regs);
> +       __idtentry_exit(regs, false);
> +}
> +
> +#ifdef CONFIG_XEN_PV
> +static void __xen_pv_evtchn_do_upcall(void)
> +{
> +       irq_enter_rcu();
> +       inc_irq_stat(irq_hv_callback_count);
> +
> +       xen_hvm_evtchn_do_upcall();
> +
> +       irq_exit_rcu();
> +}
> +
> +__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
> +{
> +       struct pt_regs *old_regs;
> +
> +       idtentry_enter(regs);
> +       old_regs = set_irq_regs(regs);
> +
> +       if (!irq_needs_irq_stack(regs)) {
> +               instrumentation_begin();
> +               __xen_pv_evtchn_do_upcall();
> +               instrumentation_end();
> +       } else {
> +               run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL);
> +       }

Shouldn't this be:

instrumentation_begin();
if (!irq_needs_irq_stack(...))
  __blah();
else
  run_on_irqstack(__blah, NULL);
instrumentation_end();

or even:

instrumentation_begin();
run_on_irqstack_if_needed(__blah, NULL);
instrumentation_end();
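
(run_on_irqstack_if_needed() is not part of the posted series; a minimal
sketch of what such a helper could look like, with the extra regs
parameter being an assumption made here for the irq_needs_irq_stack()
check:)

/*
 * Hypothetical helper, not from the posted patches: run the handler on
 * the irq stack only if the entry did not already happen on it.
 */
static void run_on_irqstack_if_needed(void (*func)(void *), void *arg,
				      struct pt_regs *regs)
{
	if (irq_needs_irq_stack(regs))
		run_on_irqstack(func, arg);
	else
		func(arg);
}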


****** BUT *******

I think this is all arse-backwards.  This is a giant mess designed to
pretend we support preemption and to emulate normal preemption in a
non-preemptible kernel.  I propose one to two massive cleanups:

A: Just delete all of this code.  Preemptible hypercalls on
non-preempt kernels will still process interrupts but won't get
preempted.  If you want preemption, compile with preemption.

B: Turn this thing around.  Specifically, in the one and only case we
care about, we know pretty much exactly what context we got this entry
in: we're running in a schedulable context doing an explicitly
preemptible hypercall, and we have RIP pointing at a SYSCALL
instruction (presumably, but we shouldn't bet on it) in the hypercall
page.  Ideally we would change the Xen PV ABI so the hypercall would
return something like EAGAIN instead of auto-restarting and we could
ditch this mess entirely.  But the ABI seems to be set in stone or at
least in molasses, so how about just:

idt_entry(exit(regs));
if (inhcall && need_resched())
  schedule();

Off the top of my head, I don't see any reason this wouldn't work, and
it's a heck of a lot cleaner.  Possibly it should really be:

if (inhcall) {
  if (!WARN_ON(regs->ip not in hypercall page))
    cond_resched();
}
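
(A more concrete C rendering of that last fragment, purely for
illustration; the bounds check against hypercall_page is an assumption
about what "regs->ip not in hypercall page" would test:)

	if (inhcall) {
		unsigned long ip = regs->ip;
		bool in_hc_page = ip >= (unsigned long)hypercall_page &&
				  ip <  (unsigned long)hypercall_page + PAGE_SIZE;

		/* Only reschedule if we really interrupted the hypercall */
		if (!WARN_ON_ONCE(!in_hc_page))
			cond_resched();
	}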

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 11/37] x86/entry/64: Simplify idtentry_body
  2020-05-15 23:45 ` [patch V6 11/37] x86/entry/64: Simplify idtentry_body Thomas Gleixner
@ 2020-05-19 17:06   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 17:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> All C functions which do not have an error code have been converted to the
> new IDTENTRY interface which does not expect an error code in the
> arguments. Spare the XORL.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-15 23:45 ` [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu() Thomas Gleixner
@ 2020-05-19 17:08   ` Andy Lutomirski
  2020-05-19 19:00     ` Thomas Gleixner
  2020-05-27  8:12   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 17:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> The pagefault handler cannot use the regular idtentry_enter() because that
> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
> problem per se, but kernel side page faults can schedule which is not
> possible without invoking rcu_irq_exit().
>
> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
> pagefault handling code would be possible, but not pretty either.
>
> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
> when RCU is not watching. The conditional RCU enabling is a correctness
> issue: A kernel page fault which hits a RCU idle region can neither
> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
> effects is at least increasing the chance for useful debug output.
>
> The function is also useful for implementing lightweight reschedule IPI and
> KVM posted interrupt IPI entry handling later.

Why is this conditional?  That is, couldn't we do this for all
idtentry_enter() calls instead of just for page faults?  Evil things
like NMI shouldn't go through this path at all.

--Andy

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
                   ` (38 preceding siblings ...)
  2020-05-18 16:07 ` Peter Zijlstra
@ 2020-05-19 18:37 ` Steven Rostedt
  2020-05-19 19:09   ` Thomas Gleixner
  39 siblings, 1 reply; 159+ messages in thread
From: Steven Rostedt @ 2020-05-19 18:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Sat, 16 May 2020 01:45:47 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> The V6 leftover series is based on:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6


$ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6
fatal: couldn't find remote ref entry-base-v6

-- Steve

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-19 17:06   ` Andy Lutomirski
@ 2020-05-19 18:57     ` Thomas Gleixner
  2020-05-19 19:44       ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19 18:57 UTC (permalink / raw)
  To: Andy Lutomirski, Andrew Cooper
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> @@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
>>                                 instrumentation_end();
>>                                 return;
>>                         }
>> +               } else if (IS_ENABLED(CONFIG_XEN_PV)) {
>> +                       if (preempt_hcall) {
>> +                               /* See CONFIG_PREEMPTION above */
>> +                               instrumentation_begin();
>> +                               rcu_irq_exit_preempt();
>> +                               xen_maybe_preempt_hcall();
>> +                               trace_hardirqs_on();
>> +                               instrumentation_end();
>> +                               return;
>> +                       }
>
> Ewwwww!  This shouldn't be taken as a NAK -- it's just an expression
> of disgust.

I'm really not proud of it, but that was the least horrible thing I
could come up with.

> Shouldn't this be:
>
> instrumentation_begin();
> if (!irq_needs_irq_stack(...))
>   __blah();
> else
>   run_on_irqstack(__blah, NULL);
> instrumentation_end();
>
> or even:
>
> instrumentation_begin();
> run_on_irqstack_if_needed(__blah, NULL);
> instrumentation_end();

Yeah. In that case the instrumentation markers are not required as they
will be inside the run....() function.

> ****** BUT *******
>
> I think this is all arse-backwards.  This is a giant mess designed to
> pretend we support preemption and to emulate normal preemption in a
> non-preemptible kernel.  I propose one to two massive cleanups:
>
> A: Just delete all of this code.  Preemptible hypercalls on
> non-preempt kernels will still process interrupts but won't get
> preempted.  If you want preemption, compile with preemption.

I'm happy to do so, but the XEN folks might have opinions on that :)

> B: Turn this thing around.  Specifically, in the one and only case we
> care about, we know pretty much exactly what context we got this entry
> in: we're running in a schedulable context doing an explicitly
> preemptible hypercall, and we have RIP pointing at a SYSCALL
> instruction (presumably, but we shouldn't bet on it) in the hypercall
> page.  Ideally we would change the Xen PV ABI so the hypercall would
> return something like EAGAIN instead of auto-restarting and we could
> ditch this mess entirely.  But the ABI seems to be set in stone or at
> least in molasses, so how about just:
>
> idt_entry(exit(regs));
> if (inhcall && need_resched())
>   schedule();

Which brings you into the situation that you call schedule() from the
point where we just moved it out. If we would go there we'd need to
ensure that RCU is watching as well. idtentry_exit() might have it
turned off ....

That's why I did it this way to keep the code flow exactly the same for
all these exit variants.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-19 17:08   ` Andy Lutomirski
@ 2020-05-19 19:00     ` Thomas Gleixner
  2020-05-19 20:20       ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19 19:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The pagefault handler cannot use the regular idtentry_enter() because that
>> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
>> problem per se, but kernel side page faults can schedule which is not
>> possible without invoking rcu_irq_exit().
>>
>> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
>> pagefault handling code would be possible, but not pretty either.
>>
>> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
>> when RCU is not watching. The conditional RCU enabling is a correctness
>> issue: A kernel page fault which hits a RCU idle region can neither
>> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
>> effects is at least increasing the chance for useful debug output.
>>
>> The function is also useful for implementing lightweight reschedule IPI and
>> KVM posted interrupt IPI entry handling later.
>
> Why is this conditional?  That is, couldn't we do this for all
> idtentry_enter() calls instead of just for page faults?  Evil things
> like NMI shouldn't go through this path at all.

I thought about that, but then ended up with the conclusion that RCU
might be unhappy, but my conclusion might be fundamentally wrong.

Paul?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-19 18:37 ` Steven Rostedt
@ 2020-05-19 19:09   ` Thomas Gleixner
  2020-05-19 19:13     ` Steven Rostedt
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19 19:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Steven Rostedt <rostedt@goodmis.org> writes:

> On Sat, 16 May 2020 01:45:47 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> The V6 leftover series is based on:
>> 
>>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6
>
>
> $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6
> fatal: couldn't find remote ref entry-base-v6

try entry-v6-base or better entry-v8-base

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-19 19:09   ` Thomas Gleixner
@ 2020-05-19 19:13     ` Steven Rostedt
  0 siblings, 0 replies; 159+ messages in thread
From: Steven Rostedt @ 2020-05-19 19:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Tue, 19 May 2020 21:09:48 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Steven Rostedt <rostedt@goodmis.org> writes:
> 
> > On Sat, 16 May 2020 01:45:47 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >  
> >> The V6 leftover series is based on:
> >> 
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6  
> >
> >
> > $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git entry-base-v6
> > fatal: couldn't find remote ref entry-base-v6  
> 
> try entry-v6-base or better entry-v8-base

Ah, I should have had a V8!

-- Steve

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-19 18:57     ` Thomas Gleixner
@ 2020-05-19 19:44       ` Andy Lutomirski
  2020-05-20  8:06         ` Jürgen Groß
  2020-05-20 14:13         ` Thomas Gleixner
  0 siblings, 2 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 19:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Andrew Cooper, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> @@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
> >>                                 instrumentation_end();
> >>                                 return;
> >>                         }
> >> +               } else if (IS_ENABLED(CONFIG_XEN_PV)) {
> >> +                       if (preempt_hcall) {
> >> +                               /* See CONFIG_PREEMPTION above */
> >> +                               instrumentation_begin();
> >> +                               rcu_irq_exit_preempt();
> >> +                               xen_maybe_preempt_hcall();
> >> +                               trace_hardirqs_on();
> >> +                               instrumentation_end();
> >> +                               return;
> >> +                       }
> >
> > Ewwwww!  This shouldn't be taken as a NAK -- it's just an expression
> > of disgust.
>
> I'm really not proud of it, but that was the least horrible thing I
> could come up with.
>
> > Shouldn't this be:
> >
> > instrumentation_begin();
> > if (!irq_needs_irq_stack(...))
> >   __blah();
> > else
> >   run_on_irqstack(__blah, NULL);
> > instrumentation_end();
> >
> > or even:
> >
> > instrumentation_begin();
> > run_on_irqstack_if_needed(__blah, NULL);
> > instrumentation_end();
>
> Yeah. In that case the instrumentation markers are not required as they
> will be inside the run....() function.
>
> > ****** BUT *******
> >
> > I think this is all arse-backwards.  This is a giant mess designed to
> > pretend we support preemption and to emulate normal preemption in a
> > non-preemptible kernel.  I propose one to two massive cleanups:
> >
> > A: Just delete all of this code.  Preemptible hypercalls on
> > non-preempt kernels will still process interrupts but won't get
> > preempted.  If you want preemption, compile with preemption.
>
> I'm happy to do so, but the XEN folks might have opinions on that :)
>
> > B: Turn this thing around.  Specifically, in the one and only case we
> > care about, we know pretty much exactly what context we got this entry
> > in: we're running in a schedulable context doing an explicitly
> > preemptible hypercall, and we have RIP pointing at a SYSCALL
> > instruction (presumably, but we shouldn't bet on it) in the hypercall
> > page.  Ideally we would change the Xen PV ABI so the hypercall would
> > return something like EAGAIN instead of auto-restarting and we could
> > ditch this mess entirely.  But the ABI seems to be set in stone or at
> > least in molasses, so how about just:
> >
> > idt_entry(exit(regs));
> > if (inhcall && need_resched())
> >   schedule();
>
> Which brings you into the situation that you call schedule() from the
> point where we just moved it out. If we would go there we'd need to
> ensure that RCU is watching as well. idtentry_exit() might have it
> turned off ....

I don't think this is possible.  Once you untangle all the wrappers,
the call sites are effectively:

__this_cpu_write(xen_in_preemptible_hcall, true);
CALL_NOSPEC to the hypercall page
__this_cpu_write(xen_in_preemptible_hcall, false);

I think IF=1 when this happens, but I won't swear to it.  RCU had
better be watching.
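
(For reference, the flag gets flipped via helpers along these lines; the
exact home and #ifdef guards are quoted from memory, so treat this as a
sketch of the xen-ops.h helpers rather than the file itself:)

DECLARE_PER_CPU(bool, xen_in_preemptible_hcall);

static inline void xen_preemptible_hcall_begin(void)
{
	__this_cpu_write(xen_in_preemptible_hcall, true);
}

static inline void xen_preemptible_hcall_end(void)
{
	__this_cpu_write(xen_in_preemptible_hcall, false);
}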

As I understand it, the one and only situation Xen wants to handle is
that an interrupt gets delivered during the hypercall.  The hypervisor
is too clever for its own good and deals with this by rewinding RIP to
the beginning of whatever instruction did the hypercall and delivers
the interrupt, and we end up in this handler.  So, if this happens,
the idea is to not only handle the interrupt but to schedule if
scheduling would be useful.

So I don't think we need all this RCU magic.  This really ought to be
able to be simplified to:

idtentry_exit();

if (appropriate condition)
  schedule();

Obviously we don't want to schedule if this is a nested entry, but we
should be able to rule that out by checking that regs->flags &
X86_EFLAGS_IF and by handling the percpu variable a little more
intelligently.  So maybe the right approach is:

bool in_preemptible_hcall = __this_cpu_read(xen_in_preemptible_hcall);
__this_cpu_write(xen_in_preemptible_hcall, false);
idtentry_enter(...);

> do the actual work;

idtentry_exit(...);

if (in_preemptible_hcall) {
  assert regs->flags & X86_EFLAGS_IF;
  assert that RCU is watching;
  assert that we're on the thread stack;
  assert whatever else we feel like asserting;
  if (need_resched())
    schedule();
}

__this_cpu_write(xen_in_preemptible_hcall, in_preemptible_hcall);

And now we don't have a special idtentry_exit() case just for Xen, and
all the mess is entirely contained in the Xen PV code.  And we need to
mark all the preemptible hypercalls noinstr.  Does this seem
reasonable?

That being said, right now, with or without your patch, I think we're
toast if the preemptible hypercall code gets traced.  So maybe the
right thing is to just drop all the magic preemption stuff from your
patch and let the Xen maintainers submit something new (maybe like
what I suggest above) if they want magic preemption back.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 13/37] x86/entry: Switch page fault exception to IDTENTRY_RAW
  2020-05-15 23:46 ` [patch V6 13/37] x86/entry: Switch page fault exception to IDTENTRY_RAW Thomas Gleixner
@ 2020-05-19 20:12   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert page fault exceptions to IDTENTRY_RAW:
>   - Implement the C entry point with DEFINE_IDTENTRY_RAW
>   - Add the CR2 read into the exception handler
>   - Add the idtentry_enter/exit_cond_rcu() invocations in
>     in the regular page fault handler and use the regular
>     idtentry_enter/exit() for the async PF part.
>   - Emit the ASM stub with DECLARE_IDTENTRY_RAW
>   - Remove the ASM idtentry in 64bit
>   - Remove the CR2 read from 64bit
>   - Remove the open coded ASM entry code in 32bit
>   - Fixup the XEN/PV code
>   - Remove the old prototypes
>

Acked-by: Andy Lutomirski <luto@kernel.org>

although if you make the irq_enter_cond_rcu() mode unconditional, then
this comment can go away too:

> +       /*
> +        * Entry handling for valid #PF from kernel mode is slightly
> +        * different: RCU is already watching and rcu_irq_enter() must not
> +        * be invoked because a kernel fault on a user space address might
> +        * sleep.
> +        *
> +        * In case the fault hit a RCU idle region the conditional entry
> +        * code reenabled RCU to avoid subsequent wreckage which helps
> +        * debugability.
> +        */
> +       rcu_exit = idtentry_enter_cond_rcu(regs);

--Andy

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 14/37] x86/entry: Remove the transition leftovers
  2020-05-15 23:46 ` [patch V6 14/37] x86/entry: Remove the transition leftovers Thomas Gleixner
@ 2020-05-19 20:13   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Now that all exceptions are converted over, the sane flag is no longer
> needed. Also the vector argument of idtentry_body on 64 bit is pointless
> now.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 15/37] x86/entry: Change exit path of xen_failsafe_callback
  2020-05-15 23:46 ` [patch V6 15/37] x86/entry: Change exit path of xen_failsafe_callback Thomas Gleixner
@ 2020-05-19 20:14   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> xen_failsafe_callback is invoked from XEN for two cases:
>
>   1. Fault while reloading DS, ES, FS or GS
>   2. Fault while executing IRET
>
> #1 retries the IRET after XEN has fixed up the segments.
> #2 injects a #GP which kills the task
>
> For #1 there is no reason to go through the full exception return path
> because the task's TIF state is still the same. So just going straight to
> the IRET path is good enough.

Seems reasonable:

Acked-by: Andy Lutomirski <luto@kernel.org>

Although a look from a Xen person might be nice too.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 16/37] x86/entry/64: Remove error_exit
  2020-05-15 23:46 ` [patch V6 16/37] x86/entry/64: Remove error_exit Thomas Gleixner
@ 2020-05-19 20:14   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> No more users.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 17/37] x86/entry/32: Remove common_exception
  2020-05-15 23:46 ` [patch V6 17/37] x86/entry/32: Remove common_exception Thomas Gleixner
@ 2020-05-19 20:14   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> No more users.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs
  2020-05-15 23:46 ` [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs Thomas Gleixner
@ 2020-05-19 20:19   ` Andy Lutomirski
  2020-05-21 13:22     ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Device interrupts which go through do_IRQ() or the spurious interrupt
> handler have their separate entry code on 64 bit for no good reason.
>
> Both 32 and 64 bit transport the vector number through ORIG_[RE]AX in
> pt_regs. Further the vector number is forced to fit into an u8 and is
> complemented and offset by 0x80 so it's in the signed character
> range. Otherwise GAS would expand the pushq to a 5 byte instruction for any
> vector > 0x7F.
>
> Treat the vector number like an error code and hand it to the C function as
> argument. This allows to get rid of the extra entry code in a later step.
>
> Simplify the error code push magic by implementing the pushq imm8 via a
> '.byte 0x6a, vector' sequence so GAS is not able to screw it up. As the
> pushq imm8 is sign extending the resulting error code needs to be truncated
> to 8 bits in C code.


Acked-by: Andy Lutomirski <luto@kernel.org>

although you may be giving me more credit than deserved :)

> +       .align 8
> +SYM_CODE_START(irq_entries_start)
> +    vector=FIRST_EXTERNAL_VECTOR
> +    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
> +       UNWIND_HINT_IRET_REGS
> +       .byte   0x6a, vector
> +       jmp     common_interrupt
> +       .align  8
> +    vector=vector+1
> +    .endr
> +SYM_CODE_END(irq_entries_start)

Having battled code like this in the past (for early exceptions), I
prefer the variant like:

pos = .;
.rept blah blah blah
  .byte whatever
  jmp whatever
  . = pos + 8;
 vector = vector + 1
.endr

or maybe:

.rept blah blah blah
  .byte whatever
  jmp whatever;
  . = irq_entries_start + 8 * vector;
  vector = vector + 1
.endr

The reason is that these variants will fail to assemble if something
goes wrong and the code expands to more than 8 bytes, whereas using
.align will cause gas to happily emit 16 bytes and result in
hard-to-debug mayhem.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-19 19:00     ` Thomas Gleixner
@ 2020-05-19 20:20       ` Thomas Gleixner
  2020-05-19 20:24         ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19 20:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Thomas Gleixner <tglx@linutronix.de> writes:
> Andy Lutomirski <luto@kernel.org> writes:
>> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> The pagefault handler cannot use the regular idtentry_enter() because that
>>> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
>>> problem per se, but kernel side page faults can schedule which is not
>>> possible without invoking rcu_irq_exit().
>>>
>>> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
>>> pagefault handling code would be possible, but not pretty either.
>>>
>>> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
>>> when RCU is not watching. The conditional RCU enabling is a correctness
>>> issue: A kernel page fault which hits a RCU idle region can neither
>>> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
>>> effects is at least increasing the chance for useful debug output.
>>>
>>> The function is also useful for implementing lightweight reschedule IPI and
>>> KVM posted interrupt IPI entry handling later.
>>
>> Why is this conditional?  That is, couldn't we do this for all
>> idtentry_enter() calls instead of just for page faults?  Evil things
>> like NMI shouldn't go through this path at all.
>
> I thought about that, but then ended up with the conclusion that RCU
> might be unhappy, but my conclusion might be fundamentally wrong.

It's about this:

rcu_nmi_enter()
{
        if (!rcu_is_watching()) {
            make it watch;
        } else if (!in_nmi()) {
            do_magic_nohz_dyntick_muck();
        }

So if we do all irq/system vector entries conditional then the
do_magic() gets never executed. After that I got lost...

Thanks,

         tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 20/37] x86/irq/64: Provide handle_irq()
  2020-05-15 23:46 ` [patch V6 20/37] x86/irq/64: Provide handle_irq() Thomas Gleixner
@ 2020-05-19 20:21   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> To consolidate the interrupt entry/exit code vs. the other exceptions
> provide handle_irq() (similar to 32bit) to move the interrupt stack
> switching to C code. That allows to consolidate the entry exit handling by
> reusing the idtentry machinery both in ASM and C.

Reviewed-by: Andy Lutomirski <luto@kernel.org>

>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
> index 62cff52e03c5..6087164e581c 100644
> --- a/arch/x86/kernel/irq_64.c
> +++ b/arch/x86/kernel/irq_64.c
> @@ -79,3 +79,11 @@ void do_softirq_own_stack(void)
>         else
>                 run_on_irqstack(__do_softirq, NULL);
>  }
> +
> +void handle_irq(struct irq_desc *desc, struct pt_regs *regs)
> +{
> +       if (!irq_needs_irq_stack(regs))
> +               generic_handle_irq_desc(desc);
> +       else
> +               run_on_irqstack(desc->handle_irq, desc);
> +}
>

Would this be nicer if you open-coded desc->handle_irq(desc) in the if
branch to make it look less weird?

This also goes away if you make the run_on_irqstack_if_needed() change
I suggested.
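
(The open-coded variant is a one-liner in the if branch; a sketch, using
what generic_handle_irq_desc() expands to. With a
run_on_irqstack_if_needed() style helper, as sketched earlier in the
thread, the whole body would collapse into a single call:)

void handle_irq(struct irq_desc *desc, struct pt_regs *regs)
{
	if (!irq_needs_irq_stack(regs))
		desc->handle_irq(desc);	/* what generic_handle_irq_desc() does */
	else
		run_on_irqstack(desc->handle_irq, desc);
}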

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-19 20:20       ` Thomas Gleixner
@ 2020-05-19 20:24         ` Andy Lutomirski
  2020-05-19 21:20           ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Andy Lutomirski <luto@kernel.org> writes:
> >> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> The pagefault handler cannot use the regular idtentry_enter() because that
> >>> invokes rcu_irq_enter() if the pagefault was caused in the kernel. Not a
> >>> problem per se, but kernel side page faults can schedule which is not
> >>> possible without invoking rcu_irq_exit().
> >>>
> >>> Adding rcu_irq_exit() and a matching rcu_irq_enter() into the actual
> >>> pagefault handling code would be possible, but not pretty either.
> >>>
> >>> Provide idtentry_entry/exit_cond_rcu() which calls rcu_irq_enter() only
> >>> when RCU is not watching. The conditional RCU enabling is a correctness
> >>> issue: A kernel page fault which hits a RCU idle region can neither
> >>> schedule nor is it likely to survive. But avoiding RCU warnings or RCU side
> >>> effects is at least increasing the chance for useful debug output.
> >>>
> >>> The function is also useful for implementing lightweight reschedule IPI and
> >>> KVM posted interrupt IPI entry handling later.
> >>
> >> Why is this conditional?  That is, couldn't we do this for all
> >> idtentry_enter() calls instead of just for page faults?  Evil things
> >> like NMI shouldn't go through this path at all.
> >
> > I thought about that, but then ended up with the conclusion that RCU
> > might be unhappy, but my conclusion might be fundamentally wrong.
>
> It's about this:
>
> rcu_nmi_enter()
> {
>         if (!rcu_is_watching()) {
>             make it watch;
>         } else if (!in_nmi()) {
>             do_magic_nohz_dyntick_muck();
>         }
>
> So if we do all irq/system vector entries conditional then the
> do_magic() gets never executed. After that I got lost...

I'm also baffled by that magic, but I'm also not suggesting doing this
to *all* entries -- just the not-super-magic ones that use
idtentry_enter().

Paul, what is this code actually trying to do?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 21/37] x86/entry: Add IRQENTRY_IRQ macro
  2020-05-15 23:46 ` [patch V6 21/37] x86/entry: Add IRQENTRY_IRQ macro Thomas Gleixner
@ 2020-05-19 20:27   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Provide a separate IDTENTRY macro for device interrupts. Similar to
> IDTENTRY_ERRORCODE with the addition of invoking irq_enter/exit_rcu() and
> providing the errorcode as a 'u8' argument to the C function, which
> truncates the sign extended vector number.

Acked-by: Andy Lutomirski <luto@kernel.org>

with a minor minor optimization suggestion:

> +.macro idtentry_irq vector cfunc
> +       .p2align CONFIG_X86_L1_CACHE_SHIFT
> +SYM_CODE_START_LOCAL(asm_\cfunc)
> +       ASM_CLAC
> +       SAVE_ALL switch_stacks=1
> +       ENCODE_FRAME_POINTER
> +       movl    %esp, %eax
> +       movl    PT_ORIG_EAX(%esp), %edx         /* get the vector from stack */

You could save somewhere between 0 and 1 cycles by using movzbl here...

> +       __##func (regs, (u8)error_code);                                \

And eliminating this cast.  Totally worth it, right?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 22/37] x86/entry: Use idtentry for interrupts
  2020-05-15 23:46 ` [patch V6 22/37] x86/entry: Use idtentry for interrupts Thomas Gleixner
@ 2020-05-19 20:28   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 20:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Replace the extra interrupt handling code and reuse the existing idtentry
> machinery. This moves the irq stack switching on 64 bit from ASM to C code;
> 32bit already does the stack switching in C.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-19 20:24         ` Andy Lutomirski
@ 2020-05-19 21:20           ` Thomas Gleixner
  2020-05-20  0:26             ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19 21:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>> It's about this:
>>
>> rcu_nmi_enter()
>> {
>>         if (!rcu_is_watching()) {
>>             make it watch;
>>         } else if (!in_nmi()) {
>>             do_magic_nohz_dyntick_muck();
>>         }
>>
>> So if we do all irq/system vector entries conditional then the
>> do_magic() gets never executed. After that I got lost...
>
> I'm also baffled by that magic, but I'm also not suggesting doing this
> to *all* entries -- just the not-super-magic ones that use
> idtentry_enter().
>
> Paul, what is this code actually trying to do?

Citing Paul from IRC:

  "The way things are right now, you can leave out the rcu_irq_enter()
   if this is not a nohz_full CPU.

   Or if this is a nohz_full CPU, and the tick is already
   enabled, in that case you could also leave out the rcu_irq_enter().

   Or even if this is a nohz_full CPU and it does not have the tick
   enabled, if it has been in the kernel less than a few tens of
   milliseconds, still OK to avoid invoking rcu_irq_enter()

   But my guess is that it would be a lot simpler to just always call
   it."

Hope that helps.
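
(Rendered as code purely for illustration: tick_nohz_full_cpu() and
tick_nohz_tick_stopped() exist, while the "recently entered the kernel"
check is left as a made-up placeholder.)

static bool needs_rcu_irq_enter(void)
{
	if (!tick_nohz_full_cpu(smp_processor_id()))
		return false;		/* not a nohz_full CPU */
	if (!tick_nohz_tick_stopped())
		return false;		/* tick is already running */
	if (kernel_entry_was_recent())	/* placeholder, not a real API */
		return false;
	return true;
}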

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-15 23:45 ` [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns() Thomas Gleixner
@ 2020-05-19 21:26   ` Steven Rostedt
  2020-05-19 21:45     ` Thomas Gleixner
  2020-05-20 20:14   ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Steven Rostedt @ 2020-05-19 21:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Sat, 16 May 2020 01:45:48 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> Timestamping in the hardware latency detector uses sched_clock() underneath
> and depends on CONFIG_GENERIC_SCHED_CLOCK=n because sched clocks from that
> subsystem are not NMI safe.
> 
> ktime_get_mono_fast_ns() is NMI safe and available on all architectures.
> 
> Replace the time getter, get rid of the CONFIG_GENERIC_SCHED_CLOCK=n
> dependency and cleanup the horrible macro maze which encapsulates u64 math
> in u64 macros.

Good riddance. That macro maze was due to supporting the same code upstream
as we had in RHEL RT, where the math and time keeping functions available
between the two kernels were different.

That's been dealt with, but the macros never got cleaned up.

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/trace/trace_hwlat.c |   59 +++++++++++++++++++--------------------------
>  1 file changed, 25 insertions(+), 34 deletions(-)
> 
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -131,29 +131,19 @@ static void trace_hwlat_sample(struct hw
>  		trace_buffer_unlock_commit_nostack(buffer, event);
>  }
>  
> -/* Macros to encapsulate the time capturing infrastructure */
> -#define time_type	u64
> -#define time_get()	trace_clock_local()
> -#define time_to_us(x)	div_u64(x, 1000)
> -#define time_sub(a, b)	((a) - (b))
> -#define init_time(a, b)	(a = b)
> -#define time_u64(a)	a
> -
> +/*
> + * Timestamping uses ktime_get_mono_fast(), the NMI safe access to
> + * CLOCK_MONOTONIC.
> + */
>  void trace_hwlat_callback(bool enter)
>  {
>  	if (smp_processor_id() != nmi_cpu)
>  		return;
>  
> -	/*
> -	 * Currently trace_clock_local() calls sched_clock() and the
> -	 * generic version is not NMI safe.
> -	 */
> -	if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
> -		if (enter)
> -			nmi_ts_start = time_get();
> -		else
> -			nmi_total_ts += time_get() - nmi_ts_start;
> -	}
> +	if (enter)
> +		nmi_ts_start = ktime_get_mono_fast_ns();
> +	else
> +		nmi_total_ts += ktime_get_mono_fast_ns() - nmi_ts_start;
>  
>  	if (enter)
>  		nmi_count++;
> @@ -165,20 +155,22 @@ void trace_hwlat_callback(bool enter)
>   * Used to repeatedly capture the CPU TSC (or similar), looking for potential
>   * hardware-induced latency. Called with interrupts disabled and with
>   * hwlat_data.lock held.
> + *
> + * Use ktime_get_mono_fast() here as well because it does not wait on the
> + * timekeeping seqcount like ktime_get_mono().

When doing a "git grep ktime_get_mono" I only find
ktime_get_mono_fast_ns() (and this comment), so I don't know what to compare
that to. Did you mean another function?

The rest looks fine (although, I see other things I need to clean up in
this code ;-)

-- Steve


>   */
>  static int get_sample(void)
>  {
>  	struct trace_array *tr = hwlat_trace;
>  	struct hwlat_sample s;
> -	time_type start, t1, t2, last_t2;
> +	u64 start, t1, t2, last_t2, thresh;
>  	s64 diff, outer_diff, total, last_total = 0;
>  	u64 sample = 0;
> -	u64 thresh = tracing_thresh;
>  	u64 outer_sample = 0;
>  	int ret = -1;
>  	unsigned int count = 0;
>  

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-19 21:26   ` Steven Rostedt
@ 2020-05-19 21:45     ` Thomas Gleixner
  2020-05-19 22:18       ` Steven Rostedt
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-19 21:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Steven Rostedt <rostedt@goodmis.org> writes:
> On Sat, 16 May 2020 01:45:48 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> +	if (enter)
>> +		nmi_ts_start = ktime_get_mono_fast_ns();
>> +	else
>> +		nmi_total_ts += ktime_get_mono_fast_ns() - nmi_ts_start;
>>  
>>  	if (enter)
>>  		nmi_count++;
>> @@ -165,20 +155,22 @@ void trace_hwlat_callback(bool enter)
>>   * Used to repeatedly capture the CPU TSC (or similar), looking for potential
>>   * hardware-induced latency. Called with interrupts disabled and with
>>   * hwlat_data.lock held.
>> + *
>> + * Use ktime_get_mono_fast() here as well because it does not wait on the
>> + * timekeeping seqcount like ktime_get_mono().
>
> When doing a "git grep ktime_get_mono" I only find
> ktime_get_mono_fast_ns() (and this comment), so I don't know what to compare
> that to. Did you mean another function?

Yeah. I fatfingered the comment. The code uses ktime_get_mono_fast_ns().

> The rest looks fine (although, I see other things I need to clean up in
> this code ;-)

Quite some ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-19 21:45     ` Thomas Gleixner
@ 2020-05-19 22:18       ` Steven Rostedt
  2020-05-20 19:51         ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Steven Rostedt @ 2020-05-19 22:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Tue, 19 May 2020 23:45:10 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> >> @@ -165,20 +155,22 @@ void trace_hwlat_callback(bool enter)
> >>   * Used to repeatedly capture the CPU TSC (or similar), looking for potential
> >>   * hardware-induced latency. Called with interrupts disabled and with
> >>   * hwlat_data.lock held.
> >> + *
> >> + * Use ktime_get_mono_fast() here as well because it does not wait on the
> >> + * timekeeping seqcount like ktime_get_mono().  
> >
> > When doing a "git grep ktime_get_mono" I only find
> > ktime_get_mono_fast_ns() (and this comment), so I don't know what to compare
> > that to. Did you mean another function?  
> 
> Yeah. I fatfingered the comment. The code uses ktime_get_mono_fast_ns().

Well, I assumed that's what you meant with "ktime_get_mono_fast()" but I
don't know what function you are comparing it to that waits on the seqcount
like "ktime_get_mono()" as there is no such function.

-- Steve

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 02/37] tracing/hwlat: Split ftrace_nmi_enter/exit()
  2020-05-15 23:45 ` [patch V6 02/37] tracing/hwlat: Split ftrace_nmi_enter/exit() Thomas Gleixner
@ 2020-05-19 22:23   ` Steven Rostedt
  0 siblings, 0 replies; 159+ messages in thread
From: Steven Rostedt @ 2020-05-19 22:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Sat, 16 May 2020 01:45:49 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> The hardware latency tracer calls into timekeeping and ends up in
> various instrumentable functions which is problematic vs. the kprobe
> handling especially the text poke machinery. It's invoked from
> nmi_enter/exit(), i.e. non-instrumentable code.
> 
> Split it into two parts:
> 
>  1) NMI counter, only invoked on nmi_enter() and noinstr safe
> 
>  2) NMI timestamping, to be invoked from instrumentable code
> 
> Move it into the rcu is watching regions of nmi_enter/exit() even
> if there is no actual RCU dependency right now but there is also
> no point in having it early.
> 
> The actual split of nmi_enter/exit() is done in a separate step.
> 
> Requested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace()
  2020-05-15 23:45 ` [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace() Thomas Gleixner
  2020-05-17  5:12   ` Andy Lutomirski
@ 2020-05-19 22:24   ` Steven Rostedt
  1 sibling, 0 replies; 159+ messages in thread
From: Steven Rostedt @ 2020-05-19 22:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Sat, 16 May 2020 01:45:50 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> To fully isolate #DB and #BP from instrumentable code it's necessary to
> avoid invoking the hardware latency tracer on nmi_enter/exit().
> 
> Provide nmi_enter/exit() variants which are not invoking the hardware
> latency tracer. That allows putting calls explicitly into the call sites
> outside of the kprobe handling.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW
  2020-05-15 23:46 ` [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW Thomas Gleixner
@ 2020-05-19 23:57   ` Andy Lutomirski
  2020-05-20 15:08     ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-19 23:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> The scheduler IPI does not need the full interrupt entry handling logic
> when the entry is from kernel mode.
>
> Even if tracing is enabled the only requirement is that RCU is watching and
> preempt_count has the hardirq bit on.
>
> The NOHZ tick state does not have to be adjusted. If the tick is not
> running then the CPU is in idle and the idle exit will restore the
> tick. Softinterrupts are not raised here, so handling them on return is not
> required either.
>
> User mode entry must go through the regular entry path as it will invoke
> the scheduler on return so context tracking needs to be in the correct
> state.
>
> Use IDTENTRY_RAW and the RCU conditional variants of idtentry_enter/exit()
> to guarantee that RCU is watching even if the IPI hits a RCU idle section.
>
> Remove the tracepoint static key conditional which is incomplete
> vs. tracing anyway because e.g. ack_APIC_irq() calls out into
> instrumentable code.
>
> Avoid the overhead of irq time accounting and introduce variants of
> __irq_enter/exit() so instrumentation observes the correct preempt count
> state.

Leftover text from an old version?

The code is Reviewed-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-19 21:20           ` Thomas Gleixner
@ 2020-05-20  0:26             ` Andy Lutomirski
  2020-05-20  2:23               ` Paul E. McKenney
  2020-05-20 14:19               ` Thomas Gleixner
  0 siblings, 2 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Thomas Gleixner <tglx@linutronix.de> writes:
> >> It's about this:
> >>
> >> rcu_nmi_enter()
> >> {
> >>         if (!rcu_is_watching()) {
> >>             make it watch;
> >>         } else if (!in_nmi()) {
> >>             do_magic_nohz_dyntick_muck();
> >>         }
> >>
> >> So if we make all irq/system vector entries conditional then the
> >> do_magic() never gets executed. After that I got lost...
> >
> > I'm also baffled by that magic, but I'm also not suggesting doing this
> > to *all* entries -- just the not-super-magic ones that use
> > idtentry_enter().
> >
> > Paul, what is this code actually trying to do?
>
> Citing Paul from IRC:
>
>   "The way things are right now, you can leave out the rcu_irq_enter()
>    if this is not a nohz_full CPU.
>
>    Or if this is a nohz_full CPU, and the tick is already
>    enabled, in that case you could also leave out the rcu_irq_enter().
>
>    Or even if this is a nohz_full CPU and it does not have the tick
>    enabled, if it has been in the kernel less than a few tens of
>    milliseconds, still OK to avoid invoking rcu_irq_enter()
>
>    But my guess is that it would be a lot simpler to just always call
>    it.
>
> Hope that helps.

Maybe?

Unless I've missed something, the effect here is that #PF hitting in
an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
(because you converted them) as well as other faults and traps will
call rcu_irq_enter().

Once upon a time, we did this horrible thing where, on entry from user
mode, we would turn on interrupts while still in CONTEXT_USER, which
means we could get an IRQ in an extended quiescent state.  This means
that the IRQ code had to end the EQS so that IRQ handlers could use
RCU.  But I killed this a few years ago -- x86 Linux now has a rule
that, if IF=1, we are *not* in an EQS with the sole exception of the
idle code.
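
To make that rule explicit, it could be written as an assertion along
these lines (hypothetical helper, not something from this series):

        /* If IF=1 we must not be in an EQS, except in the idle task. */
        static __always_inline void assert_not_in_eqs(struct pt_regs *regs)
        {
                if (regs->flags & X86_EFLAGS_IF)
                        WARN_ON_ONCE(!rcu_is_watching() &&
                                     !is_idle_task(current));
        }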

In my dream world, we would never ever get IRQs while in an EQS -- we
would do MWAIT with IF=0 and we would exit the EQS before taking the
interrupt.  But I guess we still need to support HLT, which means we
have this mess.

But I still think we can plausibly get rid of the conditional.  If we
get an IRQ or (egads!) a fault in idle context, we'll have
!__rcu_is_watching(), but, AFAICT, we also have preemption off.  So it
should be okay to do rcu_irq_enter().  OTOH, if we get an IRQ or a
fault anywhere else, then we either have a severe bug in the RCU code
itself and the RCU code faulted (in which case we get what we deserve)
or RCU is watching and all is well.  This means that the rule will be
that, if preemption is on, it's fine to schedule inside an
idtentry_begin()/idtentry_end() pair.
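
Roughly, the kernel-mode enter side could then be as dumb as this
(untested sketch, helper name made up):

        static __always_inline void idtentry_kernel_rcu_enter(void)
        {
                /*
                 * Either RCU is already watching, or we interrupted the
                 * idle/EQS code, which per the argument above runs with
                 * preemption off, so telling RCU unconditionally is fine.
                 */
                rcu_irq_enter();
        }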

The remaining bit is just the urgent thing, and I don't understand
what's going on.  Paul, could we split out the urgent logic all by
itself so that the IRQ handlers could do rcu_poke_urgent()?  Or am I
entirely misunderstanding its purpose?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 24/37] x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC
  2020-05-15 23:46 ` [patch V6 24/37] x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC Thomas Gleixner
@ 2020-05-20  0:27   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert APIC interrupts to IDTENTRY_SYSVEC
>   - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
>   - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
>   - Remove the ASM idtentries in 64bit
>   - Remove the BUILD_INTERRUPT entries in 32bit
>   - Remove the old prototypes
>
> No functional change.
>

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 25/37] x86/entry: Convert SMP system vectors to IDTENTRY_SYSVEC
  2020-05-15 23:46 ` [patch V6 25/37] x86/entry: Convert SMP system vectors " Thomas Gleixner
@ 2020-05-20  0:28   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert SMP system vectors to IDTENTRY_SYSVEC
>   - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
>   - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
>   - Remove the ASM idtentries in 64bit
>   - Remove the BUILD_INTERRUPT entries in 32bit
>   - Remove the old prototypes

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC
  2020-05-15 23:46 ` [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC Thomas Gleixner
@ 2020-05-20  0:29   ` Andy Lutomirski
  2020-05-20 15:07     ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Provide an IDTENTRY variant for system vectors to consolidate the different
> mechanisms to emit the ASM stubs for 32 and 64 bit.
>
> On 64bit this also moves the stack switching from ASM to C code. 32bit will
> execute the system vectors w/o stack switching as before. As some of the
> system vector handlers require access to pt_regs, this requires a new stack
> switching macro which can handle an argument.

Is that last sentence obsolete?

Otherwise,

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 27/37] x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC
  2020-05-15 23:46 ` [patch V6 27/37] x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC Thomas Gleixner
@ 2020-05-20  0:30   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert KVM specific system vectors to IDTENTRY_SYSVEC*:
>
> The two empty stub handlers which only increment the stats counter do not
> need to run on the interrupt stack. Use IDTENTRY_SYSVEC_DIRECT for them.
>
> The wakeup handler does more work and runs on the interrupt stack.
>
> None of these handlers need to save and restore the irq_regs pointer.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 26/37] x86/entry: Convert various system vectors
  2020-05-15 23:46 ` [patch V6 26/37] x86/entry: Convert various system vectors Thomas Gleixner
@ 2020-05-20  0:30   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert various system vectors to IDTENTRY_SYSVEC
>   - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
>   - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
>   - Remove the ASM idtentries in 64bit
>   - Remove the BUILD_INTERRUPT entries in 32bit
>   - Remove the old prototypes
>
> No functional change.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 28/37] x86/entry: Convert various hypervisor vectors to IDTENTRY_SYSVEC
  2020-05-15 23:46 ` [patch V6 28/37] x86/entry: Convert various hypervisor " Thomas Gleixner
@ 2020-05-20  0:31   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert various hypervisor vectors to IDTENTRY_SYSVEC
>   - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
>   - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
>   - Remove the ASM idtentries in 64bit
>   - Remove the BUILD_INTERRUPT entries in 32bit
>   - Remove the old prototypes
>
> No functional change.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 29/37] x86/entry: Convert XEN hypercall vector to IDTENTRY_SYSVEC
  2020-05-15 23:46 ` [patch V6 29/37] x86/entry: Convert XEN hypercall vector " Thomas Gleixner
@ 2020-05-20  0:31   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Convert the last oldstyle defined vector to IDTENTRY_SYSVEC
>   - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
>   - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
>   - Remove the ASM idtentries in 64bit
>   - Remove the BUILD_INTERRUPT entries in 32bit
>   - Remove the old prototypes

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 31/37] x86/entry: Remove the apic/BUILD interrupt leftovers
  2020-05-15 23:46 ` [patch V6 31/37] x86/entry: Remove the apic/BUILD interrupt leftovers Thomas Gleixner
@ 2020-05-20  0:32   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Remove all the code which was there to emit the system vector stubs. All
> users are gone.
>
> Move the now unused GET_CR2_INTO macro muck to head_64.S where the last
> user is. Fixup the eye hurting comment there while at it.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 32/37] x86/entry/64: Remove IRQ stack switching ASM
  2020-05-15 23:46 ` [patch V6 32/37] x86/entry/64: Remove IRQ stack switching ASM Thomas Gleixner
@ 2020-05-20  0:33   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 33/37] x86/entry: Make enter_from_user_mode() static
  2020-05-15 23:46 ` [patch V6 33/37] x86/entry: Make enter_from_user_mode() static Thomas Gleixner
@ 2020-05-20  0:34   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> The ASM users are gone. All callers are local.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 34/37] x86/entry/32: Remove redundant irq disable code
  2020-05-15 23:46 ` [patch V6 34/37] x86/entry/32: Remove redundant irq disable code Thomas Gleixner
@ 2020-05-20  0:35   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> All exceptions/interrupts return with interrupts disabled now. No point in
> doing this in ASM again.

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 35/37] x86/entry/64: Remove TRACE_IRQS_*_DEBUG
  2020-05-15 23:46 ` [patch V6 35/37] x86/entry/64: Remove TRACE_IRQS_*_DEBUG Thomas Gleixner
@ 2020-05-20  0:46   ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> Since INT3/#BP no longer runs on an IST, this workaround is no longer
> required.
>
> Tested by running lockdep+ftrace as described in the initial commit:
>
>   5963e317b1e9 ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")

Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code
  2020-05-15 23:46 ` [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code Thomas Gleixner
@ 2020-05-20  0:53   ` Andy Lutomirski
  2020-05-20 15:16     ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20  0:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Fri, May 15, 2020 at 5:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>

I think something's missing here.  With this patch applied, don't we
get to exc_debug_kernel() -> handle_debug() without doing
idtentry_enter() or equivalent?  And that can even enable IRQs.

Maybe exc_debug_kernel() should wrap handle_debug() in some
appropriate _enter() / _exit() pair?

--Andy

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20  0:26             ` Andy Lutomirski
@ 2020-05-20  2:23               ` Paul E. McKenney
  2020-05-20 15:36                 ` Andy Lutomirski
  2020-05-20 14:19               ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20  2:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Andy Lutomirski <luto@kernel.org> writes:
> > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >> Thomas Gleixner <tglx@linutronix.de> writes:
> > >> It's about this:
> > >>
> > >> rcu_nmi_enter()
> > >> {
> > >>         if (!rcu_is_watching()) {
> > >>             make it watch;
> > >>         } else if (!in_nmi()) {
> > >>             do_magic_nohz_dyntick_muck();
> > >>         }
> > >>
> > >> So if we make all irq/system vector entries conditional then the
> > >> do_magic() never gets executed. After that I got lost...
> > >
> > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > to *all* entries -- just the not-super-magic ones that use
> > > idtentry_enter().
> > >
> > > Paul, what is this code actually trying to do?
> >
> > Citing Paul from IRC:
> >
> >   "The way things are right now, you can leave out the rcu_irq_enter()
> >    if this is not a nohz_full CPU.
> >
> >    Or if this is a nohz_full CPU, and the tick is already
> >    enabled, in that case you could also leave out the rcu_irq_enter().
> >
> >    Or even if this is a nohz_full CPU and it does not have the tick
> >    enabled, if it has been in the kernel less than a few tens of
> >    milliseconds, still OK to avoid invoking rcu_irq_enter()
> >
> >    But my guess is that it would be a lot simpler to just always call
> >    it.
> >
> > Hope that helps.
> 
> Maybe?
> 
> Unless I've missed something, the effect here is that #PF hitting in
> an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> (because you converted them) as well as other faults and traps will
> call rcu_irq_enter().
> 
> Once upon a time, we did this horrible thing where, on entry from user
> mode, we would turn on interrupts while still in CONTEXT_USER, which
> means we could get an IRQ in an extended quiescent state.  This means
> that the IRQ code had to end the EQS so that IRQ handlers could use
> RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> that, if IF=1, we are *not* in an EQS with the sole exception of the
> idle code.
> 
> In my dream world, we would never ever get IRQs while in an EQS -- we
> would do MWAIT with IF=0 and we would exit the EQS before taking the
> interrupt.  But I guess we still need to support HLT, which means we
> have this mess.
> 
> But I still think we can plausibly get rid of the conditional.

You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
this becomes:

	if (!rcu_is_watching()) {
	    make it watch;
	} else if (!in_nmi()) {
	    instrumentation_begin();
	    if (tick_nohz_full_cpu(rdp->cpu) && ... {
	    	do stuff
	    }
	    instrumentation_end();
	}

But tick_nohz_full_cpu() is compile-time-known false, so as long as the
compiler can ditch the instrumentation_begin() and instrumentation_end(),
the entire "else if" clause evaporates.
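
(For reference, the CONFIG_NO_HZ_FULL=n stub is essentially:

        static inline bool tick_nohz_full_cpu(int cpu) { return false; }

so the condition is compile-time false.)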

>                                                                 If we
> get an IRQ or (egads!) a fault in idle context, we'll have
> !__rcu_is_watching(), but, AFAICT, we also have preemption off.

Or we could be early in the kernel-entry code or late in the kernel-exit
code, but as far as I know, preemption is disabled on those code paths.
As are interrupts, right?  And interrupts are disabled on the portions
of the CPU-hotplug code where RCU is not watching, if I recall correctly.

I am guessing that interrupts from userspace are not at issue here, but
completeness and all that.

>                                                                  So it
> should be okay to do rcu_irq_enter().  OTOH, if we get an IRQ or a
> fault anywhere else, then we either have a severe bug in the RCU code
> itself and the RCU code faulted (in which case we get what we deserve)
> or RCU is watching and all is well.  This means that the rule will be
> that, if preemption is on, it's fine to schedule inside an
> idtentry_begin()/idtentry_end() pair.

On this, I must defer to you guys.

> The remaining bit is just the urgent thing, and I don't understand
> what's going on.  Paul, could we split out the urgent logic all by
> itself so that the IRQ handlers could do rcu_poke_urgent()?  Or am I
> entirely misunderstanding its purpose?

A nohz_full CPU does not enable the scheduling-clock interrupt upon
entry to the kernel.  Normally, this is fine because that CPU will very
quickly exit back to nohz_full userspace execution, so that RCU will
see the quiescent state, either by sampling it directly or by deducing
the CPU's passage through that quiescent state by comparing with state
that was captured earlier.  The grace-period kthread notices the lack
of a quiescent state and will eventually set ->rcu_urgent_qs to
trigger this code.

But if the nohz_full CPU stays in the kernel for an extended time,
perhaps due to OOM handling or due to processing of some huge I/O that
hits in-memory buffers/cache, then RCU needs some way of detecting
quiescent states on that CPU.  This requires the scheduling-clock
interrupt to be alive and well.
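
A condensed sketch of that branch in rcu_nmi_enter(), for reference (the
helper name is made up and the real code also takes the rcu_node lock
and rechecks before forcing the tick):

        static void rcu_force_tick_if_urgent(struct rcu_data *rdp)
        {
                if (tick_nohz_full_cpu(rdp->cpu) &&
                    READ_ONCE(rdp->rcu_urgent_qs) &&
                    !READ_ONCE(rdp->rcu_forced_tick)) {
                        /* GP kthread needs a QS from this CPU: tick on */
                        WRITE_ONCE(rdp->rcu_forced_tick, true);
                        tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
                }
        }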

Are there other ways to get this done?  But of course!  RCU could
for example use smp_call_function_single() or use workqueues to force
execution onto that CPU and enable the tick that way.  This gets a
little involved in order to avoid deadlock, but if the added check
in rcu_nmi_enter() is causing trouble, something can be arranged.
Though that something would cause more latency excursions than
does the current code.

Or did you have something else in mind?

						Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-19 19:44       ` Andy Lutomirski
@ 2020-05-20  8:06         ` Jürgen Groß
  2020-05-20 11:31           ` Andrew Cooper
  2020-05-20 14:13         ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Jürgen Groß @ 2020-05-20  8:06 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner
  Cc: Andrew Cooper, LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On 19.05.20 21:44, Andy Lutomirski wrote:
> On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Andy Lutomirski <luto@kernel.org> writes:
>>> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> @@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs)
>>>>                                  instrumentation_end();
>>>>                                  return;
>>>>                          }
>>>> +               } else if (IS_ENABLED(CONFIG_XEN_PV)) {
>>>> +                       if (preempt_hcall) {
>>>> +                               /* See CONFIG_PREEMPTION above */
>>>> +                               instrumentation_begin();
>>>> +                               rcu_irq_exit_preempt();
>>>> +                               xen_maybe_preempt_hcall();
>>>> +                               trace_hardirqs_on();
>>>> +                               instrumentation_end();
>>>> +                               return;
>>>> +                       }
>>>
>>> Ewwwww!  This shouldn't be taken as a NAK -- it's just an expression
>>> of disgust.
>>
>> I'm really not proud of it, but that was the least horrible thing I
>> could come up with.
>>
>>> Shouldn't this be:
>>>
>>> instrumentation_begin();
>>> if (!irq_needs_irq_stack(...))
>>>    __blah();
>>> else
>>>    run_on_irqstack(__blah, NULL);
>>> instrumentation_end();
>>>
>>> or even:
>>>
>>> instrumentation_begin();
>>> run_on_irqstack_if_needed(__blah, NULL);
>>> instrumentation_end();
>>
>> Yeah. In that case the instrumentation markers are not required as they
>> will be inside the run....() function.
>>
>>> ****** BUT *******
>>>
>>> I think this is all arse-backwards.  This is a giant mess designed to
>>> pretend we support preemption and to emulate normal preemption in a
>>> non-preemptible kernel.  I propose one to two massive cleanups:
>>>
>>> A: Just delete all of this code.  Preemptible hypercalls on
>>> non-preempt kernels will still process interrupts but won't get
>>> preempted.  If you want preemption, compile with preemption.
>>
>> I'm happy to do so, but the XEN folks might have opinions on that :)

Indeed. :-)

>>
>>> B: Turn this thing around.  Specifically, in the one and only case we
>>> care about, we know pretty much exactly what context we got this entry
>>> in: we're running in a schedulable context doing an explicitly
>>> preemptible hypercall, and we have RIP pointing at a SYSCALL
>>> instruction (presumably, but we shouldn't bet on it) in the hypercall
>>> page.  Ideally we would change the Xen PV ABI so the hypercall would
>>> return something like EAGAIN instead of auto-restarting and we could
>>> ditch this mess entirely.  But the ABI seems to be set in stone or at
>>> least in molasses, so how about just:
>>>
>>> idt_entry(exit(regs));
>>> if (inhcall && need_resched())
>>>    schedule();
>>
>> Which brings you into the situation that you call schedule() from the
>> point where we just moved it out. If we would go there we'd need to
>> ensure that RCU is watching as well. idtentry_exit() might have it
>> turned off ....
> 
> I don't think this is possible.  Once you untangle all the wrappers,
> the call sites are effectively:
> 
> __this_cpu_write(xen_in_preemptible_hcall, true);
> CALL_NOSPEC to the hypercall page
> __this_cpu_write(xen_in_preemptible_hcall, false);
> 
> I think IF=1 when this happens, but I won't swear to it.  RCU had
> better be watching.

Preemptible hypercalls are never done with interrupts off. To be more
precise: they are only ever done during ioctl() processing.

I can add an ASSERT() to xen_preemptible_hcall_begin() if you want.
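
Something like this (untested):

        static inline void xen_preemptible_hcall_begin(void)
        {
                lockdep_assert_irqs_enabled();
                __this_cpu_write(xen_in_preemptible_hcall, true);
        }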

> 
> As I understand it, the one and only situation Xen wants to handle is
> that an interrupt gets delivered during the hypercall.  The hypervisor
> is too clever for its own good and deals with this by rewinding RIP to
> the beginning of whatever instruction did the hypercall and delivers
> the interrupt, and we end up in this handler.  So, if this happens,
> the idea is to not only handle the interrupt but to schedule if
> scheduling would be useful.

Correct. More precisely: the hypercalls in question can last very long
(up to several seconds) and so they need to be interruptible. As said
before: the interface for how this is done is horrible. :-(

> 
> So I don't think we need all this RCU magic.  This really ought to be
> able to be simplified to:
> 
> idtentry_exit();
> 
> if (appropriate condition)
>    schedule();
> 
> Obviously we don't want to schedule if this is a nested entry, but we
> should be able to rule that out by checking that regs->flags &
> X86_EFLAGS_IF and by handling the percpu variable a little more
> intelligently.  So maybe the right approach is:
> 
> bool in_preemptible_hcall = __this_cpu_read(xen_in_preemptible_hcall);
> __this_cpu_write(xen_in_preemptible_hcall, false);
> idtentry_enter(...);
> 
> do the acutal work;
> 
> idtentry_exit(...);
> 
> if (in_preemptible_hcall) {
>    assert regs->flags & X86_EFLAGS_IF;
>    assert that RCU is watching;
>    assert that we're on the thread stack;
>    assert whatever else we feel like asserting;
>    if (need_resched())
>      schedule();
> }
> 
> __this_cpu_write(xen_in_preemptible_hcall, in_preemptible_hcall);
> 
> And now we don't have a special idtentry_exit() case just for Xen, and
> all the mess is entirely contained in the Xen PV code.  And we need to
> mark all the preemptible hypercalls noinstr.  Does this seem
> reasonable?

From my point of view this sounds fine.

> 
> That being said, right now, with or without your patch, I think we're
> toast if the preemptible hypercall code gets traced.  So maybe the
> right thing is to just drop all the magic preemption stuff from your
> patch and let the Xen maintainers submit something new (maybe like
> what I suggest above) if they want magic preemption back.
> 

I'd prefer not to break preemptible hypercalls in the meantime.

IMO the patch should be modified along the lines of your suggestion. I'd be
happy to test it.


Juergen

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20  8:06         ` Jürgen Groß
@ 2020-05-20 11:31           ` Andrew Cooper
  0 siblings, 0 replies; 159+ messages in thread
From: Andrew Cooper @ 2020-05-20 11:31 UTC (permalink / raw)
  To: Jürgen Groß, Andy Lutomirski, Thomas Gleixner
  Cc: LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On 20/05/2020 09:06, Jürgen Groß wrote:
> On 19.05.20 21:44, Andy Lutomirski wrote:
>> On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner <tglx@linutronix.de>
>> wrote:
>>>
>>> Andy Lutomirski <luto@kernel.org> writes:
>>>> B: Turn this thing around.  Specifically, in the one and only case we
>>>> care about, we know pretty much exactly what context we got this entry
>>>> in: we're running in a schedulable context doing an explicitly
>>>> preemptible hypercall, and we have RIP pointing at a SYSCALL
>>>> instruction (presumably, but we shouldn't bet on it) in the hypercall
>>>> page.  Ideally we would change the Xen PV ABI so the hypercall would
>>>> return something like EAGAIN instead of auto-restarting and we could
>>>> ditch this mess entirely.  But the ABI seems to be set in stone or at
>>>> least in molasses, so how about just:
>>>>
>>>> idt_entry(exit(regs));
>>>> if (inhcall && need_resched())
>>>>    schedule();
>>>
>>> Which brings you into the situation that you call schedule() from the
>>> point where we just moved it out. If we would go there we'd need to
>>> ensure that RCU is watching as well. idtentry_exit() might have it
>>> turned off ....
>>
>> I don't think this is possible.  Once you untangle all the wrappers,
>> the call sites are effectively:
>>
>> __this_cpu_write(xen_in_preemptible_hcall, true);
>> CALL_NOSPEC to the hypercall page
>> __this_cpu_write(xen_in_preemptible_hcall, false);
>>
>> I think IF=1 when this happens, but I won't swear to it.  RCU had
>> better be watching.
>
> Preemptible hypercalls are never done with interrupts off. To be more
> precise: they are only ever done during ioctl() processing.
>
> I can add an ASSERT() to xen_preemptible_hcall_begin() if you want.
>
>>
>> As I understand it, the one and only situation Xen wants to handle is
>> that an interrupt gets delivered during the hypercall.  The hypervisor
>> is too clever for its own good and deals with this by rewinding RIP to
>> the beginning of whatever instruction did the hypercall and delivers
>> the interrupt, and we end up in this handler.  So, if this happens,
>> the idea is to not only handle the interrupt but to schedule if
>> scheduling would be useful.
>
> Correct. More precisely: the hypercalls in question can last very long
> (up to several seconds) and so they need to be interruptible. As said
> before: the interface for how this is done is horrible. :-(

Forget seconds.  DOMCTL_domain_kill gets to ~14 minutes for a 2TB domain.

The reason for the existing logic is to be able to voluntarily preempt.

It doesn't need to remain the way it is, but some adequate form of
pre-emption does need to stay.

~Andrew

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-18 23:56         ` Andy Lutomirski
@ 2020-05-20 12:35           ` Thomas Gleixner
  2020-05-20 15:09             ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 12:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Mon, May 18, 2020 at 4:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Andy Lutomirski <luto@kernel.org> writes:
>> > Actually, I revoke my ack.  Can you make one of two changes:
>> >
>> > Option A: Add an assertion to run_on_irqstack to verify that irq_count
>> > was -1 at the beginning?  I suppose this also means you could just
>> > explicitly write 0 instead of adding and subtracting.
>> >
>> > Option B: Make run_on_irqstack() just call the function on the current
>> > stack if we're already on the irq stack.
>> >
>> > Right now, it's too easy to mess up and not verify the right
>> > precondition before calling run_on_irqstack().
>> >
>> > If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
>> > dance so that users can just do:
>> >
>> > run_on_irqstack_if_needed(...);
>> >
>> > instead of checking everything themselves.
>>
>> I'll have a look tomorrow morning with brain awake.
>
> Also, reading more of the series, I suspect that asm_call_on_stack is
> logically in the wrong section or that the noinstr stuff is otherwise
> not quite right.  I think that objtool should not accept
> run_on_irqstack() from noinstr code.  See followups on patch 10.

It's in entry.text which is non-instrumentable as well.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-19 19:44       ` Andy Lutomirski
  2020-05-20  8:06         ` Jürgen Groß
@ 2020-05-20 14:13         ` Thomas Gleixner
  2020-05-20 15:16           ` Andy Lutomirski
  1 sibling, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 14:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Andrew Cooper, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Which brings you into the situation that you call schedule() from the
>> point where we just moved it out. If we would go there we'd need to
>> ensure that RCU is watching as well. idtentry_exit() might have it
>> turned off ....
>
> I don't think this is possible.  Once you untangle all the wrappers,
> the call sites are effectively:
>
> __this_cpu_write(xen_in_preemptible_hcall, true);
> CALL_NOSPEC to the hypercall page
> __this_cpu_write(xen_in_preemptible_hcall, false);
>
> I think IF=1 when this happens, but I won't swear to it.  RCU had
> better be watching.
>
> As I understand it, the one and only situation Xen wants to handle is
> that an interrupt gets delivered during the hypercall.  The hypervisor
> is too clever for its own good and deals with this by rewinding RIP to
> the beginning of whatever instruction did the hypercall and delivers
> the interrupt, and we end up in this handler.  So, if this happens,
> the idea is to not only handle the interrupt but to schedule if
> scheduling would be useful.
>
> So I don't think we need all this RCU magic.  This really ought to be
> able to be simplified to:
>
> idtentry_exit();
>
> if (appropriate condition)
>   schedule();

This is exactly the kind of tinkering which causes all kinds of trouble.

idtentry_exit()

	if (user_mode(regs)) {
		prepare_exit_to_usermode(regs);
	} else if (regs->flags & X86_EFLAGS_IF) {
		/* Check kernel preemption, if enabled */
		if (IS_ENABLED(CONFIG_PREEMPTION)) {
                    ....
		}
		instrumentation_begin();
		/* Tell the tracer that IRET will enable interrupts */
		trace_hardirqs_on_prepare();
		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
		instrumentation_end();
		rcu_irq_exit();
		lockdep_hardirqs_on(CALLER_ADDR0);
	} else {
		/* IRQ flags state is correct already. Just tell RCU */
		rcu_irq_exit();
	}

So in case IF is set then this already told the tracer and lockdep that
interrupts are enabled. And contrary to the ugly version this exit path
does not use rcu_irq_exit_preempt() which is there to warn about crappy
RCU state when trying to schedule.

So we went to great lengths to sanitize _all_ of this and make it consistent
just to say: screw it for that xen thingy.

The extra checks and extra warnings for scheduling come with a guarantee
of bitrot whenever idtentry_exit() or any logic invoked from there
is changed. It's going to look like this:

	/*
         * If the below causes problems due to inconsistent state
         * or out of sync sanity checks, please complain to
         * luto@kernel.org directly.
         */
        idtentry_exit();

	if (user_mode(regs) || !(regs->flags & X86_FlAGS_IF))
        	return;

        if (!__this_cpu_read(xen_in_preemptible_hcall))
        	return;

        rcu_sanity_check_for_preemption();

        if (need_resched()) {
        	instrumentation_begin();
		xen_maybe_preempt_hcall();
		trace_hardirqs_on();
		instrumentation_end();
	}  	

Of course you need the extra rcu_sanity_check_for_preemption() function
just for this muck.

That's a true win on all ends? I don't think so.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20  0:26             ` Andy Lutomirski
  2020-05-20  2:23               ` Paul E. McKenney
@ 2020-05-20 14:19               ` Thomas Gleixner
  1 sibling, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 14:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> Unless I've missed something, the effect here is that #PF hitting in
> an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> (because you converted them) as well as other faults and traps will
> call rcu_irq_enter().

The only reason why this is needed for #PF is that a kernel mode #PF may
sleep. And of course you cannot sleep after calling rcu_irq_enter().

All other interrupts/traps/system vectors cannot sleep ever. So it's a
straightforward enter/exit.
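
In other words, everything else can pair rcu_irq_enter()/rcu_irq_exit()
unconditionally; only #PF needs the conditional variant, roughly
(illustrative only, not the exact helper names):

        /* #PF: skip the RCU irq section if RCU already watches,
         * so that the fault handler may sleep later on. */
        bool rcu_exit = !rcu_is_watching();

        if (rcu_exit)
                rcu_irq_enter();

        handler(regs);          /* may sleep if rcu_exit == false */

        if (rcu_exit)
                rcu_irq_exit();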

> Once upon a time, we did this horrible thing where, on entry from user
> mode, we would turn on interrupts while still in CONTEXT_USER, which
> means we could get an IRQ in an extended quiescent state.  This means
> that the IRQ code had to end the EQS so that IRQ handlers could use
> RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> that, if IF=1, we are *not* in an EQS with the sole exception of the
> idle code.
>
> In my dream world, we would never ever get IRQs while in an EQS -- we
> would do MWAIT with IF=0 and we would exit the EQS before taking the
> interrupt.  But I guess we still need to support HLT, which means we
> have this mess.

You can always dream, but don't complain about the nightmares :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC
  2020-05-20  0:29   ` Andy Lutomirski
@ 2020-05-20 15:07     ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 15:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:

> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>>
>> Provide an IDTENTRY variant for system vectors to consolidate the different
>> mechanisms to emit the ASM stubs for 32 and 64 bit.
>>
>> On 64bit this also moves the stack switching from ASM to C code. 32bit will
>> execute the system vectors w/o stack switching as before. As some of the
>> system vector handlers require access to pt_regs, this requires a new stack
>> switching macro which can handle an argument.
>
> Is that last sentence obsolete?

Yes.

> Otherwise,
>
> Acked-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW
  2020-05-19 23:57   ` Andy Lutomirski
@ 2020-05-20 15:08     ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 15:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:

> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>>
>> The scheduler IPI does not need the full interrupt entry handling logic
>> when the entry is from kernel mode.
>>
>> Even if tracing is enabled the only requirement is that RCU is watching and
>> preempt_count has the hardirq bit on.
>>
>> The NOHZ tick state does not have to be adjusted. If the tick is not
>> running then the CPU is in idle and the idle exit will restore the
>> tick. Softinterrupts are not raised here, so handling them on return is not
>> required either.
>>
>> User mode entry must go through the regular entry path as it will invoke
>> the scheduler on return so context tracking needs to be in the correct
>> state.
>>
>> Use IDTENTRY_RAW and the RCU conditional variants of idtentry_enter/exit()
>> to guarantee that RCU is watching even if the IPI hits a RCU idle section.
>>
>> Remove the tracepoint static key conditional which is incomplete
>> vs. tracing anyway because e.g. ack_APIC_irq() calls out into
>> instrumentable code.
>>
>> Avoid the overhead of irq time accounting and introduce variants of
>> __irq_enter/exit() so instrumentation observes the correct preempt count
>> state.
>
> Leftover text from an old version?

Indeed

> The code is Reviewed-by: Andy Lutomirski <luto@kernel.org>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-20 12:35           ` Thomas Gleixner
@ 2020-05-20 15:09             ` Andy Lutomirski
  2020-05-20 15:27               ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 15:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 5:35 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Mon, May 18, 2020 at 4:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Andy Lutomirski <luto@kernel.org> writes:
> >> > Actually, I revoke my ack.  Can you make one of two changes:
> >> >
> >> > Option A: Add an assertion to run_on_irqstack to verify that irq_count
> >> > was -1 at the beginning?  I suppose this also means you could just
> >> > explicitly write 0 instead of adding and subtracting.
> >> >
> >> > Option B: Make run_on_irqstack() just call the function on the current
> >> > stack if we're already on the irq stack.
> >> >
> >> > Right now, it's too easy to mess up and not verify the right
> >> > precondition before calling run_on_irqstack().
> >> >
> >> > If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
> >> > dance so that users can just do:
> >> >
> >> > run_on_irqstack_if_needed(...);
> >> >
> >> > instead of checking everything themselves.
> >>
> >> I'll have a look tomorrow morning with brain awake.
> >
> > Also, reading more of the series, I suspect that asm_call_on_stack is
> > logically in the wrong section or that the noinstr stuff is otherwise
> > not quite right.  I think that objtool should not accept
> > run_on_irqstack() from noinstr code.  See followups on patch 10.
>
> It's in entry.text which is non-instrumentable as well.

Hmm.  I suppose we can chalk this up to the noinstr checking not being
entirely perfect.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20 14:13         ` Thomas Gleixner
@ 2020-05-20 15:16           ` Andy Lutomirski
  2020-05-20 17:22             ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 15:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Andrew Cooper, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Which brings you into the situation that you call schedule() from the
> >> point where we just moved it out. If we would go there we'd need to
> >> ensure that RCU is watching as well. idtentry_exit() might have it
> >> turned off ....
> >
> > I don't think this is possible.  Once you untangle all the wrappers,
> > the call sites are effectively:
> >
> > __this_cpu_write(xen_in_preemptible_hcall, true);
> > CALL_NOSPEC to the hypercall page
> > __this_cpu_write(xen_in_preemptible_hcall, false);
> >
> > I think IF=1 when this happens, but I won't swear to it.  RCU had
> > better be watching.
> >
> > As I understand it, the one and only situation Xen wants to handle is
> > that an interrupt gets delivered during the hypercall.  The hypervisor
> > is too clever for its own good and deals with this by rewinding RIP to
> > the beginning of whatever instruction did the hypercall and delivers
> > the interrupt, and we end up in this handler.  So, if this happens,
> > the idea is to not only handle the interrupt but to schedule if
> > scheduling would be useful.
> >
> > So I don't think we need all this RCU magic.  This really ought to be
> > able to be simplified to:
> >
> > idtentry_exit();
> >
> > if (appropriate condition)
> >   schedule();
>
> This is exactly the kind of tinkering which causes all kinds of trouble.
>
> idtentry_exit()
>
>         if (user_mode(regs)) {
>                 prepare_exit_to_usermode(regs);
>         } else if (regs->flags & X86_EFLAGS_IF) {
>                 /* Check kernel preemption, if enabled */
>                 if (IS_ENABLED(CONFIG_PREEMPTION)) {
>                     ....
>                 }
>                 instrumentation_begin();
>                 /* Tell the tracer that IRET will enable interrupts */
>                 trace_hardirqs_on_prepare();
>                 lockdep_hardirqs_on_prepare(CALLER_ADDR0);
>                 instrumentation_end();
>                 rcu_irq_exit();
>                 lockdep_hardirqs_on(CALLER_ADDR0);
>         } else {
>                 /* IRQ flags state is correct already. Just tell RCU */
>                 rcu_irq_exit();
>         }
>
> So in case IF is set then this already told the tracer and lockdep that
> interrupts are enabled. And contrary to the ugly version this exit path
> does not use rcu_irq_exit_preempt() which is there to warn about crappy
> RCU state when trying to schedule.
>
> So we went to great lengths to sanitize _all_ of this and make it consistent
> just to say: screw it for that xen thingy.
>
> The extra checks and extra warnings for scheduling come with a guarantee
> of bitrot whenever idtentry_exit() or any logic invoked from there
> is changed. It's going to look like this:
>
>         /*
>          * If the below causes problems due to inconsistent state
>          * or out of sync sanity checks, please complain to
>          * luto@kernel.org directly.
>          */
>         idtentry_exit();
>
>         if (user_mode(regs) || !(regs->flags & X86_FlAGS_IF))
>                 return;
>
>         if (!__this_cpu_read(xen_in_preemptible_hcall))
>                 return;
>
>         rcu_sanity_check_for_preemption();
>
>         if (need_resched()) {
>                 instrumentation_begin();
>                 xen_maybe_preempt_hcall();
>                 trace_hardirqs_on();
>                 instrumentation_end();
>         }
>
> Of course you need the extra rcu_sanity_check_for_preemption() function
> just for this muck.
>
> That's a true win on all ends? I don't think so.

Hmm, fair enough.  I guess the IRQ tracing messes a bunch of this logic up.

Let's keep your patch as is and consider cleanups later.  One approach
might be to make this work more like extable handling: instead of
trying to schedule from inside the interrupt handler here, patch up
RIP and perhaps some other registers and let the actual Xen code just
do cond_resched().  IOW, try to make this work the way it always
should have:

int ret;
do {
  ret = issue_the_hypercall();
  cond_resched();
} while (ret == EAGAIN);

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code
  2020-05-20  0:53   ` Andy Lutomirski
@ 2020-05-20 15:16     ` Thomas Gleixner
  2020-05-20 17:13       ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 15:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:

> On Fri, May 15, 2020 at 5:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I think something's missing here.  With this patch applied, don't we
> get to exc_debug_kernel() -> handle_debug() without doing
> idtentry_enter() or equivalent?  And that can even enable IRQs.
>
> Maybe exc_debug_kernel() should wrap handle_debug() in some
> appropriate _enter() / _exit() pair?

I'm the one who is missing something here, i.e. the connection of this
patch to #DB. exc_debug_kernel() still looks like this:

	nmi_enter_notrace();
	handle_debug(regs, dr6, false);
	nmi_exit_notrace();

Confused.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-20 15:09             ` Andy Lutomirski
@ 2020-05-20 15:27               ` Thomas Gleixner
  2020-05-20 15:36                 ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 15:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Wed, May 20, 2020 at 5:35 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Andy Lutomirski <luto@kernel.org> writes:
>> > On Mon, May 18, 2020 at 4:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >>
>> >> Andy Lutomirski <luto@kernel.org> writes:
>> >> > Actually, I revoke my ack.  Can you make one of two changes:
>> >> >
>> >> > Option A: Add an assertion to run_on_irqstack to verify that irq_count
>> >> > was -1 at the beginning?  I suppose this also means you could just
>> >> > explicitly write 0 instead of adding and subtracting.
>> >> >
>> >> > Option B: Make run_on_irqstack() just call the function on the current
>> >> > stack if we're already on the irq stack.
>> >> >
>> >> > Right now, it's too easy to mess up and not verify the right
>> >> > precondition before calling run_on_irqstack().
>> >> >
>> >> > If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
>> >> > dance so that users can just do:
>> >> >
>> >> > run_on_irqstack_if_needed(...);
>> >> >
>> >> > instead of checking everything themselves.
>> >>
>> >> I'll have a look tomorrow morning with brain awake.
>> >
>> > Also, reading more of the series, I suspect that asm_call_on_stack is
>> > logically in the wrong section or that the noinstr stuff is otherwise
>> > not quite right.  I think that objtool should not accept
>> > run_on_irqstack() from noinstr code.  See followups on patch 10.
>>
>> It's in entry.text which is non-instrumentable as well.
>
> Hmm.  I suppose we can chalk this up to the noinstr checking not being
> entirely perfect.

objtool considers both entry.text and noinstr.text. We just can't stick
everything into entry.text for these !%@#45@# reasons.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20  2:23               ` Paul E. McKenney
@ 2020-05-20 15:36                 ` Andy Lutomirski
  2020-05-20 16:51                   ` Andy Lutomirski
  2020-05-20 17:38                   ` Paul E. McKenney
  0 siblings, 2 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 15:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Andy Lutomirski <luto@kernel.org> writes:
> > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >> Thomas Gleixner <tglx@linutronix.de> writes:
> > > >> It's about this:
> > > >>
> > > >> rcu_nmi_enter()
> > > >> {
> > > >>         if (!rcu_is_watching()) {
> > > >>             make it watch;
> > > >>         } else if (!in_nmi()) {
> > > >>             do_magic_nohz_dyntick_muck();
> > > >>         }
> > > >>
> > > >> So if we do all irq/system vector entries conditional then the
> > > >> do_magic() gets never executed. After that I got lost...
> > > >
> > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > to *all* entries -- just the not-super-magic ones that use
> > > > idtentry_enter().
> > > >
> > > > Paul, what is this code actually trying to do?
> > >
> > > Citing Paul from IRC:
> > >
> > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > >    if this is not a nohz_full CPU.
> > >
> > >    Or if this is a nohz_full CPU, and the tick is already
> > >    enabled, in that case you could also leave out the rcu_irq_enter().
> > >
> > >    Or even if this is a nohz_full CPU and it does not have the tick
> > >    enabled, if it has been in the kernel less than a few tens of
> > >    milliseconds, still OK to avoid invoking rcu_irq_enter()
> > >
> > >    But my guess is that it would be a lot simpler to just always call
> > >    it.
> > >
> > > Hope that helps.
> >
> > Maybe?
> >
> > Unless I've missed something, the effect here is that #PF hitting in
> > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > (because you converted them) as well as other faults and traps will
> > call rcu_irq_enter().
> >
> > Once upon a time, we did this horrible thing where, on entry from user
> > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > means we could get an IRQ in an extended quiescent state.  This means
> > that the IRQ code had to end the EQS so that IRQ handlers could use
> > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > idle code.
> >
> > In my dream world, we would never ever get IRQs while in an EQS -- we
> > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > interrupt.  But I guess we still need to support HLT, which means we
> > have this mess.
> >
> > But I still think we can plausibly get rid of the conditional.
>
> You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> this becomes:

So, I meant the conditional in tglx's patch that makes page faults special.

>
> >                                                                 If we
> > get an IRQ or (egads!) a fault in idle context, we'll have
> > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
>
> Or we could be early in the kernel-entry code or late in the kernel-exit
> code, but as far as I know, preemption is disabled on those code paths.
> As are interrupts, right?  And interrupts are disabled on the portions
> of the CPU-hotplug code where RCU is not watching, if I recall correctly.

Interrupts are off in the parts of the entry/exit that RCU considers
to be user mode.  We can get various faults, although these should be
either NMI-like or events that genuinely or effectively happened in
user mode.

>
> A nohz_full CPU does not enable the scheduling-clock interrupt upon
> entry to the kernel.  Normally, this is fine because that CPU will very
> quickly exit back to nohz_full userspace execution, so that RCU will
> see the quiescent state, either by sampling it directly or by deducing
> the CPU's passage through that quiescent state by comparing with state
> that was captured earlier.  The grace-period kthread notices the lack
> of a quiescent state and will eventually set ->rcu_urgent_qs to
> trigger this code.
>
> But if the nohz_full CPU stays in the kernel for an extended time,
> perhaps due to OOM handling or due to processing of some huge I/O that
> hits in-memory buffers/cache, then RCU needs some way of detecting
> quiescent states on that CPU.  This requires the scheduling-clock
> interrupt to be alive and well.
>
> Are there other ways to get this done?  But of course!  RCU could
> for example use smp_call_function_single() or use workqueues to force
> execution onto that CPU and enable the tick that way.  This gets a
> little involved in order to avoid deadlock, but if the added check
> in rcu_nmi_enter() is causing trouble, something can be arranged.
> Though that something would cause more latency excursions than
> does the current code.
>
> Or did you have something else in mind?

I'm trying to understand when we actually need to call the function.
Is it just the scheduling interrupt that's supposed to call
rcu_irq_enter()?  But the scheduling interrupt is off, so I'm
confused.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack
  2020-05-20 15:27               ` Thomas Gleixner
@ 2020-05-20 15:36                 ` Andy Lutomirski
  0 siblings, 0 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 15:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 8:27 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Wed, May 20, 2020 at 5:35 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Andy Lutomirski <luto@kernel.org> writes:
> >> > On Mon, May 18, 2020 at 4:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> >>
> >> >> Andy Lutomirski <luto@kernel.org> writes:
> >> >> > Actually, I revoke my ack.  Can you make one of two changes:
> >> >> >
> >> >> > Option A: Add an assertion to run_on_irqstack to verify that irq_count
> >> >> > was -1 at the beginning?  I suppose this also means you could just
> >> >> > explicitly write 0 instead of adding and subtracting.
> >> >> >
> >> >> > Option B: Make run_on_irqstack() just call the function on the current
> >> >> > stack if we're already on the irq stack.
> >> >> >
> >> >> > Right now, it's too easy to mess up and not verify the right
> >> >> > precondition before calling run_on_irqstack().
> >> >> >
> >> >> > If you choose A, perhaps add a helper to do the if(irq_needs_irqstack)
> >> >> > dance so that users can just do:
> >> >> >
> >> >> > run_on_irqstack_if_needed(...);
> >> >> >
> >> >> > instead of checking everything themselves.
> >> >>
> >> >> I'll have a look tomorrow morning with brain awake.
> >> >
> >> > Also, reading more of the series, I suspect that asm_call_on_stack is
> >> > logically in the wrong section or that the noinstr stuff is otherwise
> >> > not quite right.  I think that objtool should not accept
> >> > run_on_irqstack() from noinstr code.  See followups on patch 10.
> >>
> >> It's in entry.text which is non-instrumentable as well.
> >
> > Hmm.  I suppose we can chalk this up to the noinstr checking not being
> > entirely perfect.
>
> objtool considers both entry.text and noinstr.text. We just can't stick
> everything into entry.text for these !%@#45@# reasons.
>

Meh, this is all fine I think.  I think it would be slightly nicer if
objtool were to warn if noinstr code called run_on_stack(), but I'm
not sure it matters much.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 15:36                 ` Andy Lutomirski
@ 2020-05-20 16:51                   ` Andy Lutomirski
  2020-05-20 18:05                     ` Paul E. McKenney
                                       ` (2 more replies)
  2020-05-20 17:38                   ` Paul E. McKenney
  1 sibling, 3 replies; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 16:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul E. McKenney, Thomas Gleixner, LKML, X86 ML,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:

First, the patch as you submitted it is Acked-by: Andy Lutomirski
<luto@kernel.org>.  I think there are cleanups that should happen, but
I think the patch is correct.

About cleanups, concretely:  I think that everything that calls
__idtentry_enter() is called in one of a small number of relatively
sane states:

1. User mode.  This is easy.

2. Kernel, RCU is watching, everything is sane.  We don't actually
need to do any RCU entry/exit pairs -- we should be okay with just a
hypothetical RCU tickle (and IRQ tracing, etc).  This variant can
sleep after the entry part finishes if regs->flags & IF and no one
turned off preemption.

3. Kernel, RCU is not watching, system was idle.  This can only be an
actual interrupt.

So maybe the code can change to:

    if (user_mode(regs)) {
        enter_from_user_mode();
    } else {
        if (!__rcu_is_watching()) {
            /*
             * If RCU is not watching then the same careful
             * sequence vs. lockdep and tracing is required.
             *
             * This only happens for IRQs that hit the idle loop, and
             * even that only happens if we aren't using the sane
             * MWAIT-while-IF=0 mode.
             */
            lockdep_hardirqs_off(CALLER_ADDR0);
            rcu_irq_enter();
            instrumentation_begin();
            trace_hardirqs_off_prepare();
            instrumentation_end();
            return true;
        } else {
            /*
             * If RCU is watching then the combo function
             * can be used.
             */
            instrumentation_begin();
            trace_hardirqs_off();
            rcu_tickle();
            instrumentation_end();
        }
    }
    return false;

This is exactly what you have except that the cond_rcu part is gone
and I added rcu_tickle().
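
To make "tickle" concrete, a minimal sketch of what I imagine rcu_tickle()
doing: nothing but the nohz_full tick-forcing piece of rcu_nmi_enter(),
living in kernel/rcu/tree.c.  The rcu_data fields and tick helpers are the
ones I remember, so treat the details as approximate:

static __always_inline void rcu_tickle(void)
{
	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

	/* RCU is already watching, so no EQS transition is needed. */
	if (!tick_nohz_full_cpu(rdp->cpu))
		return;

	/*
	 * RCU flagged this CPU as urgently needing a quiescent state and
	 * the tick is not already forced on: force it on so that the
	 * scheduling-clock interrupt can report the quiescent state.
	 */
	if (READ_ONCE(rdp->rcu_urgent_qs) && !READ_ONCE(rdp->rcu_forced_tick)) {
		raw_spin_lock_rcu_node(rdp->mynode);
		/* Recheck under the node lock to avoid forcing the tick twice. */
		if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
			WRITE_ONCE(rdp->rcu_forced_tick, true);
			tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
		}
		raw_spin_unlock_rcu_node(rdp->mynode);
	}
}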

Paul, the major change here is that if an IRQ hits normal kernel code
(i.e. code where RCU is watching and we're not in an EQS), the IRQ
won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
rcu_tickle() on entry and nothing on exit.  Does that cover all the
bases?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code
  2020-05-20 15:16     ` Thomas Gleixner
@ 2020-05-20 17:13       ` Andy Lutomirski
  2020-05-20 18:33         ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 17:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 8:17 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
>
> > On Fri, May 15, 2020 at 5:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > I think something's missing here.  With this patch applied, don't we
> > get to exc_debug_kernel() -> handle_debug() without doing
> > idtentry_enter() or equivalent?  And that can even enable IRQs.
> >
> > Maybe exc_debug_kernel() should wrap handle_debug() in some
> > appropriate _enter() / _exit() pair?
>
> I'm the one who is missing something here, i.e. the connection of this
> patch to #DB. exc_debug_kernel() still looks like this:
>
>         nmi_enter_notrace();
>         handle_debug(regs, dr6, false);
>         nmi_exit_notrace();
>
> Confused.
>

Hmm.  I guess the code is correct-ish or at least as correct as it
ever was.  But $SUBJECT says "Move paranoid irq tracing out of ASM
code" but you didn't move it into all the users.  So now the NMI code
does trace_hardirqs_on_prepare() but the #DB code doesn't.  Perhaps
the changelog should mention this.

exc_debug_kernel() is an atrocity.  Every now and then I get started
on cleaning it up and so far I always get mired in the giant amount of
indirection.

So Acked-by: Andy Lutomirski <luto@kernel.org> if you write a credible
changelog.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20 15:16           ` Andy Lutomirski
@ 2020-05-20 17:22             ` Andy Lutomirski
  2020-05-20 19:16               ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 17:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Andrew Cooper, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 8:16 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, May 20, 2020 at 7:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Andy Lutomirski <luto@kernel.org> writes:
> > > On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >> Which brings you into the situation that you call schedule() from the
> > >> point where we just moved it out. If we would go there we'd need to
> > >> ensure that RCU is watching as well. idtentry_exit() might have it
> > >> turned off ....
> > >
> > > I don't think this is possible.  Once you untangle all the wrappers,
> > > the call sites are effectively:
> > >
> > > __this_cpu_write(xen_in_preemptible_hcall, true);
> > > CALL_NOSPEC to the hypercall page
> > > __this_cpu_write(xen_in_preemptible_hcall, false);
> > >
> > > I think IF=1 when this happens, but I won't swear to it.  RCU had
> > > better be watching.
> > >
> > > As I understand it, the one and only situation Xen wants to handle is
> > > that an interrupt gets delivered during the hypercall.  The hypervisor
> > > is too clever for its own good and deals with this by rewinding RIP to
> > > the beginning of whatever instruction did the hypercall and delivers
> > > the interrupt, and we end up in this handler.  So, if this happens,
> > > the idea is to not only handle the interrupt but to schedule if
> > > scheduling would be useful.
> > >
> > > So I don't think we need all this RCU magic.  This really ought to be
> > > able to be simplified to:
> > >
> > > idtentry_exit();
> > >
> > > if (appropriate condition)
> > >   schedule();
> >
> > This is exactly the kind of tinkering which causes all kinds of trouble.
> >
> > idtentry_exit()
> >
> >         if (user_mode(regs)) {
> >                 prepare_exit_to_usermode(regs);
> >         } else if (regs->flags & X86_EFLAGS_IF) {
> >                 /* Check kernel preemption, if enabled */
> >                 if (IS_ENABLED(CONFIG_PREEMPTION)) {
> >                     ....
> >                 }
> >                 instrumentation_begin();
> >                 /* Tell the tracer that IRET will enable interrupts */
> >                 trace_hardirqs_on_prepare();
> >                 lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> >                 instrumentation_end();
> >                 rcu_irq_exit();
> >                 lockdep_hardirqs_on(CALLER_ADDR0);
> >         } else {
> >                 /* IRQ flags state is correct already. Just tell RCU */
> >                 rcu_irq_exit();
> >         }
> >
> > So in case IF is set then this already told the tracer and lockdep that
> > interrupts are enabled. And contrary to the ugly version this exit path
> > does not use rcu_irq_exit_preempt() which is there to warn about crappy
> > RCU state when trying to schedule.
> >
> > So we went to great lengths to sanitize _all_ of this and make it consistent
> > just to say: screw it for that xen thingy.
> >
> > The extra checks and extra warnings for scheduling come with the
> > guarantee to bitrot when idtentry_exit() or any logic invoked from there
> > is changed. It's going to look like this:
> >
> >         /*
> >          * If the below causes problems due to inconsistent state
> >          * or out of sync sanity checks, please complain to
> >          * luto@kernel.org directly.
> >          */
> >         idtentry_exit();
> >
> >         if (user_mode(regs) || !(regs->flags & X86_EFLAGS_IF))
> >                 return;
> >
> >         if (!__this_cpu_read(xen_in_preemptible_hcall))
> >                 return;
> >
> >         rcu_sanity_check_for_preemption();
> >
> >         if (need_resched()) {
> >                 instrumentation_begin();
> >                 xen_maybe_preempt_hcall();
> >                 trace_hardirqs_on();
> >                 instrumentation_end();
> >         }
> >
> > Of course you need the extra rcu_sanity_check_for_preemption() function
> > just for this muck.
> >
> > That's a true win on all ends? I don't think so.
>
> Hmm, fair enough.  I guess the IRQ tracing messes a bunch of this logic up.
>
> Let's keep your patch as is and consider cleanups later.  One approach
> might be to make this work more like extable handling: instead of
> trying to schedule from inside the interrupt handler here, patch up
> RIP and perhaps some other registers and let the actual Xen code just
> do cond_resched().  IOW, try to make this work the way it always
> should have:
>
> int ret;
> do {
>   ret = issue_the_hypercall();
>   cond_resched();
> } while (ret == EAGAIN);

Andrew Cooper pointed out that there is too much magic in Xen for this
to work.  So never mind.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 15:36                 ` Andy Lutomirski
  2020-05-20 16:51                   ` Andy Lutomirski
@ 2020-05-20 17:38                   ` Paul E. McKenney
  2020-05-20 17:47                     ` Andy Lutomirski
  1 sibling, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20 17:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > Andy Lutomirski <luto@kernel.org> writes:
> > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >> Thomas Gleixner <tglx@linutronix.de> writes:
> > > > >> It's about this:
> > > > >>
> > > > >> rcu_nmi_enter()
> > > > >> {
> > > > >>         if (!rcu_is_watching()) {
> > > > >>             make it watch;
> > > > >>         } else if (!in_nmi()) {
> > > > >>             do_magic_nohz_dyntick_muck();
> > > > >>         }
> > > > >>
> > > > >> So if we do all irq/system vector entries conditional then the
> > > > >> do_magic() gets never executed. After that I got lost...
> > > > >
> > > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > idtentry_enter().
> > > > >
> > > > > Paul, what is this code actually trying to do?
> > > >
> > > > Citing Paul from IRC:
> > > >
> > > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > > >    if this is not a nohz_full CPU.
> > > >
> > > >    Or if this is a nohz_full CPU, and the tick is already
> > > >    enabled, in that case you could also leave out the rcu_irq_enter().
> > > >
> > > >    Or even if this is a nohz_full CPU and it does not have the tick
> > > >    enabled, if it has been in the kernel less than a few tens of
> > > >    milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > >
> > > >    But my guess is that it would be a lot simpler to just always call
> > > >    it.
> > > >
> > > > Hope that helps.
> > >
> > > Maybe?
> > >
> > > Unless I've missed something, the effect here is that #PF hitting in
> > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > (because you converted them) as well as other faults and traps will
> > > call rcu_irq_enter().
> > >
> > > Once upon a time, we did this horrible thing where, on entry from user
> > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > means we could get an IRQ in an extended quiescent state.  This means
> > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > idle code.
> > >
> > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > interrupt.  But I guess we still need to support HLT, which means we
> > > have this mess.
> > >
> > > But I still think we can plausibly get rid of the conditional.
> >
> > You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> > this becomes:
> 
> So, I meant the conditional in tglx's patch that makes page faults special.

OK.

> > >                                                                 If we
> > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> >
> > Or we could be early in the kernel-entry code or late in the kernel-exit
> > code, but as far as I know, preemption is disabled on those code paths.
> > As are interrupts, right?  And interrupts are disabled on the portions
> > of the CPU-hotplug code where RCU is not watching, if I recall correctly.
> 
> Interrupts are off in the parts of the entry/exit that RCU considers
> to be user mode.  We can get various faults, although these should be
> either NMI-like or events that genuinely or effectively happened in
> user mode.

Fair enough!

> > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > entry to the kernel.  Normally, this is fine because that CPU will very
> > quickly exit back to nohz_full userspace execution, so that RCU will
> > see the quiescent state, either by sampling it directly or by deducing
> > the CPU's passage through that quiescent state by comparing with state
> > that was captured earlier.  The grace-period kthread notices the lack
> > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > trigger this code.
> >
> > But if the nohz_full CPU stays in the kernel for an extended time,
> > perhaps due to OOM handling or due to processing of some huge I/O that
> > hits in-memory buffers/cache, then RCU needs some way of detecting
> > quiescent states on that CPU.  This requires the scheduling-clock
> > interrupt to be alive and well.
> >
> > Are there other ways to get this done?  But of course!  RCU could
> > for example use smp_call_function_single() or use workqueues to force
> > execution onto that CPU and enable the tick that way.  This gets a
> > little involved in order to avoid deadlock, but if the added check
> > in rcu_nmi_enter() is causing trouble, something can be arranged.
> > Though that something would cause more latency excursions than
> > does the current code.
> >
> > Or did you have something else in mind?
> 
> I'm trying to understand when we actually need to call the function.
> Is it just the scheduling interrupt that's supposed to call
> rcu_irq_enter()?  But the scheduling interrupt is off, so I'm
> confused.

The scheduling-clock interrupt is indeed off, but if execution remains
in the kernel for an extended time period, this becomes a problem.
RCU quiescent states don't happen, or if they do, they are not reported
to RCU.  Grace periods never end, and the system eventually OOMs.

And it is not all that hard to make a CPU stay in the kernel for minutes
at a time on a large system.

So what happens is that if RCU notices that a given CPU has not responded
in a reasonable time period, it sets that CPU's ->rcu_urgent_qs.  This
flag plays various roles in various configurations, but on nohz_full CPUs
it causes that CPU's next rcu_nmi_enter() invocation to turn that CPU's
tick on.  It also sets that CPU's ->rcu_forced_tick flag, which prevents
redundant turning on of the tick and also causes the quiescent-state
detection code to turn off the tick for this CPU.

As you say, the scheduling-clock tick cannot turn itself on, but
there might be other interrupts, exceptions, and so on that could.
And if nothing like that happens (as might well be the case on a
well-isolated CPU), RCU will eventually force one.  But it waits a few
hundred milliseconds in order to take advantage of whatever naturally
occurring interrupt might appear in the meantime.
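
In case it helps to see the shape of it, the grace-period side boils down
to something like the sketch below.  This is hand-simplified from my
recollection of the FQS scan (the real code lives in
rcu_implicit_dynticks_qs() and rate-limits the resched IPI), and the
helper name is made up for illustration:

static void rcu_poke_nohz_holdout(struct rcu_data *rdp)
{
	if (!tick_nohz_full_cpu(rdp->cpu))
		return;

	/*
	 * Ask the CPU to report a quiescent state and to force its tick
	 * on at its next rcu_nmi_enter().
	 */
	WRITE_ONCE(rdp->rcu_urgent_qs, true);

	/*
	 * If no naturally occurring interrupt shows up for a few hundred
	 * milliseconds, manufacture one: a resched IPI drags the CPU
	 * through the entry code, where the check above takes effect.
	 */
	resched_cpu(rdp->cpu);
}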

Does that help?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 17:38                   ` Paul E. McKenney
@ 2020-05-20 17:47                     ` Andy Lutomirski
  2020-05-20 18:11                       ` Paul E. McKenney
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 17:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 10:38 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > > Andy Lutomirski <luto@kernel.org> writes:
> > > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > > >> Thomas Gleixner <tglx@linutronix.de> writes:
> > > > > >> It's about this:
> > > > > >>
> > > > > >> rcu_nmi_enter()
> > > > > >> {
> > > > > >>         if (!rcu_is_watching()) {
> > > > > >>             make it watch;
> > > > > >>         } else if (!in_nmi()) {
> > > > > >>             do_magic_nohz_dyntick_muck();
> > > > > >>         }
> > > > > >>
> > > > > >> So if we do all irq/system vector entries conditional then the
> > > > > >> do_magic() gets never executed. After that I got lost...
> > > > > >
> > > > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > > idtentry_enter().
> > > > > >
> > > > > > Paul, what is this code actually trying to do?
> > > > >
> > > > > Citing Paul from IRC:
> > > > >
> > > > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > > > >    if this is not a nohz_full CPU.
> > > > >
> > > > >    Or if this is a nohz_full CPU, and the tick is already
> > > > >    enabled, in that case you could also leave out the rcu_irq_enter().
> > > > >
> > > > >    Or even if this is a nohz_full CPU and it does not have the tick
> > > > >    enabled, if it has been in the kernel less than a few tens of
> > > > >    milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > > >
> > > > >    But my guess is that it would be a lot simpler to just always call
> > > > >    it.
> > > > >
> > > > > Hope that helps.
> > > >
> > > > Maybe?
> > > >
> > > > Unless I've missed something, the effect here is that #PF hitting in
> > > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > > (because you converted them) as well as other faults and traps will
> > > > call rcu_irq_enter().
> > > >
> > > > Once upon a time, we did this horrible thing where, on entry from user
> > > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > > means we could get an IRQ in an extended quiescent state.  This means
> > > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > > idle code.
> > > >
> > > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > > interrupt.  But I guess we still need to support HLT, which means we
> > > > have this mess.
> > > >
> > > > But I still think we can plausibly get rid of the conditional.
> > >
> > > You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> > > this becomes:
> >
> > So, I meant the conditional in tglx's patch that makes page faults special.
>
> OK.
>
> > > >                                                                 If we
> > > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> > >
> > > Or we could be early in the kernel-entry code or late in the kernel-exit
> > > code, but as far as I know, preemption is disabled on those code paths.
> > > As are interrupts, right?  And interrupts are disabled on the portions
> > > of the CPU-hotplug code where RCU is not watching, if I recall correctly.
> >
> > Interrupts are off in the parts of the entry/exit that RCU considers
> > to be user mode.  We can get various faults, although these should be
> > either NMI-like or events that genuinely or effectively happened in
> > user mode.
>
> Fair enough!
>
> > > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > > entry to the kernel.  Normally, this is fine because that CPU will very
> > > quickly exit back to nohz_full userspace execution, so that RCU will
> > > see the quiescent state, either by sampling it directly or by deducing
> > > the CPU's passage through that quiescent state by comparing with state
> > > that was captured earlier.  The grace-period kthread notices the lack
> > > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > > trigger this code.
> > >
> > > But if the nohz_full CPU stays in the kernel for an extended time,
> > > perhaps due to OOM handling or due to processing of some huge I/O that
> > > hits in-memory buffers/cache, then RCU needs some way of detecting
> > > quiescent states on that CPU.  This requires the scheduling-clock
> > > interrupt to be alive and well.
> > >
> > > Are there other ways to get this done?  But of course!  RCU could
> > > for example use smp_call_function_single() or use workqueues to force
> > > execution onto that CPU and enable the tick that way.  This gets a
> > > little involved in order to avoid deadlock, but if the added check
> > > in rcu_nmi_enter() is causing trouble, something can be arranged.
> > > Though that something would cause more latency excursions than
> > > does the current code.
> > >
> > > Or did you have something else in mind?
> >
> > I'm trying to understand when we actually need to call the function.
> > Is it just the scheduling interrupt that's supposed to call
> > rcu_irq_enter()?  But the scheduling interrupt is off, so I'm
> > confused.
>
> The scheduling-clock interrupt is indeed off, but if execution remains
> in the kernel for an extended time period, this becomes a problem.
> RCU quiescent states don't happen, or if they do, they are not reported
> to RCU.  Grace periods never end, and the system eventually OOMs.
>
> And it is not all that hard to make a CPU stay in the kernel for minutes
> at a time on a large system.
>
> So what happens is that if RCU notices that a given CPU has not responded
> in a reasonable time period, it sets that CPU's ->rcu_urgent_qs.  This
> flag plays various roles in various configurations, but on nohz_full CPUs
> it causes that CPU's next rcu_nmi_enter() invocation to turn that CPU's
> tick on.  It also sets that CPU's ->rcu_forced_tick flag, which prevents
> redundant turning on of the tick and also causes the quiescent-state
> detection code to turn off the tick for this CPU.
>
> As you say, the scheduling-clock tick cannot turn itself on, but
> there might be other interrupts, exceptions, and so on that could.
> And if nothing like that happens (as might well be the case on a
> well-isolated CPU), RCU will eventually force one.  But it waits a few
> hundred milliseconds in order to take advantage of whatever naturally
> occurring interrupt might appear in the meantime.
>
> Does that help?

Yes, I think.  Could this go in a comment in the new function?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 16:51                   ` Andy Lutomirski
@ 2020-05-20 18:05                     ` Paul E. McKenney
  2020-05-20 19:49                       ` Thomas Gleixner
  2020-05-20 18:32                     ` Thomas Gleixner
  2020-05-20 19:24                     ` Thomas Gleixner
  2 siblings, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20 18:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
> On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> First, the patch as you submitted it is Acked-by: Andy Lutomirski
> <luto@kernel.org>.  I think there are cleanups that should happen, but
> I think the patch is correct.
> 
> About cleanups, concretely:  I think that everything that calls
> __idtentry_enter() is called in one of a small number of relatively
> sane states:
> 
> 1. User mode.  This is easy.
> 
> 2. Kernel, RCU is watching, everything is sane.  We don't actually
> need to do any RCU entry/exit pairs -- we should be okay with just a
> hypothetical RCU tickle (and IRQ tracing, etc).  This variant can
> sleep after the entry part finishes if regs->flags & IF and no one
> turned off preemption.
> 
> 3. Kernel, RCU is not watching, system was idle.  This can only be an
> actual interrupt.
> 
> So maybe the code can change to:
> 
>     if (user_mode(regs)) {
>         enter_from_user_mode();
>     } else {
>         if (!__rcu_is_watching()) {
>             /*
>              * If RCU is not watching then the same careful
>              * sequence vs. lockdep and tracing is required.
>              *
>              * This only happens for IRQs that hit the idle loop, and
>              * even that only happens if we aren't using the sane
>              * MWAIT-while-IF=0 mode.
>              */
>             lockdep_hardirqs_off(CALLER_ADDR0);
>             rcu_irq_enter();
>             instrumentation_begin();
>             trace_hardirqs_off_prepare();
>             instrumentation_end();
>             return true;
>         } else {
>             /*
>              * If RCU is watching then the combo function
>              * can be used.
>              */
>             instrumentation_begin();
>             trace_hardirqs_off();
>             rcu_tickle();
>             instrumentation_end();
>         }
>     }
>     return false;
> 
> This is exactly what you have except that the cond_rcu part is gone
> and I added rcu_tickle().
> 
> Paul, the major change here is that if an IRQ hits normal kernel code
> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> bases?

From an RCU viewpoint, yes, give or take my concerns about someone
putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
I can bring some lockdep trickery to bear.

But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
portions of this.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 17:47                     ` Andy Lutomirski
@ 2020-05-20 18:11                       ` Paul E. McKenney
  0 siblings, 0 replies; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20 18:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 10:47:29AM -0700, Andy Lutomirski wrote:
> On Wed, May 20, 2020 at 10:38 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > > > Andy Lutomirski <luto@kernel.org> writes:
> > > > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > > > >> Thomas Gleixner <tglx@linutronix.de> writes:
> > > > > > >> It's about this:
> > > > > > >>
> > > > > > >> rcu_nmi_enter()
> > > > > > >> {
> > > > > > >>         if (!rcu_is_watching()) {
> > > > > > >>             make it watch;
> > > > > > >>         } else if (!in_nmi()) {
> > > > > > >>             do_magic_nohz_dyntick_muck();
> > > > > > >>         }
> > > > > > >>
> > > > > > >> So if we do all irq/system vector entries conditional then the
> > > > > > >> do_magic() gets never executed. After that I got lost...
> > > > > > >
> > > > > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > > > idtentry_enter().
> > > > > > >
> > > > > > > Paul, what is this code actually trying to do?
> > > > > >
> > > > > > Citing Paul from IRC:
> > > > > >
> > > > > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > > > > >    if this is not a nohz_full CPU.
> > > > > >
> > > > > >    Or if this is a nohz_full CPU, and the tick is already
> > > > > >    enabled, in that case you could also leave out the rcu_irq_enter().
> > > > > >
> > > > > >    Or even if this is a nohz_full CPU and it does not have the tick
> > > > > >    enabled, if it has been in the kernel less than a few tens of
> > > > > >    milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > > > >
> > > > > >    But my guess is that it would be a lot simpler to just always call
> > > > > >    it.
> > > > > >
> > > > > > Hope that helps.
> > > > >
> > > > > Maybe?
> > > > >
> > > > > Unless I've missed something, the effect here is that #PF hitting in
> > > > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > > > (because you converted them) as well as other faults and traps will
> > > > > call rcu_irq_enter().
> > > > >
> > > > > Once upon a time, we did this horrible thing where, on entry from user
> > > > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > > > means we could get an IRQ in an extended quiescent state.  This means
> > > > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > > > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > > > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > > > idle code.
> > > > >
> > > > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > > > interrupt.  But I guess we still need to support HLT, which means we
> > > > > have this mess.
> > > > >
> > > > > But I still think we can plausibly get rid of the conditional.
> > > >
> > > > You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> > > > this becomes:
> > >
> > > So, I meant the conditional in tglx's patch that makes page faults special.
> >
> > OK.
> >
> > > > >                                                                 If we
> > > > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> > > >
> > > > Or we could be early in the kernel-entry code or late in the kernel-exit
> > > > code, but as far as I know, preemption is disabled on those code paths.
> > > > As are interrupts, right?  And interrupts are disabled on the portions
> > > > of the CPU-hotplug code where RCU is not watching, if I recall correctly.
> > >
> > > Interrupts are off in the parts of the entry/exit that RCU considers
> > > to be user mode.  We can get various faults, although these should be
> > > either NMI-like or events that genuinely or effectively happened in
> > > user mode.
> >
> > Fair enough!
> >
> > > > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > > > entry to the kernel.  Normally, this is fine because that CPU will very
> > > > quickly exit back to nohz_full userspace execution, so that RCU will
> > > > see the quiescent state, either by sampling it directly or by deducing
> > > > the CPU's passage through that quiescent state by comparing with state
> > > > that was captured earlier.  The grace-period kthread notices the lack
> > > > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > > > trigger this code.
> > > >
> > > > But if the nohz_full CPU stays in the kernel for an extended time,
> > > > perhaps due to OOM handling or due to processing of some huge I/O that
> > > > hits in-memory buffers/cache, then RCU needs some way of detecting
> > > > quiescent states on that CPU.  This requires the scheduling-clock
> > > > interrupt to be alive and well.
> > > >
> > > > Are there other ways to get this done?  But of course!  RCU could
> > > > for example use smp_call_function_single() or use workqueues to force
> > > > execution onto that CPU and enable the tick that way.  This gets a
> > > > little involved in order to avoid deadlock, but if the added check
> > > > in rcu_nmi_enter() is causing trouble, something can be arranged.
> > > > Though that something would cause more latency excursions than
> > > > does the current code.
> > > >
> > > > Or did you have something else in mind?
> > >
> > > I'm trying to understand when we actually need to call the function.
> > > Is it just the scheduling interrupt that's supposed to call
> > > rcu_irq_enter()?  But the scheduling interrupt is off, so I'm
> > > confused.
> >
> > The scheduling-clock interrupt is indeed off, but if execution remains
> > in the kernel for an extended time period, this becomes a problem.
> > RCU quiescent states don't happen, or if they do, they are not reported
> > to RCU.  Grace periods never end, and the system eventually OOMs.
> >
> > And it is not all that hard to make a CPU stay in the kernel for minutes
> > at a time on a large system.
> >
> > So what happens is that if RCU notices that a given CPU has not responded
> > in a reasonable time period, it sets that CPU's ->rcu_urgent_qs.  This
> > flag plays various roles in various configurations, but on nohz_full CPUs
> > it causes that CPU's next rcu_nmi_enter() invocation to turn that CPU's
> > tick on.  It also sets that CPU's ->rcu_forced_tick flag, which prevents
> > redundant turning on of the tick and also causes the quiescent-state
> > detection code to turn off the tick for this CPU.
> >
> > As you say, the scheduling-clock tick cannot turn itself on, but
> > there might be other interrupts, exceptions, and so on that could.
> > And if nothing like that happens (as might well be the case on a
> > well-isolated CPU), RCU will eventually force one.  But it waits a few
> > hundred milliseconds in order to take advantage of whatever naturally
> > occurring interrupt might appear in the meantime.
> >
> > Does that help?
> 
> Yes, I think.  Could this go in a comment in the new function?

Even if we don't go with the new function, evidence indicates that this
commentary should go somewhere.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 16:51                   ` Andy Lutomirski
  2020-05-20 18:05                     ` Paul E. McKenney
@ 2020-05-20 18:32                     ` Thomas Gleixner
  2020-05-20 19:24                     ` Thomas Gleixner
  2 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 18:32 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski
  Cc: Paul E. McKenney, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> So maybe the code can change to:
>
>     if (user_mode(regs)) {
>         enter_from_user_mode();
>     } else {
>         if (!__rcu_is_watching()) {
>             /*
>              * If RCU is not watching then the same careful
>              * sequence vs. lockdep and tracing is required.
>              *
>              * This only happens for IRQs that hit the idle loop, and
>              * even that only happens if we aren't using the sane
>              * MWAIT-while-IF=0 mode.
>              */
>             lockdep_hardirqs_off(CALLER_ADDR0);
>             rcu_irq_enter();
>             instrumentation_begin();
>             trace_hardirqs_off_prepare();
>             instrumentation_end();
>             return true;
>         } else {
>             /*
>              * If RCU is watching then the combo function
>              * can be used.
>              */
>             instrumentation_begin();
>             trace_hardirqs_off();
>             rcu_tickle();
>             instrumentation_end();
>         }
>     }
>     return false;
>
> This is exactly what you have except that the cond_rcu part is gone
> and I added rcu_tickle().
>
> Paul, the major change here is that if an IRQ hits normal kernel code
> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> bases?

Fine with me, but the final vote needs to come from Paul and Joel.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code
  2020-05-20 17:13       ` Andy Lutomirski
@ 2020-05-20 18:33         ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 18:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:

> On Wed, May 20, 2020 at 8:17 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Andy Lutomirski <luto@kernel.org> writes:
>>
>> > On Fri, May 15, 2020 at 5:11 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >
>> > I think something's missing here.  With this patch applied, don't we
>> > get to exc_debug_kernel() -> handle_debug() without doing
>> > idtentry_enter() or equivalent?  And that can even enable IRQs.
>> >
>> > Maybe exc_debug_kernel() should wrap handle_debug() in some
>> > appropriate _enter() / _exit() pair?
>>
>> I'm the one who is missing something here, i.e. the connection of this
>> patch to #DB. exc_debug_kernel() still looks like this:
>>
>>         nmi_enter_notrace();
>>         handle_debug(regs, dr6, false);
>>         nmi_exit_notrace();
>>
>> Confused.
>>
>
> Hmm.  I guess the code is correct-ish or at least as correct as it
> ever was.  But $SUBJECT says "Move paranoid irq tracing out of ASM
> code" but you didn't move it into all the users.  So now the NMI code
> does trace_hardirqs_on_prepare() but the #DB code doesn't.  Perhaps
> the changelog should mention this.

Duh. I simply forgot to add it.

> exc_debug_kernel() is an atrocity.  Every now and then I get started
> on cleaning it up and so far I always get mired in the giant amount of
> indirection.
>
> So Acked-by: Andy Lutomirski <luto@kernel.org> if you write a credible
> changelog.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20 17:22             ` Andy Lutomirski
@ 2020-05-20 19:16               ` Thomas Gleixner
  2020-05-20 23:21                 ` Andy Lutomirski
  2020-05-21  2:23                 ` Boris Ostrovsky
  0 siblings, 2 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 19:16 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski
  Cc: Andrew Cooper, LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> Andrew Cooper pointed out that there is too much magic in Xen for this
> to work.  So never mind.

:)

But you made me stare more at that stuff and I came up with a way
simpler solution. See below.

Thanks,

        tglx

8<--------------

 arch/x86/entry/common.c         |   75 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/entry/entry_32.S       |   17 ++++-----
 arch/x86/entry/entry_64.S       |   22 +++--------
 arch/x86/include/asm/idtentry.h |   13 ++++++
 arch/x86/xen/setup.c            |    4 +-
 arch/x86/xen/smp_pv.c           |    3 +
 arch/x86/xen/xen-asm_32.S       |   12 +++---
 arch/x86/xen/xen-asm_64.S       |    2 -
 arch/x86/xen/xen-ops.h          |    1 
 drivers/xen/Makefile            |    2 -
 drivers/xen/preempt.c           |   42 ----------------------
 11 files changed, 115 insertions(+), 78 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -27,6 +27,9 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 
+#include <xen/xen-ops.h>
+#include <xen/events.h>
+
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
@@ -35,6 +38,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/io_bitmap.h>
 #include <asm/syscall.h>
+#include <asm/irq_stack.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -539,7 +543,8 @@ void noinstr idtentry_enter(struct pt_re
 	}
 }
 
-static __always_inline void __idtentry_exit(struct pt_regs *regs)
+static __always_inline void __idtentry_exit(struct pt_regs *regs,
+					    bool may_resched)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -548,7 +553,7 @@ static __always_inline void __idtentry_e
 		prepare_exit_to_usermode(regs);
 	} else if (regs->flags & X86_EFLAGS_IF) {
 		/* Check kernel preemption, if enabled */
-		if (IS_ENABLED(CONFIG_PREEMPTION)) {
+		if (IS_ENABLED(CONFIG_PREEMPTION) || may_resched) {
 			/*
 			 * This needs to be done very carefully.
 			 * idtentry_enter() invoked rcu_irq_enter(). This
@@ -612,5 +617,69 @@ static __always_inline void __idtentry_e
  */
 void noinstr idtentry_exit(struct pt_regs *regs)
 {
-	__idtentry_exit(regs);
+	__idtentry_exit(regs, false);
+}
+
+#ifdef CONFIG_XEN_PV
+
+#ifndef CONFIG_PREEMPTION
+/*
+ * Some hypercalls issued by the toolstack can take many 10s of
+ * seconds. Allow tasks running hypercalls via the privcmd driver to
+ * be voluntarily preempted even if full kernel preemption is
+ * disabled.
+ *
+ * Such preemptible hypercalls are bracketed by
+ * xen_preemptible_hcall_begin() and xen_preemptible_hcall_end()
+ * calls.
+ */
+DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
+EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
+
+/*
+ * In case of scheduling the flag must be cleared and restored after
+ * returning from schedule as the task might move to a different CPU.
+ */
+static __always_inline bool get_and_clear_inhcall(void)
+{
+	bool inhcall = __this_cpu_read(xen_in_preemptible_hcall);
+
+	__this_cpu_write(xen_in_preemptible_hcall, false);
+	return inhcall;
+}
+
+static __always_inline void restore_inhcall(bool inhcall)
+{
+	__this_cpu_write(xen_in_preemptible_hcall, inhcall);
+}
+#else
+static __always_inline bool get_and_clear_inhcall(void) { return false; }
+static __always_inline void restore_inhcall(bool inhcall) { }
+#endif
+
+static void __xen_pv_evtchn_do_upcall(void)
+{
+	irq_enter_rcu();
+	inc_irq_stat(irq_hv_callback_count);
+
+	xen_hvm_evtchn_do_upcall();
+
+	irq_exit_rcu();
+}
+
+__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs;
+	bool inhcall;
+
+	idtentry_enter(regs);
+	old_regs = set_irq_regs(regs);
+
+	run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL, regs);
+
+	set_irq_regs(old_regs);
+
+	inhcall = get_and_clear_inhcall();
+	__idtentry_exit(regs, inhcall);
+	restore_inhcall(inhcall);
 }
+#endif /* CONFIG_XEN_PV */
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1298,7 +1298,10 @@ SYM_CODE_END(native_iret)
 #endif
 
 #ifdef CONFIG_XEN_PV
-SYM_FUNC_START(xen_hypervisor_callback)
+/*
+ * See comment in entry_64.S for further explanation
+ */
+SYM_FUNC_START(exc_xen_hypervisor_callback)
 	/*
 	 * Check to see if we got the event in the critical
 	 * region in xen_iret_direct, after we've reenabled
@@ -1315,14 +1318,11 @@ SYM_FUNC_START(xen_hypervisor_callback)
 	pushl	$-1				/* orig_ax = -1 => not a system call */
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
-	TRACE_IRQS_OFF
+
 	mov	%esp, %eax
-	call	xen_evtchn_do_upcall
-#ifndef CONFIG_PREEMPTION
-	call	xen_maybe_preempt_hcall
-#endif
-	jmp	ret_from_intr
-SYM_FUNC_END(xen_hypervisor_callback)
+	call	xen_pv_evtchn_do_upcall
+	jmp	handle_exception_return
+SYM_FUNC_END(exc_xen_hypervisor_callback)
 
 /*
  * Hypervisor uses this for application faults while it executes.
@@ -1464,6 +1464,7 @@ SYM_CODE_START_LOCAL_NOALIGN(handle_exce
 	movl	%esp, %eax			# pt_regs pointer
 	CALL_NOSPEC edi
 
+handle_exception_return:
 #ifdef CONFIG_VM86
 	movl	PT_EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb	PT_CS(%esp), %al
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1067,10 +1067,6 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work
 
 idtentry	X86_TRAP_PF		page_fault		do_page_fault			has_error_code=1
 
-#ifdef CONFIG_XEN_PV
-idtentry	512 /* dummy */		hypervisor_callback	xen_do_hypervisor_callback	has_error_code=0
-#endif
-
 /*
  * Reload gs selector with exception handling
  * edi:  new selector
@@ -1158,9 +1154,10 @@ SYM_FUNC_END(asm_call_on_stack)
  * So, on entry to the handler we detect whether we interrupted an
  * existing activation in its critical region -- if so, we pop the current
  * activation and restart the handler using the previous one.
+ *
+ * C calling convention: exc_xen_hypervisor_callback(struct *pt_regs)
  */
-/* do_hypervisor_callback(struct *pt_regs) */
-SYM_CODE_START_LOCAL(xen_do_hypervisor_callback)
+SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
 
 /*
  * Since we don't modify %rdi, evtchn_do_upall(struct *pt_regs) will
@@ -1170,15 +1167,10 @@ SYM_CODE_START_LOCAL(xen_do_hypervisor_c
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
 	UNWIND_HINT_REGS
 
-	ENTER_IRQ_STACK old_rsp=%r10
-	call	xen_evtchn_do_upcall
-	LEAVE_IRQ_STACK
-
-#ifndef CONFIG_PREEMPTION
-	call	xen_maybe_preempt_hcall
-#endif
-	jmp	error_exit
-SYM_CODE_END(xen_do_hypervisor_callback)
+	call	xen_pv_evtchn_do_upcall
+
+	jmp	error_return
+SYM_CODE_END(exc_xen_hypervisor_callback)
 
 /*
  * Hypervisor uses this for application faults while it executes.
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -332,6 +332,13 @@ static __always_inline void __##func(str
  * This avoids duplicate defines and ensures that everything is consistent.
  */
 
+/*
+ * Dummy trap number so the low level ASM macro vector number checks do not
+ * match which results in emitting plain IDTENTRY stubs without bells and
+ * whistles.
+ */
+#define X86_TRAP_OTHER		0xFFFF
+
 /* Simple exception entry points. No hardware error code */
 DECLARE_IDTENTRY(X86_TRAP_DE,		exc_divide_error);
 DECLARE_IDTENTRY(X86_TRAP_OF,		exc_overflow);
@@ -371,4 +378,10 @@ DECLARE_IDTENTRY_XEN(X86_TRAP_DB,	debug)
 /* #DF */
 DECLARE_IDTENTRY_DF(X86_TRAP_DF,	exc_double_fault);
 
+#ifdef CONFIG_XEN_PV
+DECLARE_IDTENTRY(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
+#endif
+
+#undef X86_TRAP_OTHER
+
 #endif
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -20,6 +20,7 @@
 #include <asm/setup.h>
 #include <asm/acpi.h>
 #include <asm/numa.h>
+#include <asm/idtentry.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
 
@@ -993,7 +994,8 @@ static void __init xen_pvmmu_arch_setup(
 	HYPERVISOR_vm_assist(VMASST_CMD_enable,
 			     VMASST_TYPE_pae_extended_cr3);
 
-	if (register_callback(CALLBACKTYPE_event, xen_hypervisor_callback) ||
+	if (register_callback(CALLBACKTYPE_event,
+			      xen_asm_exc_xen_hypervisor_callback) ||
 	    register_callback(CALLBACKTYPE_failsafe, xen_failsafe_callback))
 		BUG();
 
--- a/arch/x86/xen/smp_pv.c
+++ b/arch/x86/xen/smp_pv.c
@@ -27,6 +27,7 @@
 #include <asm/paravirt.h>
 #include <asm/desc.h>
 #include <asm/pgtable.h>
+#include <asm/idtentry.h>
 #include <asm/cpu.h>
 
 #include <xen/interface/xen.h>
@@ -347,7 +348,7 @@ cpu_initialize_context(unsigned int cpu,
 	ctxt->gs_base_kernel = per_cpu_offset(cpu);
 #endif
 	ctxt->event_callback_eip    =
-		(unsigned long)xen_hypervisor_callback;
+		(unsigned long)xen_asm_exc_xen_hypervisor_callback;
 	ctxt->failsafe_callback_eip =
 		(unsigned long)xen_failsafe_callback;
 	per_cpu(xen_cr3, cpu) = __pa(swapper_pg_dir);
--- a/arch/x86/xen/xen-asm_32.S
+++ b/arch/x86/xen/xen-asm_32.S
@@ -93,7 +93,7 @@ SYM_CODE_START(xen_iret)
 
 	/*
 	 * If there's something pending, mask events again so we can
-	 * jump back into xen_hypervisor_callback. Otherwise do not
+	 * jump back into exc_xen_hypervisor_callback. Otherwise do not
 	 * touch XEN_vcpu_info_mask.
 	 */
 	jne 1f
@@ -113,7 +113,7 @@ SYM_CODE_START(xen_iret)
 	 * Events are masked, so jumping out of the critical region is
 	 * OK.
 	 */
-	je xen_hypervisor_callback
+	je asm_exc_xen_hypervisor_callback
 
 1:	iret
 xen_iret_end_crit:
@@ -127,7 +127,7 @@ SYM_CODE_END(xen_iret)
 	.globl xen_iret_start_crit, xen_iret_end_crit
 
 /*
- * This is called by xen_hypervisor_callback in entry_32.S when it sees
+ * This is called by exc_xen_hypervisor_callback in entry_32.S when it sees
  * that the EIP at the time of interrupt was between
  * xen_iret_start_crit and xen_iret_end_crit.
  *
@@ -144,7 +144,7 @@ SYM_CODE_END(xen_iret)
  *	 eflags		}
  *	 cs		}  nested exception info
  *	 eip		}
- *	 return address	: (into xen_hypervisor_callback)
+ *	 return address	: (into asm_exc_xen_hypervisor_callback)
  *
  * In order to deliver the nested exception properly, we need to discard the
  * nested exception frame such that when we handle the exception, we do it
@@ -152,7 +152,8 @@ SYM_CODE_END(xen_iret)
  *
  * The only caveat is that if the outer eax hasn't been restored yet (i.e.
  * it's still on stack), we need to restore its value here.
- */
+*/
+.pushsection .noinstr.text, "ax"
 SYM_CODE_START(xen_iret_crit_fixup)
 	/*
 	 * Paranoia: Make sure we're really coming from kernel space.
@@ -181,3 +182,4 @@ SYM_CODE_START(xen_iret_crit_fixup)
 2:
 	ret
 SYM_CODE_END(xen_iret_crit_fixup)
+.popsection
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -54,7 +54,7 @@ xen_pv_trap asm_exc_simd_coprocessor_err
 #ifdef CONFIG_IA32_EMULATION
 xen_pv_trap entry_INT80_compat
 #endif
-xen_pv_trap hypervisor_callback
+xen_pv_trap asm_exc_xen_hypervisor_callback
 
 	__INIT
 SYM_CODE_START(xen_early_idt_handler_array)
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -8,7 +8,6 @@
 #include <xen/xen-ops.h>
 
 /* These are code, but not functions.  Defined in entry.S */
-extern const char xen_hypervisor_callback[];
 extern const char xen_failsafe_callback[];
 
 void xen_sysenter_target(void);
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_HOTPLUG_CPU)		+= cpu_hotplug.o
-obj-y	+= grant-table.o features.o balloon.o manage.o preempt.o time.o
+obj-y	+= grant-table.o features.o balloon.o manage.o time.o
 obj-y	+= mem-reservation.o
 obj-y	+= events/
 obj-y	+= xenbus/
--- a/drivers/xen/preempt.c
+++ /dev/null
@@ -1,42 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- * Preemptible hypercalls
- *
- * Copyright (C) 2014 Citrix Systems R&D ltd.
- */
-
-#include <linux/sched.h>
-#include <xen/xen-ops.h>
-
-#ifndef CONFIG_PREEMPTION
-
-/*
- * Some hypercalls issued by the toolstack can take many 10s of
- * seconds. Allow tasks running hypercalls via the privcmd driver to
- * be voluntarily preempted even if full kernel preemption is
- * disabled.
- *
- * Such preemptible hypercalls are bracketed by
- * xen_preemptible_hcall_begin() and xen_preemptible_hcall_end()
- * calls.
- */
-
-DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);
-EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
-
-asmlinkage __visible void xen_maybe_preempt_hcall(void)
-{
-	if (unlikely(__this_cpu_read(xen_in_preemptible_hcall)
-		     && need_resched())) {
-		/*
-		 * Clear flag as we may be rescheduled on a different
-		 * cpu.
-		 */
-		__this_cpu_write(xen_in_preemptible_hcall, false);
-		local_irq_enable();
-		cond_resched();
-		local_irq_disable();
-		__this_cpu_write(xen_in_preemptible_hcall, true);
-	}
-}
-#endif /* CONFIG_PREEMPTION */

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 16:51                   ` Andy Lutomirski
  2020-05-20 18:05                     ` Paul E. McKenney
  2020-05-20 18:32                     ` Thomas Gleixner
@ 2020-05-20 19:24                     ` Thomas Gleixner
  2020-05-20 19:42                       ` Paul E. McKenney
  2 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 19:24 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski
  Cc: Paul E. McKenney, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski <luto@kernel.org> wrote:
>     if (user_mode(regs)) {
>         enter_from_user_mode();
>     } else {
>         if (!__rcu_is_watching()) {
>             /*
>              * If RCU is not watching then the same careful
>              * sequence vs. lockdep and tracing is required.
>              *
>              * This only happens for IRQs that hit the idle loop, and
>              * even that only happens if we aren't using the sane
>              * MWAIT-while-IF=0 mode.
>              */
>             lockdep_hardirqs_off(CALLER_ADDR0);
>             rcu_irq_enter();
>             instrumentation_begin();
>             trace_hardirqs_off_prepare();
>             instrumentation_end();
>             return true;
>         } else {
>             /*
>              * If RCU is watching then the combo function
>              * can be used.
>              */
>             instrumentation_begin();
>             trace_hardirqs_off();
>             rcu_tickle();
>             instrumentation_end();
>         }
>     }
>     return false;
>
> This is exactly what you have except that the cond_rcu part is gone
> and I added rcu_tickle().
>
> Paul, the major change here is that if an IRQ hits normal kernel code
> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> bases?

Just chatted with Paul on IRC and he thinks this should work, but he's
not sure whether it's actually sane :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 19:24                     ` Thomas Gleixner
@ 2020-05-20 19:42                       ` Paul E. McKenney
  0 siblings, 0 replies; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20 19:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 09:24:46PM +0200, Thomas Gleixner wrote:
> Andy Lutomirski <luto@kernel.org> writes:
> > On Wed, May 20, 2020 at 8:36 AM Andy Lutomirski <luto@kernel.org> wrote:
> >     if (user_mode(regs)) {
> >         enter_from_user_mode();
> >     } else {
> >         if (!__rcu_is_watching()) {
> >             /*
> >              * If RCU is not watching then the same careful
> >              * sequence vs. lockdep and tracing is required.
> >              *
> >              * This only happens for IRQs that hit the idle loop, and
> >              * even that only happens if we aren't using the sane
> >              * MWAIT-while-IF=0 mode.
> >              */
> >             lockdep_hardirqs_off(CALLER_ADDR0);
> >             rcu_irq_enter();
> >             instrumentation_begin();
> >             trace_hardirqs_off_prepare();
> >             instrumentation_end();
> >             return true;
> >         } else {
> >             /*
> >              * If RCU is watching then the combo function
> >              * can be used.
> >              */
> >             instrumentation_begin();
> >             trace_hardirqs_off();
> >             rcu_tickle();
> >             instrumentation_end();
> >         }
> >     }
> >     return false;
> >
> > This is exactly what you have except that the cond_rcu part is gone
> > and I added rcu_tickle().
> >
> > Paul, the major change here is that if an IRQ hits normal kernel code
> > (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> > won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> > rcu_tickle() on entry and nothing on exit.  Does that cover all the
> > bases?
> 
> Just chatted with Paul on IRC and he thinks this should work, but he's
> not sure whether it's actually sane :)

I will have more to say after coding it up.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 18:05                     ` Paul E. McKenney
@ 2020-05-20 19:49                       ` Thomas Gleixner
  2020-05-20 22:15                         ` Paul E. McKenney
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 19:49 UTC (permalink / raw)
  To: paulmck, Andy Lutomirski
  Cc: LKML, X86 ML, Alexandre Chartre, Frederic Weisbecker,
	Paolo Bonzini, Sean Christopherson, Masami Hiramatsu,
	Petr Mladek, Steven Rostedt, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
>> Paul, the major change here is that if an IRQ hits normal kernel code
>> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
>> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
>> rcu_tickle() on entry and nothing on exit.  Does that cover all the
>> bases?
>
> From an RCU viewpoint, yes, give or take my concerns about someone
> putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
> I can bring some lockdep trickery to bear.

A surplus rcu_irq_exit() should already trigger alarms today.

> But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
> portions of this.

I don't see a problem. Let me write that into actual testable patch
form.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-19 22:18       ` Steven Rostedt
@ 2020-05-20 19:51         ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 19:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Joel Fernandes, Boris Ostrovsky,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue, 19 May 2020 23:45:10 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> >> @@ -165,20 +155,22 @@ void trace_hwlat_callback(bool enter)
>> >>   * Used to repeatedly capture the CPU TSC (or similar), looking for potential
>> >>   * hardware-induced latency. Called with interrupts disabled and with
>> >>   * hwlat_data.lock held.
>> >> + *
>> >> + * Use ktime_get_mono_fast() here as well because it does not wait on the
>> >> + * timekeeping seqcount like ktime_get_mono().  
>> >
>> > When doing a "git grep ktime_get_mono" I only find
>> > ktime_get_mono_fast_ns() (and this comment), so I don't know what to compare
>> > that to. Did you mean another function?  
>> 
>> Yeah. I fatfingered the comment. The code uses ktime_get_mono_fast_ns().
>
> Well, I assumed that's what you meant with "ktime_get_mono_fast()" but I
> don't know what function you are comparing it to that waits on the seqcount
> like "ktime_get_mono()" as there is no such function.

Gah. ktime_get_mono_fast_ns() and ktime_get() of course.
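
For context, a hedged sketch of why that distinction matters: ktime_get()
spins on the timekeeping seqcount, which can deadlock if an NMI interrupts
the updater, while ktime_get_mono_fast_ns() is latch based.  Simplified,
not the actual kernel source:

	/* Core of ktime_get(): retries while the updater holds the seqcount */
	do {
		seq   = read_seqcount_begin(&tk_core.seq);
		base  = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);
	} while (read_seqcount_retry(&tk_core.seq, seq));
	/* An NMI arriving inside write_seqcount_begin()/end() spins here forever */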

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-18  8:08       ` Peter Zijlstra
@ 2020-05-20 20:09         ` Thomas Gleixner
  2020-05-20 20:14           ` Andy Lutomirski
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 20:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

Peter Zijlstra <peterz@infradead.org> writes:
> On Mon, May 18, 2020 at 10:05:56AM +0200, Thomas Gleixner wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> > On Sat, May 16, 2020 at 01:45:51AM +0200, Thomas Gleixner wrote:
>> >> --- a/arch/x86/kernel/nmi.c
>> >> +++ b/arch/x86/kernel/nmi.c
>> >> @@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
>> >>  	__this_cpu_write(last_nmi_rip, regs->ip);
>> >>  
>> >>  	instrumentation_begin();
>> >> +	ftrace_nmi_handler_enter();
>> >>  
>> >>  	handled = nmi_handle(NMI_LOCAL, regs);
>> >>  	__this_cpu_add(nmi_stats.normal, handled);
>> >> @@ -420,6 +421,7 @@ static noinstr void default_do_nmi(struc
>> >>  		unknown_nmi_error(reason, regs);
>> >>  
>> >>  out:
>> >> +	ftrace_nmi_handler_exit();
>> >>  	instrumentation_end();
>> >>  }
>> >
>> > Yeah, so I'm confused about this and the previous patch too. Why not
>> > do just this? Remove that ftrace_nmi_handler.* crud from
>> > nmi_{enter,exit}() and stick it here? Why do we needs the
>> > nmi_{enter,exit}_notrace() thing?
>> 
>> Because you then have to fixup _all_ architectures which use
>> nmi_enter/exit().
>
> We probably have to anyway. But I can do that later I suppose.

Second thoughts. For #DB and #INT3 we can just keep nmi_enter(), needs
just annotation in nmi_enter() around that trace muck.

For #NMI and #MCE I rather avoid the early trace call and do it once we
have reached "stable" state, i.e. avoid it in the whole nested NMI mess.
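
A rough sketch of that direction, purely illustrative (the real nmi_enter()
contains more pieces than shown; only the instrumentation annotation around
the trace muck is the point here):

#define nmi_enter()						\
	do {							\
		arch_nmi_enter();				\
		lockdep_off();					\
		BUG_ON(in_nmi() == NMI_MASK);			\
		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \
		rcu_nmi_enter();				\
		instrumentation_begin();			\
		ftrace_nmi_enter();				\
		instrumentation_end();				\
	} while (0)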

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-20 20:09         ` Thomas Gleixner
@ 2020-05-20 20:14           ` Andy Lutomirski
  2020-05-20 22:20             ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui



> On May 20, 2020, at 1:10 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
>>> On Mon, May 18, 2020 at 10:05:56AM +0200, Thomas Gleixner wrote:
>>> Peter Zijlstra <peterz@infradead.org> writes:
>>>> On Sat, May 16, 2020 at 01:45:51AM +0200, Thomas Gleixner wrote:
>>>>> --- a/arch/x86/kernel/nmi.c
>>>>> +++ b/arch/x86/kernel/nmi.c
>>>>> @@ -334,6 +334,7 @@ static noinstr void default_do_nmi(struc
>>>>>    __this_cpu_write(last_nmi_rip, regs->ip);
>>>>> 
>>>>>    instrumentation_begin();
>>>>> +    ftrace_nmi_handler_enter();
>>>>> 
>>>>>    handled = nmi_handle(NMI_LOCAL, regs);
>>>>>    __this_cpu_add(nmi_stats.normal, handled);
>>>>> @@ -420,6 +421,7 @@ static noinstr void default_do_nmi(struc
>>>>>        unknown_nmi_error(reason, regs);
>>>>> 
>>>>> out:
>>>>> +    ftrace_nmi_handler_exit();
>>>>>    instrumentation_end();
>>>>> }
>>>> 
>>>> Yeah, so I'm confused about this and the previous patch too. Why not
>>>> do just this? Remove that ftrace_nmi_handler.* crud from
>>>> nmi_{enter,exit}() and stick it here? Why do we needs the
>>>> nmi_{enter,exit}_notrace() thing?
>>> 
>>> Because you then have to fixup _all_ architectures which use
>>> nmi_enter/exit().
>> 
>> We probably have to anyway. But I can do that later I suppose.
> 
> Second thoughts. For #DB and #INT3 we can just keep nmi_enter(), needs
> just annotation in nmi_enter() around that trace muck.
> 
> For #NMI and #MCE I rather avoid the early trace call and do it once we
> have reached "stable" state, i.e. avoid it in the whole nested NMI mess.
> 
> 

What’s the issue?  The actual meat is mostly in the asm for NMI, and for MCE it’s just the sync-all-the-cores thing. The actual simultaneous NMI-and-MCE case is utterly busted regardless, and I’ve been thinking about how to fix it. It won’t be pretty, but nmi_enter() will have nothing to do with it.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-15 23:45 ` [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns() Thomas Gleixner
  2020-05-19 21:26   ` Steven Rostedt
@ 2020-05-20 20:14   ` Peter Zijlstra
  2020-05-20 22:20     ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-20 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

On Sat, May 16, 2020 at 01:45:48AM +0200, Thomas Gleixner wrote:
> Timestamping in the hardware latency detector uses sched_clock() underneath
> and depends on CONFIG_GENERIC_SCHED_CLOCK=n because sched clocks from that
> subsystem are not NMI safe.

AFAICT that's not actually true, see commit:

  1809bfa44e10 ("timers, sched/clock: Avoid deadlock during read from NMI")

that said, no objection to switching to ktime, people that run HPET
clocks deserve all the pain and suffering available with those setups.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 19:49                       ` Thomas Gleixner
@ 2020-05-20 22:15                         ` Paul E. McKenney
  2020-05-20 23:25                           ` Paul E. McKenney
  0 siblings, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20 22:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 09:49:18PM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
> >> Paul, the major change here is that if an IRQ hits normal kernel code
> >> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> >> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> >> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> >> bases?
> >
> > From an RCU viewpoint, yes, give or take my concerns about someone
> > putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
> > I can bring some lockdep trickery to bear.
> 
> A surplus rcu_irq_exit() should already trigger alarms today.

Fair point!

> > But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
> > portions of this.
> 
> I don't see a problem. Let me write that into actual testable patch
> form.

Here is the RCU part, with my current best guess for the commit log.

Please note that this is on top of my -rcu stack, so some adjustment
will likely be needed to pull it underneath Joel's series that removes
the special-purpose bits at the bottom of the ->dynticks counter.

But a starting point, anyway.

							Thanx, Paul

------------------------------------------------------------------------

commit ca05838a9a1809fafee63f488a7be8b30e1c2a6a
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed May 20 15:03:07 2020 -0700

    rcu: Abstract out tickle_nohz_for_rcu() from rcu_nmi_enter()
    
    This commit splits out the nohz_full scheduler-tick enabling from the
    rest of the rcu_nmi_enter() logic.  This allows short exception handlers
    that interrupt kernel code regions that RCU is already watching to just
    invoke tickle_nohz_for_rcu() at exception entry instead of having to
    invoke rcu_nmi_enter() on entry and also rcu_nmi_exit() on all exit paths.
    
    Suggested-by: Andy Lutomirski <luto@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 621556e..d4be42a 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -14,6 +14,10 @@ extern bool synchronize_hardirq(unsigned int irq);
 
 #if defined(CONFIG_TINY_RCU)
 
+static inline void tickle_nohz_for_rcu(void)
+{
+}
+
 static inline void rcu_nmi_enter(void)
 {
 }
@@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
 }
 
 #else
+extern void tickle_nohz_for_rcu(void);
 extern void rcu_nmi_enter(void);
 extern void rcu_nmi_exit(void);
 #endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7812574..0a3cad4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -806,6 +806,67 @@ void noinstr rcu_user_exit(void)
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
+ * tickle_nohz_for_rcu - Enable scheduler tick on CPU if RCU needs it.
+ *
+ * The scheduler tick is not normally enabled when CPUs enter the kernel
+ * from nohz_full userspace execution.  After all, nohz_full userspace
+ * execution is an RCU quiescent state and the time executing in the kernel
+ * is quite short.  Except of course when it isn't.  And it is not hard to
+ * cause a large system to spend tens of seconds or even minutes looping
+ * in the kernel, which can cause a number of problems, include RCU CPU
+ * stall warnings.
+ *
+ * Therefore, if a nohz_full CPU fails to report a quiescent state
+ * in a timely manner, the RCU grace-period kthread sets that CPU's
+ * ->rcu_urgent_qs flag with the expectation that the next interrupt or
+ * exception will invoke this function, which will turn on the scheduler
+ * tick, which will enable RCU to detect that CPU's quiescent states,
+ * for example, due to cond_resched() calls in CONFIG_PREEMPT=n kernels.
+ * The tick will be disabled once a quiescent state is reported for
+ * this CPU.
+ *
+ * Of course, in carefully tuned systems, there might never be an
+ * interrupt or exception.  In that case, the RCU grace-period kthread
+ * will eventually cause one to happen.  However, in less carefully
+ * controlled environments, this function allows RCU to get what it
+ * needs without creating otherwise useless interruptions.
+ */
+noinstr void tickle_nohz_for_rcu(void)
+{
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+	if (in_nmi())
+		return; // Enabling tick is unsafe in NMI handlers.
+	RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(),
+			 "Illegal tickle_nohz_for_rcu from extended quiescent state");
+	instrumentation_begin();
+	if (!tick_nohz_full_cpu(rdp->cpu) ||
+	    !READ_ONCE(rdp->rcu_urgent_qs) ||
+	    READ_ONCE(rdp->rcu_forced_tick)) {
+		// RCU doesn't need nohz_full help from this CPU, or it is
+		// already getting that help.
+		instrumentation_end();
+		return;
+	}
+
+	// We get here only when not in an extended quiescent state and
+	// from interrupts (as opposed to NMIs).  Therefore, (1) RCU is
+	// already watching and (2) The fact that we are in an interrupt
+	// handler and that the rcu_node lock is an irq-disabled lock
+	// prevents self-deadlock.  So we can safely recheck under the lock.
+	// Note that the nohz_full state currently cannot change.
+	raw_spin_lock_rcu_node(rdp->mynode);
+	if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+		// A nohz_full CPU is in the kernel and RCU needs a
+		// quiescent state.  Turn on the tick!
+		WRITE_ONCE(rdp->rcu_forced_tick, true);
+		tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+	}
+	raw_spin_unlock_rcu_node(rdp->mynode);
+	instrumentation_end();
+}
+
+/**
  * rcu_nmi_enter - inform RCU of entry to NMI context
  * @irq: Is this call from rcu_irq_enter?
  *
@@ -835,7 +896,9 @@ noinstr void rcu_nmi_enter(void)
 	 * is if the interrupt arrived in kernel mode; in this case we would
 	 * be the outermost interrupt but still increment by 2 which is Ok.
 	 */
-	if (rcu_dynticks_curr_cpu_in_eqs()) {
+	if (!rcu_dynticks_curr_cpu_in_eqs()) {
+		tickle_nohz_for_rcu();
+	} else {
 
 		if (!in_nmi())
 			rcu_dynticks_task_exit();
@@ -851,28 +914,6 @@ noinstr void rcu_nmi_enter(void)
 		}
 
 		incby = 1;
-	} else if (!in_nmi()) {
-		instrumentation_begin();
-		if (tick_nohz_full_cpu(rdp->cpu) &&
-		    READ_ONCE(rdp->rcu_urgent_qs) &&
-		    !READ_ONCE(rdp->rcu_forced_tick)) {
-			// We get here only if we had already exited the
-			// extended quiescent state and this was an
-			// interrupt (not an NMI).  Therefore, (1) RCU is
-			// already watching and (2) The fact that we are in
-			// an interrupt handler and that the rcu_node lock
-			// is an irq-disabled lock prevents self-deadlock.
-			// So we can safely recheck under the lock.
-			raw_spin_lock_rcu_node(rdp->mynode);
-			if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
-				// A nohz_full CPU is in the kernel and RCU
-				// needs a quiescent state.  Turn on the tick!
-				WRITE_ONCE(rdp->rcu_forced_tick, true);
-				tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
-			}
-			raw_spin_unlock_rcu_node(rdp->mynode);
-		}
-		instrumentation_end();
 	}
 	instrumentation_begin();
 	trace_rcu_dyntick(incby == 1 ? TPS("End") : TPS("StillNonIdle"),

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [patch V6 04/37] x86: Make hardware latency tracing explicit
  2020-05-20 20:14           ` Andy Lutomirski
@ 2020-05-20 22:20             ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 22:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui

Andy Lutomirski <luto@amacapital.net> writes:
>> On May 20, 2020, at 1:10 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>>> We probably have to anyway. But I can do that later I suppose.
>> 
>> Second thoughts. For #DB and #INT3 we can just keep nmi_enter(), needs
>> just annotation in nmi_enter() around that trace muck.
>> 
>> For #NMI and #MCE I rather avoid the early trace call and do it once we
>> have reached "stable" state, i.e. avoid it in the whole nested NMI mess.
>
> What’s the issue?  The actual meat is mostly in the asm for NMI, and
> for MCE it’s just the sync-all-the-cores thing. The actual
> simultaneous NMI-and-MCE case is utterly busted regardless, and I’ve
> been thinking about how to fix it. It won’t be pretty, but nmi_enter()
> will have nothing to do with it.

The issue is that I want to avoid anything which is not essential just
for pure paranoia reasons.

I can drop that and just move the trace muck after RCU is safe and
annotate it properly.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns()
  2020-05-20 20:14   ` Peter Zijlstra
@ 2020-05-20 22:20     ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-20 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, x86, Paul E. McKenney, Andy Lutomirski, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui

Peter Zijlstra <peterz@infradead.org> writes:

> On Sat, May 16, 2020 at 01:45:48AM +0200, Thomas Gleixner wrote:
>> Timestamping in the hardware latency detector uses sched_clock() underneath
>> and depends on CONFIG_GENERIC_SCHED_CLOCK=n because sched clocks from that
>> subsystem are not NMI safe.
>
> AFAICT that's not actually true, see commit:
>
>   1809bfa44e10 ("timers, sched/clock: Avoid deadlock during read from NMI")
>
> that said, no objection to switching to ktime, people that run HPET
> clocks deserve all the pain and suffering available with those setups.

Correct ...

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20 19:16               ` Thomas Gleixner
@ 2020-05-20 23:21                 ` Andy Lutomirski
  2020-05-21 10:45                   ` Thomas Gleixner
  2020-05-21  2:23                 ` Boris Ostrovsky
  1 sibling, 1 reply; 159+ messages in thread
From: Andy Lutomirski @ 2020-05-20 23:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Andrew Cooper, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 12:17 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > Andrew Cooper pointed out that there is too much magic in Xen for this
> > to work.  So never mind.
>
> :)
>
> But you made me stare more at that stuff and I came up with a way
> simpler solution. See below.

I like it, but I bet it can be even simpler if you do the
tickle_whatever_paulmck_call_it() change:

> +__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
> +{
> +       struct pt_regs *old_regs;
> +       bool inhcall;
> +
> +       idtentry_enter(regs);
> +       old_regs = set_irq_regs(regs);
> +
> +       run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL, regs);
> +
> +       set_irq_regs(old_regs);
> +
> +       inhcall = get_and_clear_inhcall();
> +       __idtentry_exit(regs, inhcall);
> +       restore_inhcall(inhcall);

How about:

       inhcall = get_and_clear_inhcall();
       if (inhcall) {
        if (!WARN_ON_ONCE((regs->flags & X86_EFLAGS_IF) || preempt_count())) {
          local_irq_enable();
          cond_resched();
          local_irq_disable();
        }
     }
     restore_inhcall(inhcall);
     idtentry_exit(regs);

This could probably be tidied up by having a xen_maybe_preempt() that
does the inhcall and resched mess.

The point is that, with the tickle_nohz_ stuff, there is nothing
actually preventing IRQ handlers from sleeping as long as they aren't
on the IRQ stack and as long as the interrupted context was safe to
sleep in.

--Andy

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 22:15                         ` Paul E. McKenney
@ 2020-05-20 23:25                           ` Paul E. McKenney
  2020-05-21  8:31                             ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-20 23:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Wed, May 20, 2020 at 03:15:31PM -0700, Paul E. McKenney wrote:
> On Wed, May 20, 2020 at 09:49:18PM +0200, Thomas Gleixner wrote:
> > "Paul E. McKenney" <paulmck@kernel.org> writes:
> > > On Wed, May 20, 2020 at 09:51:17AM -0700, Andy Lutomirski wrote:
> > >> Paul, the major change here is that if an IRQ hits normal kernel code
> > >> (i.e. code where RCU is watching and we're not in an EQS), the IRQ
> > >> won't call rcu_irq_enter() and rcu_irq_exit().  Instead it will call
> > >> rcu_tickle() on entry and nothing on exit.  Does that cover all the
> > >> bases?
> > >
> > > From an RCU viewpoint, yes, give or take my concerns about someone
> > > putting rcu_tickle() on entry and rcu_irq_exit() on exit.  Perhaps
> > > I can bring some lockdep trickery to bear.
> > 
> > A surplus rcu_irq_exit() should already trigger alarms today.
> 
> Fair point!
> 
> > > But I must defer to Thomas and Peter on the non-RCU/non-NO_HZ_FULL
> > > portions of this.
> > 
> > I don't see a problem. Let me write that into actual testable patch
> > form.
> 
> Here is the RCU part, with my current best guess for the commit log.
> 
> Please note that this is on top of my -rcu stack, so some adjustment
> will likely be needed to pull it underneath Joel's series that removes
> the special-purpose bits at the bottom of the ->dynticks counter.
> 
> But a starting point, anyway.

Same patch, but with updated commit log based on IRC discussion
with Andy.

							Thanx, Paul

------------------------------------------------------------------------

commit 1771ea9fac5748d1424d9214c51b2f79cc1176b6
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed May 20 15:03:07 2020 -0700

    rcu: Abstract out tickle_nohz_for_rcu() from rcu_nmi_enter()
    
    There will likely be exception handlers that can sleep, which rules
    out the usual approach of invoking rcu_nmi_enter() on entry and also
    rcu_nmi_exit() on all exit paths.  However, the alternative approach of
    just not calling anything can prevent RCU from coaxing quiescent states
    from nohz_full CPUs that are looping in the kernel:  RCU must instead
    IPI them explicitly.  It would be better to enable the scheduler tick
    on such CPUs to interact with RCU in a lighter-weight manner, and this
    enabling is one of the things that rcu_nmi_enter() currently does.
    
    What is needed is something that helps RCU coax quiescent states while
    not preventing subsequent sleeps.  This commit therefore splits out the
    nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter()
    logic into a new function named tickle_nohz_for_rcu().
    
    Suggested-by: Andy Lutomirski <luto@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 621556e..d4be42a 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -14,6 +14,10 @@ extern bool synchronize_hardirq(unsigned int irq);
 
 #if defined(CONFIG_TINY_RCU)
 
+static inline void tickle_nohz_for_rcu(void)
+{
+}
+
 static inline void rcu_nmi_enter(void)
 {
 }
@@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
 }
 
 #else
+extern void tickle_nohz_for_rcu(void);
 extern void rcu_nmi_enter(void);
 extern void rcu_nmi_exit(void);
 #endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7812574..0a3cad4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -806,6 +806,67 @@ void noinstr rcu_user_exit(void)
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
+ * tickle_nohz_for_rcu - Enable scheduler tick on CPU if RCU needs it.
+ *
+ * The scheduler tick is not normally enabled when CPUs enter the kernel
+ * from nohz_full userspace execution.  After all, nohz_full userspace
+ * execution is an RCU quiescent state and the time executing in the kernel
+ * is quite short.  Except of course when it isn't.  And it is not hard to
+ * cause a large system to spend tens of seconds or even minutes looping
+ * in the kernel, which can cause a number of problems, including RCU CPU
+ * stall warnings.
+ *
+ * Therefore, if a nohz_full CPU fails to report a quiescent state
+ * in a timely manner, the RCU grace-period kthread sets that CPU's
+ * ->rcu_urgent_qs flag with the expectation that the next interrupt or
+ * exception will invoke this function, which will turn on the scheduler
+ * tick, which will enable RCU to detect that CPU's quiescent states,
+ * for example, due to cond_resched() calls in CONFIG_PREEMPT=n kernels.
+ * The tick will be disabled once a quiescent state is reported for
+ * this CPU.
+ *
+ * Of course, in carefully tuned systems, there might never be an
+ * interrupt or exception.  In that case, the RCU grace-period kthread
+ * will eventually cause one to happen.  However, in less carefully
+ * controlled environments, this function allows RCU to get what it
+ * needs without creating otherwise useless interruptions.
+ */
+noinstr void tickle_nohz_for_rcu(void)
+{
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+	if (in_nmi())
+		return; // Enabling tick is unsafe in NMI handlers.
+	RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(),
+			 "Illegal tickle_nohz_for_rcu from extended quiescent state");
+	instrumentation_begin();
+	if (!tick_nohz_full_cpu(rdp->cpu) ||
+	    !READ_ONCE(rdp->rcu_urgent_qs) ||
+	    READ_ONCE(rdp->rcu_forced_tick)) {
+		// RCU doesn't need nohz_full help from this CPU, or it is
+		// already getting that help.
+		instrumentation_end();
+		return;
+	}
+
+	// We get here only when not in an extended quiescent state and
+	// from interrupts (as opposed to NMIs).  Therefore, (1) RCU is
+	// already watching and (2) The fact that we are in an interrupt
+	// handler and that the rcu_node lock is an irq-disabled lock
+	// prevents self-deadlock.  So we can safely recheck under the lock.
+	// Note that the nohz_full state currently cannot change.
+	raw_spin_lock_rcu_node(rdp->mynode);
+	if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+		// A nohz_full CPU is in the kernel and RCU needs a
+		// quiescent state.  Turn on the tick!
+		WRITE_ONCE(rdp->rcu_forced_tick, true);
+		tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+	}
+	raw_spin_unlock_rcu_node(rdp->mynode);
+	instrumentation_end();
+}
+
+/**
  * rcu_nmi_enter - inform RCU of entry to NMI context
  * @irq: Is this call from rcu_irq_enter?
  *
@@ -835,7 +896,9 @@ noinstr void rcu_nmi_enter(void)
 	 * is if the interrupt arrived in kernel mode; in this case we would
 	 * be the outermost interrupt but still increment by 2 which is Ok.
 	 */
-	if (rcu_dynticks_curr_cpu_in_eqs()) {
+	if (!rcu_dynticks_curr_cpu_in_eqs()) {
+		tickle_nohz_for_rcu();
+	} else {
 
 		if (!in_nmi())
 			rcu_dynticks_task_exit();
@@ -851,28 +914,6 @@ noinstr void rcu_nmi_enter(void)
 		}
 
 		incby = 1;
-	} else if (!in_nmi()) {
-		instrumentation_begin();
-		if (tick_nohz_full_cpu(rdp->cpu) &&
-		    READ_ONCE(rdp->rcu_urgent_qs) &&
-		    !READ_ONCE(rdp->rcu_forced_tick)) {
-			// We get here only if we had already exited the
-			// extended quiescent state and this was an
-			// interrupt (not an NMI).  Therefore, (1) RCU is
-			// already watching and (2) The fact that we are in
-			// an interrupt handler and that the rcu_node lock
-			// is an irq-disabled lock prevents self-deadlock.
-			// So we can safely recheck under the lock.
-			raw_spin_lock_rcu_node(rdp->mynode);
-			if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
-				// A nohz_full CPU is in the kernel and RCU
-				// needs a quiescent state.  Turn on the tick!
-				WRITE_ONCE(rdp->rcu_forced_tick, true);
-				tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
-			}
-			raw_spin_unlock_rcu_node(rdp->mynode);
-		}
-		instrumentation_end();
 	}
 	instrumentation_begin();
 	trace_rcu_dyntick(incby == 1 ? TPS("End") : TPS("StillNonIdle"),

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20 19:16               ` Thomas Gleixner
  2020-05-20 23:21                 ` Andy Lutomirski
@ 2020-05-21  2:23                 ` Boris Ostrovsky
  2020-05-21  7:08                   ` Thomas Gleixner
  1 sibling, 1 reply; 159+ messages in thread
From: Boris Ostrovsky @ 2020-05-21  2:23 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Andrew Cooper, LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On 5/20/20 3:16 PM, Thomas Gleixner wrote:


> +__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
> +{
> +	struct pt_regs *old_regs;
> +	bool inhcall;
> +
> +	idtentry_enter(regs);
> +	old_regs = set_irq_regs(regs);
> +
> +	run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL, regs);


We need to handle the nested case (i.e. !irq_needs_irq_stack(), like in your
original version). Moving get_and_clear_inhcall() up should prevent
scheduling when this happens.


-boris


> +
> +	set_irq_regs(old_regs);
> +
> +	inhcall = get_and_clear_inhcall();
> +	__idtentry_exit(regs, inhcall);
> +	restore_inhcall(inhcall);
>  }


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-21  2:23                 ` Boris Ostrovsky
@ 2020-05-21  7:08                   ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-21  7:08 UTC (permalink / raw)
  To: Boris Ostrovsky, Andy Lutomirski
  Cc: Andrew Cooper, LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Boris Ostrovsky <boris.ostrovsky@oracle.com> writes:

> On 5/20/20 3:16 PM, Thomas Gleixner wrote:
>
>
>> +__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>> +{
>> +	struct pt_regs *old_regs;
>> +	bool inhcall;
>> +
>> +	idtentry_enter(regs);
>> +	old_regs = set_irq_regs(regs);
>> +
>> +	run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL, regs);
>
>
> We need to handle nested case (i.e. !irq_needs_irq_stack(), like in your
> original version). Moving get_and_clear_inhcall() up should prevent
> scheduling when this happens.

I locally changed run_on_irqstack() to do the magic checks and select the
right one.
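
Something along these lines, as a hedged sketch (the helper name and exact
signature are illustrative, not the code that was posted later):

static __always_inline void run_on_irqstack_cond(void (*func)(void *),
						 void *arg,
						 struct pt_regs *regs)
{
	/* Nested, i.e. already on the irq stack (or an exception stack) */
	if (!irq_needs_irq_stack(regs))
		func(arg);
	else
		run_on_irqstack(func, arg, regs);
}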

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-20 23:25                           ` Paul E. McKenney
@ 2020-05-21  8:31                             ` Thomas Gleixner
  2020-05-21 13:39                               ` Paul E. McKenney
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-21  8:31 UTC (permalink / raw)
  To: paulmck
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Wed, May 20, 2020 at 03:15:31PM -0700, Paul E. McKenney wrote:
> Same patch, but with updated commit log based on IRC discussion
> with Andy.

Fun. I came up with the same thing before going to bed. Just that I
named the function differently: rcu_irq_enter_check_tick()

>  #if defined(CONFIG_TINY_RCU)
>  
> +static inline void tickle_nohz_for_rcu(void)
> +{
> +}
> +
>  static inline void rcu_nmi_enter(void)
>  {
>  }
> @@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
>  }
>  
>  #else
> +extern void tickle_nohz_for_rcu(void);

And I made this a NOP for !NOHZ_FULL systems and avoided the call if
context tracking is not enabled at boot.

void __rcu_irq_enter_check_tick(void);

static inline void rcu_irq_enter_check_tick(void)
{
	if (context_tracking_enabled())
        	__rcu_irq_enter_check_tick();
}
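
Putting that together with Andy's earlier outline, a hedged sketch of how
the kernel mode branch of idtentry_enter() could then use the helper (the
wrapper name is illustrative, this is not the posted patch):

static __always_inline void idtentry_kernel_enter(void)
{
	if (!__rcu_is_watching()) {
		/* Interrupted an extended quiescent state (idle) */
		lockdep_hardirqs_off(CALLER_ADDR0);
		rcu_irq_enter();
		instrumentation_begin();
		trace_hardirqs_off_prepare();
		instrumentation_end();
	} else {
		/* RCU is watching: the lightweight check is sufficient */
		instrumentation_begin();
		trace_hardirqs_off();
		rcu_irq_enter_check_tick();
		instrumentation_end();
	}
}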

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY
  2020-05-20 23:21                 ` Andy Lutomirski
@ 2020-05-21 10:45                   ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-21 10:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Andrew Cooper, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Wed, May 20, 2020 at 12:17 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Andy Lutomirski <luto@kernel.org> writes:
>> > Andrew Cooper pointed out that there is too much magic in Xen for this
>> > to work.  So never mind.
>>
>> :)
>>
>> But you made me stare more at that stuff and I came up with a way
>> simpler solution. See below.
>
> I like it, but I bet it can be even simpler if you do the
> tickle_whatever_paulmck_call_it() change:
>
>> +__visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>> +{
>> +       struct pt_regs *old_regs;
>> +       bool inhcall;
>> +
>> +       idtentry_enter(regs);
>> +       old_regs = set_irq_regs(regs);
>> +
>> +       run_on_irqstack(__xen_pv_evtchn_do_upcall, NULL, regs);
>> +
>> +       set_irq_regs(old_regs);
>> +
>> +       inhcall = get_and_clear_inhcall();
>> +       __idtentry_exit(regs, inhcall);
>> +       restore_inhcall(inhcall);
>
> How about:
>
>        inhcall = get_and_clear_inhcall();
>        if (inhcall) {
>         if (!WARN_ON_ONCE((regs->flags & X86_EFLAGS_IF) || preempt_count()) {
>           local_irq_enable();
>           cond_resched();
>           local_irq_disable();

This really wants to use preempt_schedule_irq() as the above is racy
vs. need_resched().

>         }
>      }
>      restore_inhcall(inhcall);
>      idtentry_exit(regs);
>
> This could probably be tidied up by having a xen_maybe_preempt() that
> does the inhcall and resched mess.
>
> The point is that, with the tickle_nohz_ stuff, there is nothing
> actually preventing IRQ handlers from sleeping as long as they aren't
> on the IRQ stack and as long as the interrupted context was safe to
> sleep in.

You still lose the debug checks. I'm working on it ...
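
For illustration, a hedged sketch of an exit path built around
preempt_schedule_irq() (the function name is made up here and the debug
checks mentioned above are left out; this is not the code that was merged):

static noinstr void xen_pv_evtchn_exit(struct pt_regs *regs)
{
	bool inhcall = get_and_clear_inhcall();

	if (inhcall && !preempt_count() && need_resched()) {
		instrumentation_begin();
		/*
		 * Interrupts stay disabled; preempt_schedule_irq()
		 * re-evaluates need_resched() itself, so there is no
		 * enable/disable window in which a wakeup can be lost.
		 */
		preempt_schedule_irq();
		instrumentation_end();
	}
	restore_inhcall(inhcall);
	idtentry_exit(regs);
}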

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs
  2020-05-19 20:19   ` Andy Lutomirski
@ 2020-05-21 13:22     ` Thomas Gleixner
  2020-05-22 18:48       ` Boris Ostrovsky
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-21 13:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui,
	Peter Zijlstra (Intel)

Andy Lutomirski <luto@kernel.org> writes:
> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>  +       .align 8
>> +SYM_CODE_START(irq_entries_start)
>> +    vector=FIRST_EXTERNAL_VECTOR
>> +    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
>> +       UNWIND_HINT_IRET_REGS
>> +       .byte   0x6a, vector
>> +       jmp     common_interrupt
>> +       .align  8
>> +    vector=vector+1
>> +    .endr
>> +SYM_CODE_END(irq_entries_start)
>
> Having battled code like this in the past (for early exceptions), I
> prefer the variant like:
>
> pos = .;
> .rept blah blah blah
>   .byte whatever
>   jmp whatever
>   . = pos + 8;
>  vector = vector + 1
> .endr
>
> or maybe:
>
> .rept blah blah blah
>   .byte whatever
>   jmp whatever;
>   . = irq_entries_start + 8 * vector;
>   vector = vector + 1
> .endr
>
> The reason is that these variants will fail to assemble if something
> goes wrong and the code expands to more than 8 bytes, whereas using
> .align will cause gas to happily emit 16 bytes and result in
> hard-to-debug mayhem.

Yes. They just make objtool very unhappy:

arch/x86/entry/entry_64.o: warning: objtool: .entry.text+0xfd0: special:
can't find orig instruction

Peter suggested to use:

      .pos = .
      .byte..
      jmp
      .nops (pos + 8) - .

That works ...

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-21  8:31                             ` Thomas Gleixner
@ 2020-05-21 13:39                               ` Paul E. McKenney
  2020-05-21 18:41                                 ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-21 13:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Thu, May 21, 2020 at 10:31:11AM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Wed, May 20, 2020 at 03:15:31PM -0700, Paul E. McKenney wrote:
> > Same patch, but with updated commit log based on IRC discussion
> > with Andy.
> 
> Fun. I came up with the same thing before going to bed. Just that I
> named the function differently: rcu_irq_enter_check_tick()

I am good with that name.

> >  #if defined(CONFIG_TINY_RCU)
> >  
> > +static inline void tickle_nohz_for_rcu(void)
> > +{
> > +}
> > +
> >  static inline void rcu_nmi_enter(void)
> >  {
> >  }
> > @@ -23,6 +27,7 @@ static inline void rcu_nmi_exit(void)
> >  }
> >  
> >  #else
> > +extern void tickle_nohz_for_rcu(void);
> 
> And I made this a NOP for !NOHZ_FULL systems and avoided the call if
> context tracking is not enabled at boot.
> 
> void __rcu_irq_enter_check_tick(void);
> 
> static inline void rcu_irq_enter_check_tick(void)
> {
> 	if (context_tracking_enabled())
>         	__rcu_irq_enter_check_tick();
> }

That certainly is a better approach!

So let's go with your version.  But could you please post it, just in
case reviewing an alternative version causes me to spot something?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-21 13:39                               ` Paul E. McKenney
@ 2020-05-21 18:41                                 ` Thomas Gleixner
  2020-05-21 19:04                                   ` Paul E. McKenney
  0 siblings, 1 reply; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-21 18:41 UTC (permalink / raw)
  To: paulmck
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Thu, May 21, 2020 at 10:31:11AM +0200, Thomas Gleixner wrote:
>> And I made this a NOP for !NOHZ_FULL systems and avoided the call if
>> context tracking is not enabled at boot.
>> 
>> void __rcu_irq_enter_check_tick(void);
>> 
>> static inline void rcu_irq_enter_check_tick(void)
>> {
>> 	if (context_tracking_enabled())
>>         	__rcu_irq_enter_check_tick();
>> }
>
> That certainly is a better approach!
>
> So let's go with your version.  But could you please post it, just in
> case reviewing an alternative version causes me to spot something?

I'm just testing the complete rework of this series and will post it if
it survives smoke testing.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-21 18:41                                 ` Thomas Gleixner
@ 2020-05-21 19:04                                   ` Paul E. McKenney
  0 siblings, 0 replies; 159+ messages in thread
From: Paul E. McKenney @ 2020-05-21 19:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Boris Ostrovsky, Juergen Gross, Brian Gerst, Mathieu Desnoyers,
	Josh Poimboeuf, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Thu, May 21, 2020 at 08:41:11PM +0200, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> > On Thu, May 21, 2020 at 10:31:11AM +0200, Thomas Gleixner wrote:
> >> And I made this a NOP for !NOHZ_FULL systems and avoided the call if
> >> context tracking is not enabled at boot.
> >> 
> >> void __rcu_irq_enter_check_tick(void);
> >> 
> >> static inline void rcu_irq_enter_check_tick(void)
> >> {
> >> 	if (context_tracking_enabled())
> >>         	__rcu_irq_enter_check_tick();
> >> }
> >
> > That certainly is a better approach!
> >
> > So let's go with your version.  But could you please post it, just in
> > case reviewing an alternative version causes me to spot something?
> 
> I'm just testing the complete rework of this series and will post it if
> it survives smoke testing.

Fair enough!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs
  2020-05-21 13:22     ` Thomas Gleixner
@ 2020-05-22 18:48       ` Boris Ostrovsky
  2020-05-22 19:26         ` Josh Poimboeuf
  0 siblings, 1 reply; 159+ messages in thread
From: Boris Ostrovsky @ 2020-05-22 18:48 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: LKML, X86 ML, Paul E. McKenney, Alexandre Chartre,
	Frederic Weisbecker, Paolo Bonzini, Sean Christopherson,
	Masami Hiramatsu, Petr Mladek, Steven Rostedt, Joel Fernandes,
	Juergen Gross, Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf,
	Will Deacon, Tom Lendacky, Wei Liu, Michael Kelley,
	Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On 5/21/20 9:22 AM, Thomas Gleixner wrote:
> Andy Lutomirski <luto@kernel.org> writes:
>> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>>> +       .align 8
>>> +SYM_CODE_START(irq_entries_start)
>>> +    vector=FIRST_EXTERNAL_VECTOR
>>> +    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
>>> +       UNWIND_HINT_IRET_REGS
>>> +       .byte   0x6a, vector
>>> +       jmp     common_interrupt
>>> +       .align  8
>>> +    vector=vector+1
>>> +    .endr
>>> +SYM_CODE_END(irq_entries_start)
>> Having battled code like this in the past (for early exceptions), I
>> prefer the variant like:
>>
>> pos = .;
>> .rept blah blah blah
>>   .byte whatever
>>   jmp whatever
>>   . = pos + 8;
>>  vector = vector + 1
>> .endr
>>
>> or maybe:
>>
>> .rept blah blah blah
>>   .byte whatever
>>   jmp whatever;
>>   . = irq_entries_start + 8 * vector;
>>   vector = vector + 1
>> .endr
>>
>> The reason is that these variants will fail to assemble if something
>> goes wrong and the code expands to more than 8 bytes, whereas using
>> .align will cause gas to happily emit 16 bytes and result in
>> hard-to-debug mayhem.
> Yes. They just make objtool very unhappy:
>
> arch/x86/entry/entry_64.o: warning: objtool: .entry.text+0xfd0: special:
> can't find orig instruction
>
> Peter suggested to use:
>
>       .pos = .
>       .byte..
>       jmp
>       .nops (pos + 8) - .


Unfortunately this (.nops directive) only works for newer assemblers
(2.31, per
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob_plain;f=gas/NEWS;h=9a3f352108e439754688e19e63a6235b38801182;hb=5eb617a71463fa6810cd14f57adfe7a1efc93a96)


I have 2.27 and things don't go well.


-boris


>
> That works ...



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs
  2020-05-22 18:48       ` Boris Ostrovsky
@ 2020-05-22 19:26         ` Josh Poimboeuf
  2020-05-22 19:54           ` Thomas Gleixner
  0 siblings, 1 reply; 159+ messages in thread
From: Josh Poimboeuf @ 2020-05-22 19:26 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Thomas Gleixner, Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Juergen Gross, Brian Gerst,
	Mathieu Desnoyers, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

On Fri, May 22, 2020 at 02:48:53PM -0400, Boris Ostrovsky wrote:
> > Yes. They just make objtool very unhappy:
> >
> > arch/x86/entry/entry_64.o: warning: objtool: .entry.text+0xfd0: special:
> > can't find orig instruction
> >
> > Peter suggested to use:
> >
> >       .pos = .
> >       .byte..
> >       jmp
> >       .nops (pos + 8) - .
> 
> 
> Unfortunately this (.nops directive) only works for newer assemblers
> (2.31, per
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob_plain;f=gas/NEWS;h=9a3f352108e439754688e19e63a6235b38801182;hb=5eb617a71463fa6810cd14f57adfe7a1efc93a96)
> 
> 
> I have 2.27 and things don't go well.

A single nop should be fine, since gas will complain if it tries to move
the IP backwards.  (Also I'd vote for normal indentation instead of the
"assembler magic at 4 spaces" thing.)

.align 8
SYM_CODE_START(irq_entries_start)
	vector = FIRST_EXTERNAL_VECTOR
	.rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
		pos = .
		UNWIND_HINT_IRET_REGS
		.byte	0x6a, vector
		jmp	asm_common_interrupt
		nop
		.  = pos + 8
		vector = vector + 1
	.endr
SYM_CODE_END(irq_entries_start)

-- 
Josh


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs
  2020-05-22 19:26         ` Josh Poimboeuf
@ 2020-05-22 19:54           ` Thomas Gleixner
  0 siblings, 0 replies; 159+ messages in thread
From: Thomas Gleixner @ 2020-05-22 19:54 UTC (permalink / raw)
  To: Josh Poimboeuf, Boris Ostrovsky
  Cc: Andy Lutomirski, LKML, X86 ML, Paul E. McKenney,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Juergen Gross, Brian Gerst,
	Mathieu Desnoyers, Will Deacon, Tom Lendacky, Wei Liu,
	Michael Kelley, Jason Chen CJ, Zhao Yakui, Peter Zijlstra (Intel)

Josh Poimboeuf <jpoimboe@redhat.com> writes:

> On Fri, May 22, 2020 at 02:48:53PM -0400, Boris Ostrovsky wrote:
>> > Yes. They just make objtool very unhappy:
>> >
>> > arch/x86/entry/entry_64.o: warning: objtool: .entry.text+0xfd0: special:
>> > can't find orig instruction
>> >
>> > Peter suggested to use:
>> >
>> >       .pos = .
>> >       .byte..
>> >       jmp
>> >       .nops (pos + 8) - .
>> 
>> 
>> Unfortunately this (.nops directive) only works for newer assemblers
>> (2.31, per
>> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob_plain;f=gas/NEWS;h=9a3f352108e439754688e19e63a6235b38801182;hb=5eb617a71463fa6810cd14f57adfe7a1efc93a96)
>> 
>> 
>> I have 2.27 and things don't go well.
>
> A single nop should be fine, since gas will complain if it tries to move
> the IP backwards.

Yes. That's what I posted in the V9 thread :)

> (Also I'd vote for normal indentation instead of the "assembler magic
> at 4 spaces" thing.)

let me fix that

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-19  9:02       ` Peter Zijlstra
@ 2020-05-23  2:52         ` Lai Jiangshan
  2020-05-23 13:08           ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Lai Jiangshan @ 2020-05-23  2:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui

On Tue, May 19, 2020 at 5:04 PM Peter Zijlstra <peterz@infradead.org> wrote:

> +#ifdef CONFIG_DEBUG_ENTRY
>  /* Begin/end of an instrumentation safe region */
> -#define instrumentation_begin() ({                                             \
> +#define instrumentation_begin() ({                                     \
>         asm volatile("%c0:\n\t"                                         \
>                      ".pushsection .discard.instr_begin\n\t"            \
>                      ".long %c0b - .\n\t"                               \
>                      ".popsection\n\t" : : "i" (__COUNTER__));          \
>  })
>
> -#define instrumentation_end() ({                                                       \
> -       asm volatile("%c0:\n\t"                                         \
> +/*
> + * Because instrumentation_{begin,end}() can nest, objtool validation considers
> + * _begin() a +1 and _end() a -1 and computes a sum over the instructions.
> + * When the value is greater than 0, we consider instrumentation allowed.
> + *
> + * There is a problem with code like:
> + *
> + * noinstr void foo()
> + * {
> + *     instrumentation_begin();
> + *     ...
> + *     if (cond) {
> + *             instrumentation_begin();
> + *             ...
> + *             instrumentation_end();
> + *     }
> + *     bar();
> + *     instrumentation_end();
> + * }
> + *
> + * If instrumentation_end() would be an empty label, like all the other
> + * annotations, the inner _end(), which is at the end of a conditional block,
> + * would land on the instruction after the block.
> + *
> + * If we then consider the sum of the !cond path, we'll see that the call to
> + * bar() is with a 0-value, even though, we meant it to happen with a positive
> + * value.
> + *
> + * To avoid this, have _end() be a NOP instruction, this ensures it will be
> + * part of the condition block and does not escape.
> + */
> +#define instrumentation_end() ({                                       \
> +       asm volatile("%c0: nop\n\t"                                     \
>                      ".pushsection .discard.instr_end\n\t"              \
>                      ".long %c0b - .\n\t"                               \
>                      ".popsection\n\t" : : "i" (__COUNTER__));          \
>  })

Hello,

I don't know how objtool handles this, so I'm just curious.
_begin() and _end() are symmetrical, which means that if _end() (without
the nop) can escape, then _begin() can escape in the reverse way. For example:

noinstr void foo()
{
    instrumentation_begin();
    do {
            instrumentation_begin();
            ...
            instrumentation_end();
    } while (cond);
    bar();
    instrumentation_end();
}

Here, the first _begin() can be "dragged" into the do-while block.
Of course, objtool validation should not complain here.

But the fact that objtool validation does not complain means either that
it handles this correctly by some magic (by tracking how many _begin()s
have to be accounted at the jmp target depending on which path the jmp
is taken from), or that it simply does not check whether all paths reach
a jmp target with the same count (which makes me a little nervous), or
something else entirely.

Sorry for my curiosity.
Thanks
Lai.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-23  2:52         ` Lai Jiangshan
@ 2020-05-23 13:08           ` Peter Zijlstra
  2020-06-15 16:17             ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2020-05-23 13:08 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui

On Sat, May 23, 2020 at 10:52:24AM +0800, Lai Jiangshan wrote:

> Hello,
> 
> I don't know how objtool handles this, so I'm just curious.
> _begin() and _end() are symmetrical, which means that if _end() (without
> the nop) can escape, then _begin() can escape in the reverse way. For example:
> 
> noinstr void foo()
> {
>     instrumentation_begin();
>     do {
>             instrumentation_begin();
>             ...
>             instrumentation_end();
>     } while (cond);
>     bar();
>     instrumentation_end();
> }
> 
> Here, the first _begin() can be "dragged" into the do-while block.
> Of course, objtool validation should not complain here.
> 
> But the fact that objtool validation does not complain means either that
> it handles this correctly by some magic (by tracking how many _begin()s
> have to be accounted at the jmp target depending on which path the jmp
> is taken from), or that it simply does not check whether all paths reach
> a jmp target with the same count (which makes me a little nervous), or
> something else entirely.

No, I think you're right. It could be that we simply never hit this
particular problem. Even the one described, where _end() leaks out, is
quite rare. For instance, the last one I debugged (which led to this
patch) only showed itself with gcc-9, but not with gcc-8, for example.

Anyway, if we ever find the above, I'll add the NOP to begin too.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* [tip: x86/entry] x86/entry: Provide idtentry_entry/exit_cond_rcu()
  2020-05-15 23:45 ` [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu() Thomas Gleixner
  2020-05-19 17:08   ` Andy Lutomirski
@ 2020-05-27  8:12   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 159+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2020-05-27  8:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Ingo Molnar, Paul E. McKenney, Andy Lutomirski,
	x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     8fbf48a5cdb83a1ae4285920713facef72639641
Gitweb:        https://git.kernel.org/tip/8fbf48a5cdb83a1ae4285920713facef72639641
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 21 May 2020 22:05:17 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 26 May 2020 19:06:27 +02:00

x86/entry: Provide idtentry_entry/exit_cond_rcu()

After a lengthy discussion [1] it turned out that RCU does not need a full
rcu_irq_enter/exit() when RCU is already watching. All it needs if
NOHZ_FULL is active is to check whether the tick needs to be restarted.

This allows avoiding a separate variant for the pagefault handler, which
cannot invoke rcu_irq_enter() on a kernel pagefault that might sleep.

The cond_rcu argument is only temporary and will be removed once the
existing users of idtentry_enter/exit() have been cleaned up. After that
the code can be significantly simplified.

[ mingo: Simplified the control flow ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Paul E. McKenney" <paulmck@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: [1] https://lkml.kernel.org/r/20200515235125.628629605@linutronix.de
Link: https://lore.kernel.org/r/20200521202117.181397835@linutronix.de
---
 arch/x86/entry/common.c         | 79 +++++++++++++++++++++++++-------
 arch/x86/include/asm/idtentry.h | 14 +++++-
 2 files changed, 76 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 9ebe334..a7f5846 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -512,8 +512,10 @@ SYSCALL_DEFINE0(ni_syscall)
 }
 
 /**
- * idtentry_enter - Handle state tracking on idtentry
+ * idtentry_enter_cond_rcu - Handle state tracking on idtentry with conditional
+ *			     RCU handling
  * @regs:	Pointer to pt_regs of interrupted context
+ * @cond_rcu:	Invoke rcu_irq_enter() only if RCU is not watching
  *
  * Invokes:
  *  - lockdep irqflag state tracking as low level ASM entry disabled
@@ -521,40 +523,84 @@ SYSCALL_DEFINE0(ni_syscall)
  *
  *  - Context tracking if the exception hit user mode.
  *
- *  - RCU notification if the exception hit kernel mode.
- *
  *  - The hardirq tracer to keep the state consistent as low level ASM
  *    entry disabled interrupts.
+ *
+ * For kernel mode entries RCU handling is done conditional. If RCU is
+ * watching then the only RCU requirement is to check whether the tick has
+ * to be restarted. If RCU is not watching then rcu_irq_enter() has to be
+ * invoked on entry and rcu_irq_exit() on exit.
+ *
+ * Avoiding the rcu_irq_enter/exit() calls is an optimization but also
+ * solves the problem of kernel mode pagefaults which can schedule, which
+ * is not possible after invoking rcu_irq_enter() without undoing it.
+ *
+ * For user mode entries enter_from_user_mode() must be invoked to
+ * establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
+ * would not be possible.
+ *
+ * Returns: True if RCU has been adjusted on a kernel entry
+ *	    False otherwise
+ *
+ * The return value must be fed into the rcu_exit argument of
+ * idtentry_exit_cond_rcu().
  */
-void noinstr idtentry_enter(struct pt_regs *regs)
+bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs, bool cond_rcu)
 {
 	if (user_mode(regs)) {
 		enter_from_user_mode();
-	} else {
+		return false;
+	}
+
+	if (!cond_rcu || !__rcu_is_watching()) {
+		/*
+		 * If RCU is not watching then the same careful
+		 * sequence vs. lockdep and tracing is required
+		 * as in enter_from_user_mode().
+		 *
+		 * This only happens for IRQs that hit the idle
+		 * loop, i.e. if idle is not using MWAIT.
+		 */
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
 		instrumentation_begin();
 		trace_hardirqs_off_prepare();
 		instrumentation_end();
+
+		return true;
 	}
+
+	/*
+	 * If RCU is watching then RCU only wants to check
+	 * whether it needs to restart the tick in NOHZ
+	 * mode.
+	 */
+	instrumentation_begin();
+	rcu_irq_enter_check_tick();
+	/* Use the combo lockdep/tracing function */
+	trace_hardirqs_off();
+	instrumentation_end();
+
+	return false;
 }
 
 /**
- * idtentry_exit - Common code to handle return from exceptions
+ * idtentry_exit_cond_rcu - Handle return from exception with conditional RCU
+ *			    handling
  * @regs:	Pointer to pt_regs (exception entry regs)
+ * @rcu_exit:	Invoke rcu_irq_exit() if true
  *
  * Depending on the return target (kernel/user) this runs the necessary
- * preemption and work checks if possible and required and returns to
+ * preemption and work checks if possible and reguired and returns to
  * the caller with interrupts disabled and no further work pending.
  *
  * This is the last action before returning to the low level ASM code which
  * just needs to return to the appropriate context.
  *
- * Invoked by all exception/interrupt IDTENTRY handlers which are not
- * returning through the paranoid exit path (all except NMI, #DF and the IST
- * variants of #MC and #DB) and are therefore on the thread stack.
+ * Counterpart to idtentry_enter_cond_rcu(). The return value of the entry
+ * function must be fed into the @rcu_exit argument.
  */
-void noinstr idtentry_exit(struct pt_regs *regs)
+void noinstr idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -580,7 +626,8 @@ void noinstr idtentry_exit(struct pt_regs *regs)
 				if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
 					WARN_ON_ONCE(!on_thread_stack());
 				instrumentation_begin();
-				rcu_irq_exit_preempt();
+				if (rcu_exit)
+					rcu_irq_exit_preempt();
 				if (need_resched())
 					preempt_schedule_irq();
 				/* Covers both tracing and lockdep */
@@ -602,10 +649,12 @@ void noinstr idtentry_exit(struct pt_regs *regs)
 		trace_hardirqs_on_prepare();
 		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
 		instrumentation_end();
-		rcu_irq_exit();
+		if (rcu_exit)
+			rcu_irq_exit();
 		lockdep_hardirqs_on(CALLER_ADDR0);
 	} else {
-		/* IRQ flags state is correct already. Just tell RCU */
-		rcu_irq_exit();
+		/* IRQ flags state is correct already. Just tell RCU. */
+		if (rcu_exit)
+			rcu_irq_exit();
 	}
 }
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index ce97478..a116b80 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -7,8 +7,18 @@
 
 #ifndef __ASSEMBLY__
 
-void idtentry_enter(struct pt_regs *regs);
-void idtentry_exit(struct pt_regs *regs);
+bool idtentry_enter_cond_rcu(struct pt_regs *regs, bool cond_rcu);
+void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit);
+
+static __always_inline void idtentry_enter(struct pt_regs *regs)
+{
+	idtentry_enter_cond_rcu(regs, false);
+}
+
+static __always_inline void idtentry_exit(struct pt_regs *regs)
+{
+	idtentry_exit_cond_rcu(regs, true);
+}
 
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
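
As a usage illustration only (hypothetical handler, not part of this
commit): a handler which opts into the conditional RCU handling feeds
the return value of the entry function into the exit function, roughly:

__visible noinstr void example_cond_rcu_handler(struct pt_regs *regs)
{
        bool rcu_exit = idtentry_enter_cond_rcu(regs, true);

        instrumentation_begin();
        /* Instrumentable handler work goes here */
        instrumentation_end();

        idtentry_exit_cond_rcu(regs, rcu_exit);
}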

^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [patch V6 00/37] x86/entry: Rework leftovers and merge plan
  2020-05-23 13:08           ` Peter Zijlstra
@ 2020-06-15 16:17             ` Peter Zijlstra
  0 siblings, 0 replies; 159+ messages in thread
From: Peter Zijlstra @ 2020-06-15 16:17 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Thomas Gleixner, LKML, x86, Paul E. McKenney, Andy Lutomirski,
	Alexandre Chartre, Frederic Weisbecker, Paolo Bonzini,
	Sean Christopherson, Masami Hiramatsu, Petr Mladek,
	Steven Rostedt, Joel Fernandes, Boris Ostrovsky, Juergen Gross,
	Brian Gerst, Mathieu Desnoyers, Josh Poimboeuf, Will Deacon,
	Tom Lendacky, Wei Liu, Michael Kelley, Jason Chen CJ, Zhao Yakui

On Sat, May 23, 2020 at 03:08:36PM +0200, Peter Zijlstra wrote:
> On Sat, May 23, 2020 at 10:52:24AM +0800, Lai Jiangshan wrote:
> 
> > Hello,
> > 
> > I don't know how objtool handles this, so I'm just curious.
> > _begin() and _end() are symmetrical, which means that if _end() (without
> > the nop) can escape, then _begin() can escape in the reverse way. For example:
> > 
> > noinstr void foo()
> > {
> >     instrumentation_begin();
> >     do {
> >             instrumentation_begin();
> >             ...
> >             instrumentation_end();
> >     } while (cond);
> >     bar();
> >     instrumentation_end();
> > }
> > 
> > Here, the first _begin() can be "dragged" into the do-while block.
> > Of course, objtool validation should not complain here.
> > 
> > But the fact that objtool validation does not complain means either that
> > it handles this correctly by some magic (by tracking how many _begin()s
> > have to be accounted at the jmp target depending on which path the jmp
> > is taken from), or that it simply does not check whether all paths reach
> > a jmp target with the same count (which makes me a little nervous), or
> > something else entirely.
> 
> No, I think you're right. It could be that we simply never hit this
> particular problem. Even the one described, where _end() leaks out, is
> quite rare. For instance, the last one I debugged (which led to this
> patch) only showed itself with gcc-9, but not with gcc-8, for example.
> 
> Anyway, if we ever find the above, I'll add the NOP to begin too.

FYI, I just found one, I'll be making instrumentation_begin() a NOP
too.
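
Concretely, that just mirrors the _end() side, i.e. something along
these lines (untested sketch, not the final patch):

#define instrumentation_begin() ({                                      \
        asm volatile("%c0: nop\n\t"                                     \
                     ".pushsection .discard.instr_begin\n\t"            \
                     ".long %c0b - .\n\t"                               \
                     ".popsection\n\t" : : "i" (__COUNTER__));          \
})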

^ permalink raw reply	[flat|nested] 159+ messages in thread

end of thread

Thread overview: 159+ messages
2020-05-15 23:45 [patch V6 00/37] x86/entry: Rework leftovers and merge plan Thomas Gleixner
2020-05-15 23:45 ` [patch V6 01/37] tracing/hwlat: Use ktime_get_mono_fast_ns() Thomas Gleixner
2020-05-19 21:26   ` Steven Rostedt
2020-05-19 21:45     ` Thomas Gleixner
2020-05-19 22:18       ` Steven Rostedt
2020-05-20 19:51         ` Thomas Gleixner
2020-05-20 20:14   ` Peter Zijlstra
2020-05-20 22:20     ` Thomas Gleixner
2020-05-15 23:45 ` [patch V6 02/37] tracing/hwlat: Split ftrace_nmi_enter/exit() Thomas Gleixner
2020-05-19 22:23   ` Steven Rostedt
2020-05-15 23:45 ` [patch V6 03/37] nmi, tracing: Provide nmi_enter/exit_notrace() Thomas Gleixner
2020-05-17  5:12   ` Andy Lutomirski
2020-05-19 22:24   ` Steven Rostedt
2020-05-15 23:45 ` [patch V6 04/37] x86: Make hardware latency tracing explicit Thomas Gleixner
2020-05-17  5:36   ` Andy Lutomirski
2020-05-17  8:48     ` Thomas Gleixner
2020-05-18  5:50       ` Andy Lutomirski
2020-05-18  8:03         ` Thomas Gleixner
2020-05-18 20:42           ` Andy Lutomirski
2020-05-18  8:01   ` Peter Zijlstra
2020-05-18  8:05     ` Thomas Gleixner
2020-05-18  8:08       ` Peter Zijlstra
2020-05-20 20:09         ` Thomas Gleixner
2020-05-20 20:14           ` Andy Lutomirski
2020-05-20 22:20             ` Thomas Gleixner
2020-05-15 23:45 ` [patch V6 05/37] genirq: Provide irq_enter/exit_rcu() Thomas Gleixner
2020-05-18 23:06   ` Andy Lutomirski
2020-05-15 23:45 ` [patch V6 06/37] genirq: Provde __irq_enter/exit_raw() Thomas Gleixner
2020-05-18 23:07   ` Andy Lutomirski
2020-05-15 23:45 ` [patch V6 07/37] x86/entry: Provide helpers for execute on irqstack Thomas Gleixner
2020-05-18 23:11   ` Andy Lutomirski
2020-05-18 23:46     ` Andy Lutomirski
2020-05-18 23:53       ` Thomas Gleixner
2020-05-18 23:56         ` Andy Lutomirski
2020-05-20 12:35           ` Thomas Gleixner
2020-05-20 15:09             ` Andy Lutomirski
2020-05-20 15:27               ` Thomas Gleixner
2020-05-20 15:36                 ` Andy Lutomirski
2020-05-18 23:51     ` Thomas Gleixner
2020-05-15 23:45 ` [patch V6 08/37] x86/entry/64: Move do_softirq_own_stack() to C Thomas Gleixner
2020-05-18 23:48   ` Andy Lutomirski
2020-05-15 23:45 ` [patch V6 09/37] x86/entry: Split idtentry_enter/exit() Thomas Gleixner
2020-05-18 23:49   ` Andy Lutomirski
2020-05-19  8:25     ` Thomas Gleixner
2020-05-15 23:45 ` [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY Thomas Gleixner
2020-05-19 17:06   ` Andy Lutomirski
2020-05-19 18:57     ` Thomas Gleixner
2020-05-19 19:44       ` Andy Lutomirski
2020-05-20  8:06         ` Jürgen Groß
2020-05-20 11:31           ` Andrew Cooper
2020-05-20 14:13         ` Thomas Gleixner
2020-05-20 15:16           ` Andy Lutomirski
2020-05-20 17:22             ` Andy Lutomirski
2020-05-20 19:16               ` Thomas Gleixner
2020-05-20 23:21                 ` Andy Lutomirski
2020-05-21 10:45                   ` Thomas Gleixner
2020-05-21  2:23                 ` Boris Ostrovsky
2020-05-21  7:08                   ` Thomas Gleixner
2020-05-15 23:45 ` [patch V6 11/37] x86/entry/64: Simplify idtentry_body Thomas Gleixner
2020-05-19 17:06   ` Andy Lutomirski
2020-05-15 23:45 ` [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu() Thomas Gleixner
2020-05-19 17:08   ` Andy Lutomirski
2020-05-19 19:00     ` Thomas Gleixner
2020-05-19 20:20       ` Thomas Gleixner
2020-05-19 20:24         ` Andy Lutomirski
2020-05-19 21:20           ` Thomas Gleixner
2020-05-20  0:26             ` Andy Lutomirski
2020-05-20  2:23               ` Paul E. McKenney
2020-05-20 15:36                 ` Andy Lutomirski
2020-05-20 16:51                   ` Andy Lutomirski
2020-05-20 18:05                     ` Paul E. McKenney
2020-05-20 19:49                       ` Thomas Gleixner
2020-05-20 22:15                         ` Paul E. McKenney
2020-05-20 23:25                           ` Paul E. McKenney
2020-05-21  8:31                             ` Thomas Gleixner
2020-05-21 13:39                               ` Paul E. McKenney
2020-05-21 18:41                                 ` Thomas Gleixner
2020-05-21 19:04                                   ` Paul E. McKenney
2020-05-20 18:32                     ` Thomas Gleixner
2020-05-20 19:24                     ` Thomas Gleixner
2020-05-20 19:42                       ` Paul E. McKenney
2020-05-20 17:38                   ` Paul E. McKenney
2020-05-20 17:47                     ` Andy Lutomirski
2020-05-20 18:11                       ` Paul E. McKenney
2020-05-20 14:19               ` Thomas Gleixner
2020-05-27  8:12   ` [tip: x86/entry] " tip-bot2 for Thomas Gleixner
2020-05-15 23:46 ` [patch V6 13/37] x86/entry: Switch page fault exception to IDTENTRY_RAW Thomas Gleixner
2020-05-19 20:12   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 14/37] x86/entry: Remove the transition leftovers Thomas Gleixner
2020-05-19 20:13   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 15/37] x86/entry: Change exit path of xen_failsafe_callback Thomas Gleixner
2020-05-19 20:14   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 16/37] x86/entry/64: Remove error_exit Thomas Gleixner
2020-05-19 20:14   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 17/37] x86/entry/32: Remove common_exception Thomas Gleixner
2020-05-19 20:14   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 18/37] x86/irq: Use generic irq_regs implementation Thomas Gleixner
2020-05-15 23:46 ` [patch V6 19/37] x86/irq: Convey vector as argument and not in ptregs Thomas Gleixner
2020-05-19 20:19   ` Andy Lutomirski
2020-05-21 13:22     ` Thomas Gleixner
2020-05-22 18:48       ` Boris Ostrovsky
2020-05-22 19:26         ` Josh Poimboeuf
2020-05-22 19:54           ` Thomas Gleixner
2020-05-15 23:46 ` [patch V6 20/37] x86/irq/64: Provide handle_irq() Thomas Gleixner
2020-05-19 20:21   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 21/37] x86/entry: Add IRQENTRY_IRQ macro Thomas Gleixner
2020-05-19 20:27   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 22/37] x86/entry: Use idtentry for interrupts Thomas Gleixner
2020-05-19 20:28   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 23/37] x86/entry: Provide IDTENTRY_SYSVEC Thomas Gleixner
2020-05-20  0:29   ` Andy Lutomirski
2020-05-20 15:07     ` Thomas Gleixner
2020-05-15 23:46 ` [patch V6 24/37] x86/entry: Convert APIC interrupts to IDTENTRY_SYSVEC Thomas Gleixner
2020-05-20  0:27   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 25/37] x86/entry: Convert SMP system vectors " Thomas Gleixner
2020-05-20  0:28   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 26/37] x86/entry: Convert various system vectors Thomas Gleixner
2020-05-20  0:30   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 27/37] x86/entry: Convert KVM vectors to IDTENTRY_SYSVEC Thomas Gleixner
2020-05-20  0:30   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 28/37] x86/entry: Convert various hypervisor " Thomas Gleixner
2020-05-20  0:31   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 29/37] x86/entry: Convert XEN hypercall vector " Thomas Gleixner
2020-05-20  0:31   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 30/37] x86/entry: Convert reschedule interrupt to IDTENTRY_RAW Thomas Gleixner
2020-05-19 23:57   ` Andy Lutomirski
2020-05-20 15:08     ` Thomas Gleixner
2020-05-15 23:46 ` [patch V6 31/37] x86/entry: Remove the apic/BUILD interrupt leftovers Thomas Gleixner
2020-05-20  0:32   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 32/37] x86/entry/64: Remove IRQ stack switching ASM Thomas Gleixner
2020-05-20  0:33   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 33/37] x86/entry: Make enter_from_user_mode() static Thomas Gleixner
2020-05-20  0:34   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 34/37] x86/entry/32: Remove redundant irq disable code Thomas Gleixner
2020-05-20  0:35   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 35/37] x86/entry/64: Remove TRACE_IRQS_*_DEBUG Thomas Gleixner
2020-05-20  0:46   ` Andy Lutomirski
2020-05-15 23:46 ` [patch V6 36/37] x86/entry: Move paranoid irq tracing out of ASM code Thomas Gleixner
2020-05-20  0:53   ` Andy Lutomirski
2020-05-20 15:16     ` Thomas Gleixner
2020-05-20 17:13       ` Andy Lutomirski
2020-05-20 18:33         ` Thomas Gleixner
2020-05-15 23:46 ` [patch V6 37/37] x86/entry: Remove the TRACE_IRQS cruft Thomas Gleixner
2020-05-18 23:07   ` Andy Lutomirski
2020-05-16 17:18 ` [patch V6 00/37] x86/entry: Rework leftovers and merge plan Paul E. McKenney
2020-05-19 12:28   ` Joel Fernandes
2020-05-18 16:07 ` Peter Zijlstra
2020-05-18 18:53   ` Thomas Gleixner
2020-05-19  8:29     ` Peter Zijlstra
2020-05-18 20:24   ` Thomas Gleixner
2020-05-19  8:38     ` Peter Zijlstra
2020-05-19  9:02       ` Peter Zijlstra
2020-05-23  2:52         ` Lai Jiangshan
2020-05-23 13:08           ` Peter Zijlstra
2020-06-15 16:17             ` Peter Zijlstra
2020-05-19  9:06       ` Thomas Gleixner
2020-05-19 18:37 ` Steven Rostedt
2020-05-19 19:09   ` Thomas Gleixner
2020-05-19 19:13     ` Steven Rostedt
