kvm.vger.kernel.org archive mirror
* [PATCH v5 00/34] x86: enable FRED for x86-64
@ 2023-03-07  2:39 Xin Li
  2023-03-07  2:39 ` [PATCH v5 01/34] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR Xin Li
                   ` (34 more replies)
  0 siblings, 35 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

This patch set enables FRED for x86-64.

The Intel flexible return and event delivery (FRED) architecture defines simple
new transitions that change privilege level (ring transitions). The FRED
architecture was designed with the following goals:
1) Improve overall performance and response time by replacing event delivery
through the interrupt descriptor table (IDT event delivery) and event return by
the IRET instruction with lower latency transitions.
2) Improve software robustness by ensuring that event delivery establishes the
full supervisor context and that event return establishes the full user context.

The new transitions defined by the FRED architecture are FRED event delivery and,
for returning from events, two FRED return instructions. FRED event delivery can
effect a transition from ring 3 to ring 0, but it is used also to deliver events
incident to ring 0. One FRED instruction (ERETU) effects a return from ring 0 to
ring 3, while the other (ERETS) returns while remaining in ring 0.

Search for the latest FRED spec in most search engines with this search pattern:

  site:intel.com FRED (flexible return and event delivery) specification

As of now there is no publicly available CPU supporting FRED, thus the Intel
Simics® Simulator is used as the software development and testing vehicle. It
can be downloaded from:
  https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html

To enable FRED, the Simics package 8112 QSP-CPU needs to be installed with CPU
model configured as:
	$cpu_comp_class = "x86-experimental-fred"

Longer term, we should refactor common code shared by FRED and IDT into common
shared files, and contain IDT code using a new config CONFIG_X86_IDT.

Changes since v4:
* Rebased against v6.3-rc1.
* Do NOT use the term "injection", which in the KVM context means to
  reinject an event into the guest (Sean Christopherson).
* Add the explanation of why to execute "int $2" to invoke the NMI handler
  in NMI caused VM exits (Sean Christopherson).
* Use cs/ss instead of csx/ssx when initializing the pt_regs structure
  for calling external_interrupt(), otherwise it breaks i386 build.

Changes since v3:
* Call external_interrupt() to handle IRQ in IRQ caused VM exits.
* Execute "int $2" to handle NMI in NMI caused VM exits.
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
  (Andrew Cooper).

Changes since v2:
* Improve comments for changes in arch/x86/include/asm/idtentry.h.

Changes since v1:
* call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
  handler (Peter Zijlstra).
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
  if no FRED handler defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
  down into individual FRED exception handlers, instead of in the dispatch
  framework (Peter Zijlstra).


H. Peter Anvin (Intel) (24):
  x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR
  x86/traps: add a system interrupt table for system interrupt dispatch
  x86/traps: add external_interrupt() to dispatch external interrupts
  x86/cpufeature: add the cpu feature bit for FRED
  x86/opcode: add ERETU, ERETS instructions to x86-opcode-map
  x86/objtool: teach objtool about ERETU and ERETS
  x86/cpu: add X86_CR4_FRED macro
  x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED)
  x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support
  x86/cpu: add MSR numbers for FRED configuration
  x86/fred: header file with FRED definitions
  x86/fred: make unions for the cs and ss fields in struct pt_regs
  x86/fred: reserve space for the FRED stack frame
  x86/fred: add a page fault entry stub for FRED
  x86/fred: add a debug fault entry stub for FRED
  x86/fred: add a NMI entry stub for FRED
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: FRED initialization code
  x86/fred: update MSR_IA32_FRED_RSP0 during task switch
  x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is
    enabled
  x86/fred: disallow the swapgs instruction when FRED is enabled
  x86/fred: no ESPFIX needed when FRED is enabled
  x86/fred: allow single-step trap and NMI when starting a new thread
  x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f

Xin Li (10):
  x86/traps: add install_system_interrupt_handler()
  x86/traps: export external_interrupt() for VMX IRQ reinjection
  x86/fred: header file for event types
  x86/fred: add a machine check entry stub for FRED
  x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/ia32: do not modify the DPL bits for a null selector
  x86/fred: allow dynamic stack frame size
  x86/fred: disable FRED by default in its early stage
  KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM
    exits
  KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits
    when FRED is enabled

 .../admin-guide/kernel-parameters.txt         |   4 +
 arch/x86/Kconfig                              |   9 +
 arch/x86/entry/Makefile                       |   5 +-
 arch/x86/entry/entry_32.S                     |   2 +-
 arch/x86/entry/entry_64.S                     |   5 +
 arch/x86/entry/entry_64_fred.S                |  59 +++++
 arch/x86/entry/entry_fred.c                   | 234 ++++++++++++++++++
 arch/x86/entry/vsyscall/vsyscall_64.c         |   2 +-
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/entry-common.h           |   3 +
 arch/x86/include/asm/event-type.h             |  17 ++
 arch/x86/include/asm/extable_fixup_types.h    |   4 +-
 arch/x86/include/asm/fred.h                   | 131 ++++++++++
 arch/x86/include/asm/idtentry.h               |  76 +++++-
 arch/x86/include/asm/irq.h                    |   5 +
 arch/x86/include/asm/irq_vectors.h            |  15 +-
 arch/x86/include/asm/msr-index.h              |  13 +-
 arch/x86/include/asm/processor.h              |  12 +-
 arch/x86/include/asm/ptrace.h                 |  36 ++-
 arch/x86/include/asm/switch_to.h              |  10 +-
 arch/x86/include/asm/thread_info.h            |  35 +--
 arch/x86/include/asm/traps.h                  |  13 +
 arch/x86/include/asm/vmx.h                    |  17 +-
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/apic/apic.c                   |  11 +-
 arch/x86/kernel/apic/vector.c                 |   8 +-
 arch/x86/kernel/cpu/acrn.c                    |   7 +-
 arch/x86/kernel/cpu/common.c                  |  88 ++++---
 arch/x86/kernel/cpu/mce/core.c                |  11 +
 arch/x86/kernel/cpu/mshyperv.c                |  22 +-
 arch/x86/kernel/espfix_64.c                   |   8 +
 arch/x86/kernel/fred.c                        |  73 ++++++
 arch/x86/kernel/head_32.S                     |   3 +-
 arch/x86/kernel/idt.c                         |   6 +-
 arch/x86/kernel/irq.c                         |   6 +-
 arch/x86/kernel/irqinit.c                     |   7 +-
 arch/x86/kernel/kvm.c                         |   4 +-
 arch/x86/kernel/nmi.c                         |  28 +++
 arch/x86/kernel/process.c                     |   5 +
 arch/x86/kernel/process_64.c                  |  21 +-
 arch/x86/kernel/signal_32.c                   |  21 +-
 arch/x86/kernel/traps.c                       | 175 +++++++++++--
 arch/x86/kvm/vmx/vmx.c                        |  33 ++-
 arch/x86/lib/x86-opcode-map.txt               |   2 +-
 arch/x86/mm/extable.c                         |  28 +++
 arch/x86/mm/fault.c                           |  20 +-
 drivers/xen/events/events_base.c              |   5 +-
 kernel/fork.c                                 |   6 +
 tools/arch/x86/include/asm/cpufeatures.h      |   1 +
 .../arch/x86/include/asm/disabled-features.h  |   8 +-
 tools/arch/x86/include/asm/msr-index.h        |  13 +-
 tools/arch/x86/lib/x86-opcode-map.txt         |   2 +-
 tools/objtool/arch/x86/decode.c               |  19 +-
 55 files changed, 1185 insertions(+), 175 deletions(-)
 create mode 100644 arch/x86/entry/entry_64_fred.S
 create mode 100644 arch/x86/entry/entry_fred.c
 create mode 100644 arch/x86/include/asm/event-type.h
 create mode 100644 arch/x86/include/asm/fred.h
 create mode 100644 arch/x86/kernel/fred.c

-- 
2.34.1



* [PATCH v5 01/34] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 02/34] x86/traps: add a system interrupt table for system interrupt dispatch Xin Li
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so push it into
common_interrupt(), just before the spurious interrupt handling.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/irq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..7e125fff45ab 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -248,6 +248,10 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
+#ifdef CONFIG_SMP
+	} else if (vector == IRQ_MOVE_CLEANUP_VECTOR) {
+		sysvec_irq_move_cleanup(regs);
+#endif
 	} else {
 		ack_APIC_irq();
 
-- 
2.34.1



* [PATCH v5 02/34] x86/traps: add a system interrupt table for system interrupt dispatch
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
  2023-03-07  2:39 ` [PATCH v5 01/34] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 03/34] x86/traps: add install_system_interrupt_handler() Xin Li
                   ` (32 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Upon receiving an external interrupt, KVM VMX reinjects it through
calling the interrupt handler in its IDT descriptor on the current
kernel stack, which essentially uses the IDT as an interrupt dispatch
table.

However, the IDT is one of the lowest level critical data structures
between an x86 CPU and the Linux kernel, so we should avoid using it
*directly* whenever possible, especially in a software defined manner.

On x86, external interrupts are divided into the following groups:
  1) system interrupts
  2) external device interrupts
With the IDT, system interrupts are dispatched through the IDT
directly, while external device interrupts are all routed to the
external interrupt dispatch function common_interrupt(), which
dispatches external device interrupts through a per-CPU external
interrupt dispatch table vector_irq.

To eliminate dispatching external interrupts through the IDT, add
a system interrupt handler table for dispatching a system interrupt
to its corresponding handler directly. Thus a software based dispatch
function will be:

  void external_interrupt(struct pt_regs *regs, u8 vector)
  {
    if (is_system_interrupt(vector))
      system_interrupt_handlers[vector_to_sysvec(vector)](regs);
    else /* external device interrupt */
      common_interrupt(regs, vector);
  }

What's more, with the Intel FRED (Flexible Return and Event Delivery)
architecture, the IDT, the hardware based event dispatch table, is gone,
and the Linux kernel needs to dispatch events to their handlers with
vector to handler mappings. The dispatch function external_interrupt()
is therefore needed for FRED as well.
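
For readers who want to try the idea outside the kernel, the table based
dispatch described above can be modelled in user space. The following is
only a sketch under stated assumptions: the vector values, struct pt_regs
layout, and every demo_* name are made up for illustration and are not the
kernel's definitions. Only the SYSV() designated-initializer idiom is taken
from this patch.

```c
#include <assert.h>

/* Illustrative stand-ins, not the kernel's definitions. */
#define NR_VECTORS              256
#define FIRST_SYSTEM_VECTOR     0xec
#define NR_SYSTEM_VECTORS       (NR_VECTORS - FIRST_SYSTEM_VECTOR)
#define DEMO_TIMER_VECTOR       0xec	/* hypothetical vector number */
#define DEMO_RESCHEDULE_VECTOR  0xfd	/* hypothetical vector number */

struct pt_regs { unsigned long ip; };	/* placeholder register frame */

typedef void (*system_interrupt_handler)(struct pt_regs *regs);

/* Records which handler ran, so the routing can be observed. */
enum demo_path { PATH_NONE, PATH_TIMER, PATH_RESCHEDULE, PATH_COMMON };
enum demo_path last_path = PATH_NONE;

static void demo_timer(struct pt_regs *regs)
{
	(void)regs;
	last_path = PATH_TIMER;
}

static void demo_reschedule(struct pt_regs *regs)
{
	(void)regs;
	last_path = PATH_RESCHEDULE;
}

/* Stands in for the device interrupt path (common_interrupt()). */
static void demo_common(struct pt_regs *regs, unsigned char vector)
{
	(void)regs;
	(void)vector;
	last_path = PATH_COMMON;
}

/* Same designated-initializer idiom as the SYSV() macro in the patch. */
#define SYSV(x, y) [(x) - FIRST_SYSTEM_VECTOR] = y
static const system_interrupt_handler demo_handlers[NR_SYSTEM_VECTORS] = {
	SYSV(DEMO_TIMER_VECTOR,      demo_timer),
	SYSV(DEMO_RESCHEDULE_VECTOR, demo_reschedule),
};
#undef SYSV

/* Map a system vector to its index in the handler table. */
static unsigned int vector_to_sysvec(unsigned char vector)
{
	assert(vector >= FIRST_SYSTEM_VECTOR);
	return vector - FIRST_SYSTEM_VECTOR;
}

/* Software dispatch: system vectors use the table, the rest go common. */
void demo_dispatch(struct pt_regs *regs, unsigned char vector)
{
	if (vector >= FIRST_SYSTEM_VECTOR) {
		system_interrupt_handler h =
			demo_handlers[vector_to_sysvec(vector)];

		/* Empty slots are simply ignored in this sketch. */
		if (h)
			h(regs);
	} else {
		demo_common(regs, vector);
	}
}
```

Calling demo_dispatch() with DEMO_RESCHEDULE_VECTOR runs demo_reschedule(),
while a vector below FIRST_SYSTEM_VECTOR falls through to demo_common().
Slots without an initializer are NULL, which is why the lookup is checked
before the call.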

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/idtentry.h | 64 +++++++++++++++++++++++++++------
 arch/x86/include/asm/traps.h    |  7 ++++
 arch/x86/kernel/traps.c         | 40 +++++++++++++++++++++
 3 files changed, 100 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b241af4ce9b4..2876ddae02bc 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -167,17 +167,22 @@ __visible noinstr void func(struct pt_regs *regs, unsigned long error_code)
 
 /**
  * DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry
- *			  points (common/spurious)
+ *			  points (common/spurious) and their corresponding
+ *			  software based dispatch handlers in the non-noinstr
+ *			  text section
  * @vector:	Vector number (ignored for C)
  * @func:	Function name of the entry point
  *
- * Maps to DECLARE_IDTENTRY_ERRORCODE()
+ * Maps to DECLARE_IDTENTRY_ERRORCODE(), plus a dispatch function prototype
  */
 #define DECLARE_IDTENTRY_IRQ(vector, func)				\
-	DECLARE_IDTENTRY_ERRORCODE(vector, func)
+	DECLARE_IDTENTRY_ERRORCODE(vector, func);			\
+	void dispatch_##func(struct pt_regs *regs, unsigned long error_code)
 
 /**
  * DEFINE_IDTENTRY_IRQ - Emit code for device interrupt IDT entry points
+ *			 and their corresponding software based dispatch
+ *			 handlers in the non-noinstr text section
  * @func:	Function name of the entry point
  *
  * The vector number is pushed by the low level entry stub and handed
@@ -187,6 +192,9 @@ __visible noinstr void func(struct pt_regs *regs, unsigned long error_code)
  * irq_enter/exit_rcu() are invoked before the function body and the
  * KVM L1D flush request is set. Stack switching to the interrupt stack
  * has to be done in the function body if necessary.
+ *
+ * dispatch_func() is a software based dispatch handler in the non-noinstr
+ * text section.
  */
 #define DEFINE_IDTENTRY_IRQ(func)					\
 static void __##func(struct pt_regs *regs, u32 vector);			\
@@ -204,31 +212,48 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	irqentry_exit(regs, state);					\
 }									\
 									\
+void dispatch_##func(struct pt_regs *regs, unsigned long error_code)	\
+{									\
+	u32 vector = (u32)(u8)error_code;				\
+									\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	run_irq_on_irqstack_cond(__##func, regs, vector);		\
+}									\
+									\
 static noinline void __##func(struct pt_regs *regs, u32 vector)
 
 /**
  * DECLARE_IDTENTRY_SYSVEC - Declare functions for system vector entry points
+ *			     and their corresponding software based dispatch
+ *			     handlers in the non-noinstr text section
  * @vector:	Vector number (ignored for C)
  * @func:	Function name of the entry point
  *
- * Declares three functions:
+ * Declares four functions:
  * - The ASM entry point: asm_##func
  * - The XEN PV trap entry point: xen_##func (maybe unused)
  * - The C handler called from the ASM entry point
+ * - The C handler used in the system interrupt handler table
  *
- * Maps to DECLARE_IDTENTRY().
+ * Maps to DECLARE_IDTENTRY(), plus a dispatch table function prototype
  */
 #define DECLARE_IDTENTRY_SYSVEC(vector, func)				\
-	DECLARE_IDTENTRY(vector, func)
+	DECLARE_IDTENTRY(vector, func);					\
+	void dispatch_table_##func(struct pt_regs *regs)
 
 /**
  * DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
+ *			    and their corresponding software based dispatch
+ *			    handlers in the non-noinstr text section
  * @func:	Function name of the entry point
  *
  * irqentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
  * function body. KVM L1D flush request is set.
  *
- * Runs the function on the interrupt stack if the entry hit kernel mode
+ * Runs the function on the interrupt stack if the entry hit kernel mode.
+ *
+ * dispatch_table_func() is used in the system interrupt handler table for
+ * system interrupts dispatching.
  */
 #define DEFINE_IDTENTRY_SYSVEC(func)					\
 static void __##func(struct pt_regs *regs);				\
@@ -244,11 +269,19 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	irqentry_exit(regs, state);					\
 }									\
 									\
+void dispatch_table_##func(struct pt_regs *regs)			\
+{									\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	run_sysvec_on_irqstack_cond(__##func, regs);			\
+}									\
+									\
 static noinline void __##func(struct pt_regs *regs)
 
 /**
  * DEFINE_IDTENTRY_SYSVEC_SIMPLE - Emit code for simple system vector IDT
- *				   entry points
+ *				   entry points and their corresponding
+ *				   software based dispatch handlers in
+ *				   the non-noinstr text section
  * @func:	Function name of the entry point
  *
  * Runs the function on the interrupted stack. No switch to IRQ stack and
@@ -256,6 +289,9 @@ static noinline void __##func(struct pt_regs *regs)
  *
  * Only use for 'empty' vectors like reschedule IPI and KVM posted
  * interrupt vectors.
+ *
+ * dispatch_table_func() is used in the system interrupt handler table for
+ * system interrupts dispatching.
  */
 #define DEFINE_IDTENTRY_SYSVEC_SIMPLE(func)				\
 static __always_inline void __##func(struct pt_regs *regs);		\
@@ -273,6 +309,14 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	irqentry_exit(regs, state);					\
 }									\
 									\
+void dispatch_table_##func(struct pt_regs *regs)			\
+{									\
+	__irq_enter_raw();						\
+	kvm_set_cpu_l1tf_flush_l1d();					\
+	__##func (regs);						\
+	__irq_exit_raw();						\
+}									\
+									\
 static __always_inline void __##func(struct pt_regs *regs)
 
 /**
@@ -634,9 +678,7 @@ DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
 
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
-#ifdef CONFIG_X86_LOCAL_APIC
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	spurious_interrupt);
-#endif
 
 /* System vector entry points */
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -647,7 +689,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
 #endif
 
 #ifdef CONFIG_SMP
-DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
+DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR,		sysvec_reschedule_ipi);
 DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,	sysvec_irq_move_cleanup);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..28c8ba5fd81c 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,11 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+/*
+ * How system interrupt handlers are called.
+ */
+#define DECLARE_SYSTEM_INTERRUPT_HANDLER(f)	\
+	void f (struct pt_regs *regs)
+typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..e4bdebdf05dd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1451,6 +1451,46 @@ DEFINE_IDTENTRY_SW(iret_error)
 }
 #endif
 
+#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] = y
+
+static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {
+#ifdef CONFIG_SMP
+	SYSV(RESCHEDULE_VECTOR,			dispatch_table_sysvec_reschedule_ipi),
+	SYSV(CALL_FUNCTION_VECTOR,		dispatch_table_sysvec_call_function),
+	SYSV(CALL_FUNCTION_SINGLE_VECTOR,	dispatch_table_sysvec_call_function_single),
+	SYSV(REBOOT_VECTOR,			dispatch_table_sysvec_reboot),
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	SYSV(THERMAL_APIC_VECTOR,		dispatch_table_sysvec_thermal),
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	SYSV(THRESHOLD_APIC_VECTOR,		dispatch_table_sysvec_threshold),
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+	SYSV(DEFERRED_ERROR_VECTOR,		dispatch_table_sysvec_deferred_error),
+#endif
+
+#ifdef CONFIG_X86_LOCAL_APIC
+	SYSV(LOCAL_TIMER_VECTOR,		dispatch_table_sysvec_apic_timer_interrupt),
+	SYSV(X86_PLATFORM_IPI_VECTOR,		dispatch_table_sysvec_x86_platform_ipi),
+# ifdef CONFIG_HAVE_KVM
+	SYSV(POSTED_INTR_VECTOR,		dispatch_table_sysvec_kvm_posted_intr_ipi),
+	SYSV(POSTED_INTR_WAKEUP_VECTOR,		dispatch_table_sysvec_kvm_posted_intr_wakeup_ipi),
+	SYSV(POSTED_INTR_NESTED_VECTOR,		dispatch_table_sysvec_kvm_posted_intr_nested_ipi),
+# endif
+# ifdef CONFIG_IRQ_WORK
+	SYSV(IRQ_WORK_VECTOR,			dispatch_table_sysvec_irq_work),
+# endif
+	SYSV(SPURIOUS_APIC_VECTOR,		dispatch_table_sysvec_spurious_apic_interrupt),
+	SYSV(ERROR_APIC_VECTOR,			dispatch_table_sysvec_error_interrupt),
+#endif
+};
+
+#undef SYSV
+
 void __init trap_init(void)
 {
 	/* Init cpu_entry_area before IST entries are set up */
-- 
2.34.1



* [PATCH v5 03/34] x86/traps: add install_system_interrupt_handler()
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
  2023-03-07  2:39 ` [PATCH v5 01/34] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR Xin Li
  2023-03-07  2:39 ` [PATCH v5 02/34] x86/traps: add a system interrupt table for system interrupt dispatch Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts Xin Li
                   ` (31 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

Some kernel components install system interrupt handlers into the IDT,
and we need to do the same for system_interrupt_handlers. A new function
install_system_interrupt_handler() is added to install a system interrupt
handler into both the IDT and system_interrupt_handlers.
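
As a rough user-space illustration of the "install into both tables" idea,
the sketch below keeps two arrays in step. It is not the kernel code:
demo_idt merely stands in for the IDT, the vector value is arbitrary, all
demo_* names are hypothetical, and the handler parameter is typed here
instead of being passed as const void * so the example stays within
standard C.

```c
#include <assert.h>

/* Illustrative stand-ins, not the kernel's definitions. */
#define NR_VECTORS             256
#define FIRST_SYSTEM_VECTOR    0xec
#define NR_SYSTEM_VECTORS      (NR_VECTORS - FIRST_SYSTEM_VECTOR)
#define DEMO_CALLBACK_VECTOR   0xf3	/* hypothetical vector number */

struct pt_regs;				/* opaque in this sketch */

typedef void (*system_interrupt_handler)(struct pt_regs *regs);

/* Stand-in for the IDT: vector -> address of the asm entry stub. */
const void *demo_idt[NR_VECTORS];

/* Software dispatch table: (vector - FIRST_SYSTEM_VECTOR) -> C handler. */
system_interrupt_handler demo_handlers[NR_SYSTEM_VECTORS];

/* Placeholder objects that play the role of an asm stub and a C handler. */
const char demo_asm_stub = 0;

void demo_hv_callback(struct pt_regs *regs)
{
	(void)regs;
}

/*
 * Register one system vector in both tables so that hardware (IDT) and
 * software (table) dispatch agree on the handler.
 */
void demo_install_system_interrupt_handler(unsigned int n,
					    const void *asm_addr,
					    system_interrupt_handler handler)
{
	/* The patch uses BUG_ON() for the lower bound; assert() mimics it. */
	assert(n >= FIRST_SYSTEM_VECTOR && n < NR_VECTORS);

	demo_handlers[n - FIRST_SYSTEM_VECTOR] = handler;
	demo_idt[n] = asm_addr;
}
```

A caller would pass the vector, the entry stub, and the C handler together,
for example demo_install_system_interrupt_handler(DEMO_CALLBACK_VECTOR,
&demo_asm_stub, demo_hv_callback). The upper bound check is an addition of
this sketch, not something the patch does.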

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/traps.h     |  2 ++
 arch/x86/kernel/cpu/acrn.c       |  7 +++++--
 arch/x86/kernel/cpu/mshyperv.c   | 22 ++++++++++++++--------
 arch/x86/kernel/kvm.c            |  4 +++-
 arch/x86/kernel/traps.c          |  8 ++++++++
 drivers/xen/events/events_base.c |  5 ++++-
 6 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 28c8ba5fd81c..46f5e4e2a346 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -41,6 +41,8 @@ void math_emulate(struct math_emu_info *);
 
 bool fault_in_kernel_space(unsigned long address);
 
+void install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr);
+
 #ifdef CONFIG_VMAP_STACK
 void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      unsigned long fault_address,
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 485441b7f030..9351bf183a9e 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -18,6 +18,7 @@
 #include <asm/hypervisor.h>
 #include <asm/idtentry.h>
 #include <asm/irq_regs.h>
+#include <asm/traps.h>
 
 static u32 __init acrn_detect(void)
 {
@@ -26,8 +27,10 @@ static u32 __init acrn_detect(void)
 
 static void __init acrn_init_platform(void)
 {
-	/* Setup the IDT for ACRN hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_acrn_hv_callback);
+	/* Install system interrupt handler for ACRN hypervisor callback */
+	install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+					 asm_sysvec_acrn_hv_callback,
+					 sysvec_acrn_hv_callback);
 
 	x86_platform.calibrate_tsc = acrn_get_tsc_khz;
 	x86_platform.calibrate_cpu = acrn_get_tsc_khz;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f36dc2f796c5..63282f4bfdcd 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -29,6 +29,7 @@
 #include <asm/i8259.h>
 #include <asm/apic.h>
 #include <asm/timer.h>
+#include <asm/traps.h>
 #include <asm/reboot.h>
 #include <asm/nmi.h>
 #include <clocksource/hyperv_timer.h>
@@ -487,19 +488,24 @@ static void __init ms_hyperv_init_platform(void)
 	 */
 	x86_platform.apic_post_init = hyperv_init;
 	hyperv_setup_mmu_ops();
-	/* Setup the IDT for hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_hyperv_callback);
 
-	/* Setup the IDT for reenlightenment notifications */
+	/* Install system interrupt handler for hypervisor callback */
+	install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+					 asm_sysvec_hyperv_callback,
+					 sysvec_hyperv_callback);
+
+	/* Install system interrupt handler for reenlightenment notifications */
 	if (ms_hyperv.features & HV_ACCESS_REENLIGHTENMENT) {
-		alloc_intr_gate(HYPERV_REENLIGHTENMENT_VECTOR,
-				asm_sysvec_hyperv_reenlightenment);
+		install_system_interrupt_handler(HYPERV_REENLIGHTENMENT_VECTOR,
+						 asm_sysvec_hyperv_reenlightenment,
+						 sysvec_hyperv_reenlightenment);
 	}
 
-	/* Setup the IDT for stimer0 */
+	/* Install system interrupt handler for stimer0 */
 	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE) {
-		alloc_intr_gate(HYPERV_STIMER0_VECTOR,
-				asm_sysvec_hyperv_stimer0);
+		install_system_interrupt_handler(HYPERV_STIMER0_VECTOR,
+						 asm_sysvec_hyperv_stimer0,
+						 sysvec_hyperv_stimer0);
 	}
 
 # ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984da..5c684df6de7a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -829,7 +829,9 @@ static void __init kvm_guest_init(void)
 
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
 		static_branch_enable(&kvm_async_pf_enabled);
-		alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_kvm_asyncpf_interrupt);
+		install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+						 asm_sysvec_kvm_asyncpf_interrupt,
+						 sysvec_kvm_asyncpf_interrupt);
 	}
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e4bdebdf05dd..c0f7666140da 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1491,6 +1491,14 @@ static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {
 
 #undef SYSV
 
+void __init install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr)
+{
+	BUG_ON(n < FIRST_SYSTEM_VECTOR);
+
+	system_interrupt_handlers[n - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)addr;
+	alloc_intr_gate(n, asm_addr);
+}
+
 void __init trap_init(void)
 {
 	/* Init cpu_entry_area before IST entries are set up */
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd452..cf1a5ca3bf62 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -45,6 +45,7 @@
 #include <asm/irq.h>
 #include <asm/io_apic.h>
 #include <asm/i8259.h>
+#include <asm/traps.h>
 #include <asm/xen/cpuid.h>
 #include <asm/xen/pci.h>
 #endif
@@ -2249,7 +2250,9 @@ static __init void xen_alloc_callback_vector(void)
 		return;
 
 	pr_info("Xen HVM callback vector for event delivery is enabled\n");
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_xen_hvm_callback);
+	install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+					 asm_sysvec_xen_hvm_callback,
+					 sysvec_xen_hvm_callback);
 }
 #else
 void xen_setup_callback_vector(void) {}
-- 
2.34.1



* [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (2 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 03/34] x86/traps: add install_system_interrupt_handler() Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-20 15:36   ` Peter Zijlstra
  2023-03-07  2:39 ` [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection Xin Li
                   ` (30 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add external_interrupt() to dispatch external interrupts to their
handlers. If an external interrupt is a system interrupt, dispatch
it through the system_interrupt_handlers table, otherwise call into
dispatch_common_interrupt().
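
The routing decision relies on unsigned arithmetic: subtracting
FIRST_SYSTEM_VECTOR from a smaller vector wraps around to a very large
value, so a single "sysvec < NR_SYSTEM_VECTORS" comparison covers both
bounds. The user-space sketch below models only that classification. The
vector constants are illustrative assumptions, and demo_classify(),
demo_mark_installed(), and enum demo_route are hypothetical names that do
not exist in the kernel.

```c
#include <assert.h>

/* Illustrative stand-ins, not the kernel's definitions. */
#define NR_VECTORS             256
#define FIRST_EXTERNAL_VECTOR  0x20
#define FIRST_SYSTEM_VECTOR    0xec
#define NR_SYSTEM_VECTORS      (NR_VECTORS - FIRST_SYSTEM_VECTOR)

enum demo_route {
	DEMO_INVALID,	/* below FIRST_EXTERNAL_VECTOR: rejected */
	DEMO_SYSTEM,	/* system vector with a handler installed */
	DEMO_SPURIOUS,	/* system vector without a handler */
	DEMO_COMMON,	/* device interrupt path */
};

/* Nonzero means a handler is present for that system vector slot. */
static int demo_installed[NR_SYSTEM_VECTORS];

void demo_mark_installed(unsigned int vector)
{
	assert(vector >= FIRST_SYSTEM_VECTOR && vector < NR_VECTORS);
	demo_installed[vector - FIRST_SYSTEM_VECTOR] = 1;
}

enum demo_route demo_classify(unsigned int vector)
{
	/* Wraps to a huge value when vector < FIRST_SYSTEM_VECTOR. */
	unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;

	if (vector < FIRST_EXTERNAL_VECTOR)
		return DEMO_INVALID;

	/* One comparison checks both ends of the system vector range. */
	if (sysvec < NR_SYSTEM_VECTORS)
		return demo_installed[sysvec] ? DEMO_SYSTEM : DEMO_SPURIOUS;

	return DEMO_COMMON;
}
```

With only vector 0xec marked as installed, 0x30 is classified as a device
interrupt, 0xec as a system interrupt, and 0xff (a system vector with an
empty slot) as spurious, which mirrors the fallback to
dispatch_spurious_interrupt() in the diff below.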

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/traps.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c0f7666140da..31ad645be2fb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1499,6 +1499,47 @@ void __init install_system_interrupt_handler(unsigned int n, const void *asm_add
 	alloc_intr_gate(n, asm_addr);
 }
 
+#ifndef CONFIG_X86_LOCAL_APIC
+/*
+ * Used when local APIC is not compiled into the kernel, but
+ * external_interrupt() needs dispatch_spurious_interrupt().
+ */
+DEFINE_IDTENTRY_IRQ(spurious_interrupt)
+{
+	pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
+		vector, smp_processor_id());
+}
+#endif
+
+/*
+ * External interrupt dispatch function.
+ *
+ * Until/unless dispatch_common_interrupt() can be taught to deal with the
+ * special system vectors, split the dispatch.
+ *
+ * Note: dispatch_common_interrupt() already deals with IRQ_MOVE_CLEANUP_VECTOR.
+ */
+int external_interrupt(struct pt_regs *regs, unsigned int vector)
+{
+	unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;
+
+	if (vector < FIRST_EXTERNAL_VECTOR) {
+		pr_err("invalid external interrupt vector %d\n", vector);
+		return -EINVAL;
+	}
+
+	if (sysvec < NR_SYSTEM_VECTORS) {
+		if (system_interrupt_handlers[sysvec])
+			system_interrupt_handlers[sysvec](regs);
+		else
+			dispatch_spurious_interrupt(regs, vector);
+	} else {
+		dispatch_common_interrupt(regs, vector);
+	}
+
+	return 0;
+}
+
 void __init trap_init(void)
 {
 	/* Init cpu_entry_area before IST entries are set up */
-- 
2.34.1



* [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (3 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-22 17:52   ` Sean Christopherson
  2023-03-07  2:39 ` [PATCH v5 06/34] x86/cpufeature: add the cpu feature bit for FRED Xin Li
                   ` (29 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

Export external_interrupt() so that KVM VMX can reinject IRQs by calling
it directly instead of dispatching them through the IDT.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/traps.h |  2 ++
 arch/x86/kernel/traps.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 46f5e4e2a346..da4c21ed68b4 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -56,4 +56,6 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 	void f (struct pt_regs *regs)
 typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
 
+int external_interrupt(struct pt_regs *regs, unsigned int vector);
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 31ad645be2fb..cebba1f49e19 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1540,6 +1540,20 @@ int external_interrupt(struct pt_regs *regs, unsigned int vector)
 	return 0;
 }
 
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+/*
+ * KVM VMX reinjects IRQs on its current stack as a synchronous call,
+ * so the values in the pt_regs structure are not used when executing
+ * IRQ handlers, except for cs.RPL and flags.IF, which are both
+ * always 0 in the VMX IRQ reinjection context.
+ *
+ * However, the pt_regs structure is sometimes used in stack dumps,
+ * e.g., by show_regs(). So let the caller, i.e., KVM VMX, decide
+ * how to initialize the input pt_regs structure.
+ */
+EXPORT_SYMBOL_GPL(external_interrupt);
+#endif
+
 void __init trap_init(void)
 {
 	/* Init cpu_entry_area before IST entries are set up */
-- 
2.34.1



* [PATCH v5 06/34] x86/cpufeature: add the cpu feature bit for FRED
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (4 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 07/34] x86/opcode: add ERETU, ERETS instructions to x86-opcode-map Xin Li
                   ` (28 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the CPU feature bit for FRED (Flexible Return and Event Delivery).

The Intel flexible return and event delivery (FRED) architecture defines simple
new transitions that change privilege level (ring transitions).  The FRED
architecture was designed with the following goals:
1) Improve overall performance and response time by replacing event delivery
through the interrupt descriptor table (IDT event delivery) and event return by
the IRET instruction with lower latency transitions.
2) Improve software robustness by ensuring that event delivery establishes the
full supervisor context and that event return establishes the full user context.

The new transitions defined by the FRED architecture are FRED event delivery and,
for returning from events, two FRED return instructions. FRED event delivery can
effect a transition from ring 3 to ring 0, but it is used also to deliver events
incident to ring 0. One FRED instruction (ERETU) effects a return from ring 0 to
ring 3, while the other (ERETS) returns while remaining in ring 0.

Search for the latest FRED spec in most search engines with this search pattern:

  site:intel.com FRED (flexible return and event delivery) specification

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/cpufeatures.h       | 1 +
 tools/arch/x86/include/asm/cpufeatures.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 73c9672c123b..1fa444478d33 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -318,6 +318,7 @@
 #define X86_FEATURE_FZRM		(12*32+10) /* "" Fast zero-length REP MOVSB */
 #define X86_FEATURE_FSRS		(12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
+#define X86_FEATURE_FRED		(12*32+17) /* Flexible Return and Event Delivery */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
 #define X86_FEATURE_AVX_IFMA            (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index b70111a75688..b2218a7a0927 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -312,6 +312,7 @@
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
 #define X86_FEATURE_AVX512_BF16		(12*32+ 5) /* AVX512 BFLOAT16 instructions */
 #define X86_FEATURE_CMPCCXADD           (12*32+ 7) /* "" CMPccXADD instructions */
+#define X86_FEATURE_FRED		(12*32+17) /* Flexible Return and Event Delivery */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
 #define X86_FEATURE_AVX_IFMA            (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
-- 
2.34.1
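
The (12*32+17) encoding packs a capability word index and a bit position
into one number. The sketch below splits it apart again; word 12 is assumed
here to hold EAX of CPUID leaf 7, subleaf 1, and the helper names are
hypothetical rather than kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same encoding as in the patch: word 12, bit 17. */
#define X86_FEATURE_FRED	(12*32+17)

static_assert(X86_FEATURE_FRED == 401, "word 12, bit 17");

/* Index of the 32-bit capability word that holds the feature. */
static unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit position of the feature inside that word. */
static unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}

/*
 * Test a feature against a raw capability word. For word 12 the raw
 * value is assumed to be EAX of CPUID leaf 7, subleaf 1.
 */
static bool feature_set_in_word(uint32_t word_value, unsigned int feature)
{
	return (word_value >> feature_bit(feature)) & 1u;
}
```

For example, feature_word(X86_FEATURE_FRED) yields 12 and
feature_bit(X86_FEATURE_FRED) yields 17.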



* [PATCH v5 07/34] x86/opcode: add ERETU, ERETS instructions to x86-opcode-map
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (5 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 06/34] x86/cpufeature: add the cpu feature bit for FRED Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 08/34] x86/objtool: teach objtool about ERETU and ERETS Xin Li
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the instruction opcodes used by FRED: ERETU, ERETS.

The opcode numbers are per the public FRED draft spec v3.0.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/lib/x86-opcode-map.txt       | 2 +-
 tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..7a269e269dc0 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable
 
 GrpTable: Grp7
 0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
 4: SMSW Mw/Rv
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..7a269e269dc0 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable
 
 GrpTable: Grp7
 0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
 4: SMSW Mw/Rv
-- 
2.34.1
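
In the opcode map, an entry such as "ERETU (F3),(010),(11B)" in row 1 of
Grp7 describes a ModRM byte with mod = 11B, reg = 1 (the row number) and
rm = 010, preceded by an F3 prefix. The sketch below shows how those fields
combine into the byte 0xca; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* mod = 11B, reg = 001 (Grp7 row 1), rm = 010 gives the ERETU/ERETS ModRM. */
static_assert(((0x3 << 6) | (0x1 << 3) | 0x2) == 0xca, "ModRM for ERETx");

struct modrm_fields {
	uint8_t mod;	/* bits 7:6, 11B means register form */
	uint8_t reg;	/* bits 5:3, selects the row within a group table */
	uint8_t rm;	/* bits 2:0 */
};

/* Split a ModRM byte into its three fields. */
static struct modrm_fields decode_modrm(uint8_t modrm)
{
	struct modrm_fields f = {
		.mod = modrm >> 6,
		.reg = (modrm >> 3) & 0x7,
		.rm  = modrm & 0x7,
	};
	return f;
}

/* Build a ModRM byte from its three fields. */
static uint8_t encode_modrm(uint8_t mod, uint8_t reg, uint8_t rm)
{
	return (uint8_t)(((mod & 0x3) << 6) | ((reg & 0x7) << 3) | (rm & 0x7));
}
```

CLAC shares the same ModRM byte without a prefix, which is why the F3 and F2
prefixes are what distinguish ERETU and ERETS from it.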



* [PATCH v5 08/34] x86/objtool: teach objtool about ERETU and ERETS
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (6 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 07/34] x86/opcode: add ERETU, ERETS instructions to x86-opcode-map Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 09/34] x86/cpu: add X86_CR4_FRED macro Xin Li
                   ` (26 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Update the objtool decoder to know about the ERETU and ERETS
instructions (type INSN_CONTEXT_SWITCH).

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 tools/objtool/arch/x86/decode.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 9ef024fd648c..8e9c802f78ec 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -509,11 +509,20 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec
 
 		if (op2 == 0x01) {
 
-			if (modrm == 0xca)
-				insn->type = INSN_CLAC;
-			else if (modrm == 0xcb)
-				insn->type = INSN_STAC;
-
+			switch (insn_last_prefix_id(&ins)) {
+			case INAT_PFX_REPE:
+			case INAT_PFX_REPNE:
+				if (modrm == 0xca)
+					/* eretu/erets */
+					insn->type = INSN_CONTEXT_SWITCH;
+				break;
+			default:
+				if (modrm == 0xca)
+					insn->type = INSN_CLAC;
+				else if (modrm == 0xcb)
+					insn->type = INSN_STAC;
+				break;
+			}
 		} else if (op2 >= 0x80 && op2 <= 0x8f) {
 
 			insn->type = INSN_JUMP_CONDITIONAL;
-- 
2.34.1
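
The decoder change keys on the last prefix because ERETU, ERETS and CLAC
all end in the bytes 0f 01 ca. A simplified, standalone sketch of that
distinction on raw bytes follows. It is not objtool code: classify_0f01()
and enum insn_kind are hypothetical, and only a single optional F2/F3
prefix is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification result for the 0F 01 CA/CB family. */
enum insn_kind { KIND_OTHER, KIND_CLAC, KIND_STAC, KIND_ERETU, KIND_ERETS };

/*
 * Classify a raw byte sequence. A real decoder must also cope with other
 * legacy prefixes and with REX; this sketch deliberately does not.
 */
static enum insn_kind classify_0f01(const uint8_t *buf, size_t len)
{
	uint8_t prefix = 0;

	assert(buf != NULL);

	/* Consume one optional REPNE (F2) or REPE (F3) prefix. */
	if (len > 0 && (buf[0] == 0xf2 || buf[0] == 0xf3)) {
		prefix = buf[0];
		buf++;
		len--;
	}

	if (len < 3 || buf[0] != 0x0f || buf[1] != 0x01)
		return KIND_OTHER;

	switch (prefix) {
	case 0xf3:
		return buf[2] == 0xca ? KIND_ERETU : KIND_OTHER;
	case 0xf2:
		return buf[2] == 0xca ? KIND_ERETS : KIND_OTHER;
	default:
		if (buf[2] == 0xca)
			return KIND_CLAC;
		if (buf[2] == 0xcb)
			return KIND_STAC;
		return KIND_OTHER;
	}
}
```

Unlike this sketch, the patch does not need to tell ERETU and ERETS apart:
both are treated as INSN_CONTEXT_SWITCH.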



* [PATCH v5 09/34] x86/cpu: add X86_CR4_FRED macro
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (7 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 08/34] x86/objtool: teach objtool about ERETU and ERETS Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 10/34] x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED) Xin Li
                   ` (25 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the X86_CR4_FRED macro for the FRED bit in %cr4. This bit should be
pinned, i.e., it must not change after initialization.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c                | 11 ++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index c47cc7f2feeb..a90933f1ff41 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -132,6 +132,8 @@
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 #define X86_CR4_CET_BIT		23 /* enable Control-flow Enforcement Technology */
 #define X86_CR4_CET		_BITUL(X86_CR4_CET_BIT)
+#define X86_CR4_FRED_BIT	32 /* enable FRED kernel entry */
+#define X86_CR4_FRED		_BITULL(X86_CR4_FRED_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8cd4126d8253..e8cf6f4cfb52 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -412,10 +412,15 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 	cr4_clear_bits(X86_CR4_UMIP);
 }
 
-/* These bits should not change their value after CPU init is finished. */
+/*
+ * These bits should not change their value after CPU init is finished.
+ * The explicit cast to unsigned long suppresses a warning on i386 for
+ * x86-64 only feature bits >= 32.
+ */
 static const unsigned long cr4_pinned_mask =
-	X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
-	X86_CR4_FSGSBASE | X86_CR4_CET;
+	(unsigned long)
+	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
+	 X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED);
 static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
 static unsigned long cr4_pinned_bits __ro_after_init;
 
-- 
2.34.1
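
The cast in cr4_pinned_mask matters because bit 32 does not fit into a
32-bit unsigned long. The sketch below models both widths with fixed-size
types so the truncation is visible on any host. Using 1ULL for every bit is
a simplification of the patch's _BITUL()/_BITULL() mix, and the function and
BIT_ULL/CR4_PINNED_MASK_WIDE names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BIT_ULL(n)		(1ULL << (n))

/* CR4 bit positions; CET and FRED are as shown in the patch. */
#define X86_CR4_UMIP		BIT_ULL(11)
#define X86_CR4_FSGSBASE	BIT_ULL(16)
#define X86_CR4_SMEP		BIT_ULL(20)
#define X86_CR4_SMAP		BIT_ULL(21)
#define X86_CR4_CET		BIT_ULL(23)
#define X86_CR4_FRED		BIT_ULL(32)

#define CR4_PINNED_MASK_WIDE \
	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP | \
	 X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED)

/* Bit 32 vanishes when narrowed to 32 bits, as on i386. */
static_assert((uint32_t)X86_CR4_FRED == 0, "FRED bit needs 64 bits");

/* What a 64-bit unsigned long would hold. */
static uint64_t pinned_mask_64(void)
{
	return (uint64_t)CR4_PINNED_MASK_WIDE;
}

/* What a 32-bit unsigned long would hold after the explicit cast. */
static uint32_t pinned_mask_32(void)
{
	return (uint32_t)CR4_PINNED_MASK_WIDE;
}
```

The 32-bit result simply drops the FRED bit, which is harmless there
because FRED is an x86-64 only feature.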



* [PATCH v5 10/34] x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED)
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (8 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 09/34] x86/cpu: add X86_CR4_FRED macro Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 11/34] x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support Xin Li
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the configuration option CONFIG_X86_FRED to enable FRED.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/Kconfig | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..da62178bb246 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -500,6 +500,15 @@ config X86_CPU_RESCTRL
 
 	  Say N if unsure.
 
+config X86_FRED
+	bool "Flexible Return and Event Delivery"
+	depends on X86_64
+	help
+	  When enabled, try to use Flexible Return and Event Delivery
+	  instead of the legacy SYSCALL/SYSENTER/IDT architecture for
+	  ring transitions and exception/interrupt handling if the
+	  system supports it.
+
 if X86_32
 config X86_BIGSMP
 	bool "Support for big SMP systems with more than 8 CPUs"
-- 
2.34.1



* [PATCH v5 11/34] x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (9 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 10/34] x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED) Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 12/34] x86/cpu: add MSR numbers for FRED configuration Xin Li
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add CONFIG_X86_FRED to <asm/disabled-features.h> to make
cpu_feature_enabled() work correctly with FRED.

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/disabled-features.h       | 8 +++++++-
 tools/arch/x86/include/asm/disabled-features.h | 8 +++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..56838de9cb23 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED 0
+#else
+# define DISABLE_FRED (1 << (X86_FEATURE_FRED & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -115,7 +121,7 @@
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING)
-#define DISABLED_MASK12	0
+#define DISABLED_MASK12	(DISABLE_FRED)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index c44b56f7ffba..2d3ec539dcc7 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED 0
+#else
+# define DISABLE_FRED (1 << (X86_FEATURE_FRED & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -115,7 +121,7 @@
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING)
-#define DISABLED_MASK12	0
+#define DISABLED_MASK12	(DISABLE_FRED)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-- 
2.34.1
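
The DISABLE_FRED definition turns the Kconfig choice into a bit in the
word-12 disabled mask. A rough standalone model of that mechanism follows.
SKETCH_CONFIG_X86_FRED stands in for CONFIG_X86_FRED and
word12_feature_disabled() is a hypothetical helper; the real
cpu_feature_enabled() machinery is more involved.

```c
#include <assert.h>
#include <stdbool.h>

#define X86_FEATURE_FRED	(12*32+17)

/* Stand-in for the Kconfig symbol; set to 1 to model CONFIG_X86_FRED=y. */
#ifndef SKETCH_CONFIG_X86_FRED
#define SKETCH_CONFIG_X86_FRED	0
#endif

#if SKETCH_CONFIG_X86_FRED
# define DISABLE_FRED	0
#else
# define DISABLE_FRED	(1 << (X86_FEATURE_FRED & 31))
#endif

#define DISABLED_MASK12	(DISABLE_FRED)

/* Hypothetical helper: is this word-12 feature compiled out? */
static bool word12_feature_disabled(unsigned int feature)
{
	assert(feature / 32 == 12);
	return (DISABLED_MASK12 >> (feature & 31)) & 1;
}
```

Because the mask is a compile-time constant, a check against a disabled
feature can be folded to false by the compiler and the dependent code
dropped.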



* [PATCH v5 12/34] x86/cpu: add MSR numbers for FRED configuration
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (10 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 11/34] x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 13/34] x86/fred: header file for event types Xin Li
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add MSR numbers for the FRED configuration registers.

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/msr-index.h       | 13 ++++++++++++-
 tools/arch/x86/include/asm/msr-index.h | 13 ++++++++++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ad35355ee43e..87db728f8bbc 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0	0x1cc /* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1	0x1cd /* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2	0x1ce /* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3	0x1cf /* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS	0x1d0 /* Exception stack levels */
+#define MSR_IA32_FRED_SSP0	MSR_IA32_PL0_SSP /* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1	0x1d1 /* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2	0x1d2 /* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3	0x1d3 /* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG	0x1d4 /* Entrypoint and interrupt stack level */
 
+/* Intel MSRs. Some also available on other CPUs */
 #define MSR_TEST_CTRL				0x00000033
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT	29
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT		BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 37ff47552bcb..0ade66db3627 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -34,8 +34,19 @@
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
 
-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0	0x1cc /* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1	0x1cd /* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2	0x1ce /* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3	0x1cf /* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS	0x1d0 /* Exception stack levels */
+#define MSR_IA32_FRED_SSP0	MSR_IA32_PL0_SSP /* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1	0x1d1 /* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2	0x1d2 /* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3	0x1d3 /* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG	0x1d4 /* Entrypoint and interrupt stack level */
 
+/* Intel MSRs. Some also available on other CPUs */
 #define MSR_TEST_CTRL				0x00000033
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT	29
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT		BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
-- 
2.34.1
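
The stack pointer MSRs for levels 0 to 3 are numbered contiguously, while
the shadow stack pointer MSRs are contiguous only for levels 1 to 3 because
level 0 reuses MSR_IA32_PL0_SSP. The following sketch captures that
numbering; fred_rsp_msr() and fred_ssp_msr() are hypothetical helpers that
do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* MSR numbers as defined in the patch. */
#define MSR_IA32_FRED_RSP0	0x1cc
#define MSR_IA32_FRED_RSP3	0x1cf
#define MSR_IA32_FRED_SSP1	0x1d1
#define MSR_IA32_FRED_SSP3	0x1d3

static_assert(MSR_IA32_FRED_RSP3 == MSR_IA32_FRED_RSP0 + 3,
	      "RSP0..RSP3 are contiguous");
static_assert(MSR_IA32_FRED_SSP3 == MSR_IA32_FRED_SSP1 + 2,
	      "SSP1..SSP3 are contiguous");

/* Hypothetical helper: MSR holding the regular stack pointer of a level. */
static uint32_t fred_rsp_msr(unsigned int level)
{
	assert(level <= 3);
	return MSR_IA32_FRED_RSP0 + level;
}

/* Hypothetical helper: shadow stack pointer MSR, valid for levels 1..3. */
static uint32_t fred_ssp_msr(unsigned int level)
{
	assert(level >= 1 && level <= 3);
	return MSR_IA32_FRED_SSP1 + (level - 1);
}
```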



* [PATCH v5 13/34] x86/fred: header file for event types
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (11 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 12/34] x86/cpu: add MSR numbers for FRED configuration Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 14/34] x86/fred: header file with FRED definitions Xin Li
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

FRED inherits the Intel VT-x enhancement of classified events with
a two-level event dispatch logic. The first-level dispatch is on
the event type, not the event vector as used in the IDT architecture.
This also means that vectors in different event types are orthogonal,
e.g., vectors 0x10-0x1f become available as hardware interrupts.

Add a header file for event types, and also use it in <asm/vmx.h>.

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/event-type.h | 17 +++++++++++++++++
 arch/x86/include/asm/vmx.h        | 17 +++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)
 create mode 100644 arch/x86/include/asm/event-type.h

diff --git a/arch/x86/include/asm/event-type.h b/arch/x86/include/asm/event-type.h
new file mode 100644
index 000000000000..fedaa0e492c5
--- /dev/null
+++ b/arch/x86/include/asm/event-type.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_EVENT_TYPE_H
+#define _ASM_X86_EVENT_TYPE_H
+
+/*
+ * Event type codes: these are the same as those used by VT-x.
+ */
+#define EVENT_TYPE_HWINT	0	/* Maskable external interrupt */
+#define EVENT_TYPE_RESERVED	1
+#define EVENT_TYPE_NMI		2	/* Non-maskable interrupt */
+#define EVENT_TYPE_HWFAULT	3	/* Hardware exceptions (e.g., page fault) */
+#define EVENT_TYPE_SWINT	4	/* Software interrupt (INT n) */
+#define EVENT_TYPE_PRIVSW	5	/* INT1 (ICEBP) */
+#define EVENT_TYPE_SWFAULT	6	/* Software exception (INT3 or INTO) */
+#define EVENT_TYPE_OTHER	7	/* FRED: SYSCALL/SYSENTER */
+
+#endif
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..8d9b8b0d8e56 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -15,6 +15,7 @@
 #include <linux/bitops.h>
 #include <linux/types.h>
 #include <uapi/asm/vmx.h>
+#include <asm/event-type.h>
 #include <asm/vmxfeatures.h>
 
 #define VMCS_CONTROL_BIT(x)	BIT(VMX_FEATURE_##x & 0x1f)
@@ -372,14 +373,14 @@ enum vmcs_field {
 #define VECTORING_INFO_DELIVER_CODE_MASK    	INTR_INFO_DELIVER_CODE_MASK
 #define VECTORING_INFO_VALID_MASK       	INTR_INFO_VALID_MASK
 
-#define INTR_TYPE_EXT_INTR              (0 << 8) /* external interrupt */
-#define INTR_TYPE_RESERVED              (1 << 8) /* reserved */
-#define INTR_TYPE_NMI_INTR		(2 << 8) /* NMI */
-#define INTR_TYPE_HARD_EXCEPTION	(3 << 8) /* processor exception */
-#define INTR_TYPE_SOFT_INTR             (4 << 8) /* software interrupt */
-#define INTR_TYPE_PRIV_SW_EXCEPTION	(5 << 8) /* ICE breakpoint - undocumented */
-#define INTR_TYPE_SOFT_EXCEPTION	(6 << 8) /* software exception */
-#define INTR_TYPE_OTHER_EVENT           (7 << 8) /* other event */
+#define INTR_TYPE_EXT_INTR		(EVENT_TYPE_HWINT << 8)		/* external interrupt */
+#define INTR_TYPE_RESERVED		(EVENT_TYPE_RESERVED << 8)	/* reserved */
+#define INTR_TYPE_NMI_INTR		(EVENT_TYPE_NMI << 8)		/* NMI */
+#define INTR_TYPE_HARD_EXCEPTION	(EVENT_TYPE_HWFAULT << 8)	/* processor exception */
+#define INTR_TYPE_SOFT_INTR		(EVENT_TYPE_SWINT << 8)		/* software interrupt */
+#define INTR_TYPE_PRIV_SW_EXCEPTION	(EVENT_TYPE_PRIVSW << 8)	/* ICE breakpoint - undocumented */
+#define INTR_TYPE_SOFT_EXCEPTION	(EVENT_TYPE_SWFAULT << 8)	/* software exception */
+#define INTR_TYPE_OTHER_EVENT		(EVENT_TYPE_OTHER << 8)		/* other event */
 
 /* GUEST_INTERRUPTIBILITY_INFO flags. */
 #define GUEST_INTR_STATE_STI		0x00000001
-- 
2.34.1
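
Since the INTR_TYPE_* values are the event types shifted left by 8, an
event type can be recovered from a VMX interruption-information word with a
mask and a shift. A minimal sketch follows; the two mask names match those
in <asm/vmx.h>, the helper names are hypothetical, and the valid bit
(bit 31) is ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Event types from the new header (subset). */
#define EVENT_TYPE_HWINT	0
#define EVENT_TYPE_NMI		2
#define EVENT_TYPE_HWFAULT	3

#define INTR_INFO_VECTOR_MASK		0xff
#define INTR_INFO_INTR_TYPE_MASK	0x700

static_assert((EVENT_TYPE_HWFAULT << 8) == 0x300, "type sits in bits 10:8");

/* Hypothetical helper: first-level dispatch key, the event type. */
static unsigned int intr_info_event_type(uint32_t intr_info)
{
	return (intr_info & INTR_INFO_INTR_TYPE_MASK) >> 8;
}

/* Hypothetical helper: second-level dispatch key, the vector. */
static unsigned int intr_info_vector(uint32_t intr_info)
{
	return intr_info & INTR_INFO_VECTOR_MASK;
}

/* Hypothetical helper: combine a type and a vector into one word. */
static uint32_t make_intr_info(unsigned int type, unsigned int vector)
{
	return ((uint32_t)(type & 0x7) << 8) | (vector & INTR_INFO_VECTOR_MASK);
}
```

The same vector number under two different event types produces two
distinct words, which is the orthogonality the commit message refers to.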



* [PATCH v5 14/34] x86/fred: header file with FRED definitions
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (12 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 13/34] x86/fred: header file for event types Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 15/34] x86/fred: make unions for the cs and ss fields in struct pt_regs Xin Li
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add a header file for FRED prototypes and definitions.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/fred.h | 101 ++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 arch/x86/include/asm/fred.h

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
new file mode 100644
index 000000000000..2f337162da73
--- /dev/null
+++ b/arch/x86/include/asm/fred.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * arch/x86/include/asm/fred.h
+ *
+ * Macros for Flexible Return and Event Delivery (FRED)
+ */
+
+#ifndef ASM_X86_FRED_H
+#define ASM_X86_FRED_H
+
+#ifdef CONFIG_X86_FRED
+
+#include <linux/const.h>
+#include <asm/asm.h>
+
+/*
+ * FRED return instructions
+ *
+ * Replace with "ERETS"/"ERETU" once binutils supports the FRED return
+ * instructions. The binutils version that will support them is still
+ * TBD; this comment will be updated once it is known.
+ */
+#define ERETS			_ASM_BYTES(0xf2,0x0f,0x01,0xca)
+#define ERETU			_ASM_BYTES(0xf3,0x0f,0x01,0xca)
+
+/*
+ * Event stack level macro for the FRED_STKLVLS MSR.
+ * Usage example: FRED_STKLVL(X86_TRAP_DF, 3)
+ * Multiple values can be ORed together.
+ */
+#define FRED_STKLVL(v,l)	(_AT(unsigned long, l) << (2*(v)))
+
+/* FRED_CONFIG MSR */
+#define FRED_CONFIG_CSL_MASK		0x3
+#define FRED_CONFIG_SHADOW_STACK_SPACE	_BITUL(3)
+#define FRED_CONFIG_REDZONE(b)		__ALIGN_KERNEL_MASK((b), _UL(0x3f))
+#define FRED_CONFIG_INT_STKLVL(l)	(_AT(unsigned long, l) << 9)
+#define FRED_CONFIG_ENTRYPOINT(p)	_AT(unsigned long, (p))
+
+/* FRED event type and vector bit width and counts */
+#define FRED_EVENT_TYPE_BITS		3 /* only 3 bits used in FRED 3.0 */
+#define FRED_EVENT_TYPE_COUNT		_BITUL(FRED_EVENT_TYPE_BITS)
+#define FRED_EVENT_VECTOR_BITS		8
+#define FRED_EVENT_VECTOR_COUNT		_BITUL(FRED_EVENT_VECTOR_BITS)
+
+/* FRED EVENT_TYPE_OTHER vector numbers */
+#define FRED_SYSCALL			1
+#define FRED_SYSENTER			2
+
+/* Flags above the CS selector (regs->csx) */
+#define FRED_CSL_ENABLE_NMI		_BITUL(28)
+#define FRED_CSL_ALLOW_SINGLE_STEP	_BITUL(25)
+#define FRED_CSL_INTERRUPT_SHADOW	_BITUL(24)
+
+#ifndef __ASSEMBLY__
+
+#include <linux/kernel.h>
+#include <asm/ptrace.h>
+
+/* FRED stack frame information */
+struct fred_info {
+	unsigned long edata;	/* Event data: CR2, DR6, ... */
+	unsigned long resv;
+};
+
+/* Full format of the FRED stack frame */
+struct fred_frame {
+	struct pt_regs   regs;
+	struct fred_info info;
+};
+
+/* Getting the FRED frame information from a pt_regs pointer */
+static __always_inline struct fred_info *fred_info(struct pt_regs *regs)
+{
+	return &container_of(regs, struct fred_frame, regs)->info;
+}
+
+static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
+{
+	return fred_info(regs)->edata;
+}
+
+/*
+ * How FRED event handlers are called.
+ *
+ * FRED event delivery establishes the full supervisor context
+ * by pushing everything related to the event being delivered
+ * to the FRED stack frame, e.g., the faulting linear address
+ * of a #PF is pushed as event data of the FRED #PF stack frame.
+ * Thus a struct pt_regs has everything needed and it's the only
+ * input parameter required for a FRED event handler.
+ */
+#define DECLARE_FRED_HANDLER(f) void f (struct pt_regs *regs)
+#define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
+typedef DECLARE_FRED_HANDLER((*fred_handler));
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* CONFIG_X86_FRED */
+
+#endif /* ASM_X86_FRED_H */
-- 
2.34.1



* [PATCH v5 15/34] x86/fred: make unions for the cs and ss fields in struct pt_regs
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (13 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 14/34] x86/fred: header file with FRED definitions Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 16/34] x86/fred: reserve space for the FRED stack frame Xin Li
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Make the cs and ss fields in struct pt_regs unions between the actual
selector and the unsigned long stack slot. FRED uses this space to
store additional flags.

The printk changes are needed only because the cs and ss fields change
from unsigned long to unsigned short.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v3:
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
  (Andrew Cooper).
---
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/include/asm/ptrace.h         | 36 ++++++++++++++++++++++++---
 arch/x86/kernel/process_64.c          |  2 +-
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index d234ca797e4a..2429ad0df068 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -76,7 +76,7 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 	if (!show_unhandled_signals)
 		return;
 
-	printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+	printk_ratelimited("%s%s[%d] %s ip:%lx cs:%x sp:%lx ax:%lx si:%lx di:%lx\n",
 			   level, current->comm, task_pid_nr(current),
 			   message, regs->ip, regs->cs,
 			   regs->sp, regs->ax, regs->si, regs->di);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index f4db78b09c8f..a61d860dc33c 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -82,13 +82,41 @@ struct pt_regs {
  * On hw interrupt, it's IRQ number:
  */
 	unsigned long orig_ax;
-/* Return frame for iretq */
+
+	/* Return frame for iretq/eretu/erets */
 	unsigned long ip;
-	unsigned long cs;
+	union {
+		unsigned long  csx;	/* cs extended: CS + any fields above it */
+		struct __attribute__((__packed__)) {
+			unsigned short cs;	/* CS selector proper */
+			unsigned int current_stack_level: 2;
+			unsigned int __csx_resv1	: 6;
+			unsigned int interrupt_shadowed	: 1;
+			unsigned int software_initiated	: 1;
+			unsigned int __csx_resv2	: 2;
+			unsigned int nmi		: 1;
+			unsigned int __csx_resv3	: 3;
+			unsigned int __csx_resv4	: 32;
+		};
+	};
 	unsigned long flags;
 	unsigned long sp;
-	unsigned long ss;
-/* top of stack page */
+	union {
+		unsigned long  ssx;	/* ss extended: SS + any fields above it */
+		struct __attribute__((__packed__)) {
+			unsigned short ss;	/* SS selector proper */
+			unsigned int __ssx_resv1	: 16;
+			unsigned int vector		: 8;
+			unsigned int __ssx_resv2	: 8;
+			unsigned int type		: 4;
+			unsigned int __ssx_resv3	: 4;
+			unsigned int enclv		: 1;
+			unsigned int long_mode		: 1;
+			unsigned int nested		: 1;
+			unsigned int __ssx_resv4	: 1;
+			unsigned int instr_len		: 4;
+		};
+	};
 };
 
 #endif /* !__i386__ */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4e34b3b68ebd..57de166dc61c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -116,7 +116,7 @@ void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,
 
 	printk("%sFS:  %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
 	       log_lvl, fs, fsindex, gs, gsindex, shadowgs);
-	printk("%sCS:  %04lx DS: %04x ES: %04x CR0: %016lx\n",
+	printk("%sCS:  %04x DS: %04x ES: %04x CR0: %016lx\n",
 		log_lvl, regs->cs, ds, es, cr0);
 	printk("%sCR2: %016lx CR3: %016lx CR4: %016lx\n",
 		log_lvl, cr2, cr3, cr4);
-- 
2.34.1



* [PATCH v5 16/34] x86/fred: reserve space for the FRED stack frame
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (14 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 15/34] x86/fred: make unions for the cs and ss fields in struct pt_regs Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 17/34] x86/fred: add a page fault entry stub for FRED Xin Li
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

When using FRED, reserve space at the top of the stack frame, just
like i386 does. A future version of FRED might have dynamic frame
sizes, though, in which case it might be necessary to make
TOP_OF_KERNEL_STACK_PADDING a variable instead of a constant.
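
As a rough illustration (not part of this patch), the padding shifts
where the saved register frame sits below the top of the kernel stack.
The helper name and the frame size passed in by the caller are
hypothetical; only the 2*8-byte padding comes from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Padding this patch reserves on x86-64 when FRED is configured. */
#define SK_FRED_STACK_PADDING (2 * 8)

/*
 * Hypothetical helper: address of the saved register frame, given the
 * address one past the top of the stack, the reserved padding, and the
 * size of the frame itself.
 */
static inline uintptr_t sk_regs_addr(uintptr_t stack_top, size_t padding,
				     size_t frame_size)
{
	return stack_top - padding - frame_size;
}
```

With a padding of 0 the frame ends exactly at the top of the stack;
with the FRED padding it ends two words lower.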

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/thread_info.h | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f1cccba52eb9..998483078d5f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -31,7 +31,9 @@
  * In vm86 mode, the hardware frame is much longer still, so add 16
  * bytes to make room for the real-mode segments.
  *
- * x86_64 has a fixed-length stack frame.
+ * x86-64 has a fixed-length stack frame, but it depends on whether
+ * or not FRED is enabled. Future versions of FRED might make this
+ * dynamic, but for now it is always 2 words longer.
  */
 #ifdef CONFIG_X86_32
 # ifdef CONFIG_VM86
@@ -39,8 +41,12 @@
 # else
 #  define TOP_OF_KERNEL_STACK_PADDING 8
 # endif
-#else
-# define TOP_OF_KERNEL_STACK_PADDING 0
+#else /* x86-64 */
+# ifdef CONFIG_X86_FRED
+#  define TOP_OF_KERNEL_STACK_PADDING (2*8)
+# else
+#  define TOP_OF_KERNEL_STACK_PADDING 0
+# endif
 #endif
 
 /*
-- 
2.34.1



* [PATCH v5 17/34] x86/fred: add a page fault entry stub for FRED
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (15 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 16/34] x86/fred: reserve space for the FRED stack frame Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 18/34] x86/fred: add a debug " Xin Li
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add a page fault entry stub for FRED.

On a FRED system, the faulting address (CR2) is passed on the stack,
to avoid the problem of transient state. Thus we get the page fault
address from the stack instead of CR2.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/fred.h |  2 ++
 arch/x86/mm/fault.c         | 20 ++++++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 2f337162da73..57affbf80ced 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -94,6 +94,8 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
 #define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
 typedef DECLARE_FRED_HANDLER((*fred_handler));
 
+DECLARE_FRED_HANDLER(fred_exc_page_fault);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* CONFIG_X86_FRED */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a498ae1fbe66..0f946121de14 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
 #include <asm/irq_stack.h>
+#include <asm/fred.h>			/* fred_event_data()	*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1507,9 +1508,10 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 	}
 }
 
-DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+static __always_inline void page_fault_common(struct pt_regs *regs,
+					      unsigned int error_code,
+					      unsigned long address)
 {
-	unsigned long address = read_cr2();
 	irqentry_state_t state;
 
 	prefetchw(&current->mm->mmap_lock);
@@ -1556,3 +1558,17 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 
 	irqentry_exit(regs, state);
 }
+
+DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+{
+	page_fault_common(regs, error_code, read_cr2());
+}
+
+#ifdef CONFIG_X86_FRED
+
+DEFINE_FRED_HANDLER(fred_exc_page_fault)
+{
+	page_fault_common(regs, regs->orig_ax, fred_event_data(regs));
+}
+
+#endif /* CONFIG_X86_FRED */
-- 
2.34.1



* [PATCH v5 18/34] x86/fred: add a debug fault entry stub for FRED
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (16 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 17/34] x86/fred: add a page fault entry stub for FRED Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 19/34] x86/fred: add a NMI " Xin Li
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add a debug fault entry stub for FRED.

On a FRED system, the debug trap status information (DR6) is passed
on the stack, to avoid the problem of transient state. Furthermore,
FRED transitions avoid a lot of ugly corner cases the handling of which
can, and should be, skipped.

The FRED debug trap status information saved on the stack differs from DR6
in both stickiness and polarity; it is exactly what debug_read_clear_dr6()
returns, and exc_debug_user()/exc_debug_kernel() expect.
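
To make the polarity remark concrete, here is a small userspace sketch
(not part of this patch) of the flip applied to a raw DR6 value on the
IDT path. The helper name is hypothetical, and the mask 0xFFFF0FF0 and
the single-step bit 0x4000 are assumptions taken from the existing x86
debug register definitions; they are not introduced by this patch.

```c
#include <stdint.h>

/* Assumed: DR6 bits that read as 1 when no condition is pending. */
#define SK_DR6_RESERVED UINT64_C(0xFFFF0FF0)
/* Assumed: single-step status bit. */
#define SK_DR_STEP      UINT64_C(0x4000)

/* Flip a raw DR6 value to positive polarity: a set bit means "hit". */
static inline uint64_t sk_dr6_to_positive(uint64_t raw_dr6)
{
	return raw_dr6 ^ SK_DR6_RESERVED;
}
```

Under FRED no such flip is needed, since the value saved on the stack
is already in the form the common handlers expect.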

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v1:
* call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
  handler (Peter Zijlstra).
---
 arch/x86/include/asm/fred.h |  1 +
 arch/x86/kernel/traps.c     | 56 +++++++++++++++++++++++++++----------
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 57affbf80ced..633dd9e6a68e 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -94,6 +94,7 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
 #define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
 typedef DECLARE_FRED_HANDLER((*fred_handler));
 
+DECLARE_FRED_HANDLER(fred_exc_debug);
 DECLARE_FRED_HANDLER(fred_exc_page_fault);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cebba1f49e19..4b0f63344526 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -47,6 +47,7 @@
 #include <asm/debugreg.h>
 #include <asm/realmode.h>
 #include <asm/text-patching.h>
+#include <asm/fred.h>
 #include <asm/ftrace.h>
 #include <asm/traps.h>
 #include <asm/desc.h>
@@ -1020,21 +1021,9 @@ static bool notify_debug(struct pt_regs *regs, unsigned long *dr6)
 	return false;
 }
 
-static __always_inline void exc_debug_kernel(struct pt_regs *regs,
-					     unsigned long dr6)
+static __always_inline void debug_kernel_common(struct pt_regs *regs,
+						unsigned long dr6)
 {
-	/*
-	 * Disable breakpoints during exception handling; recursive exceptions
-	 * are exceedingly 'fun'.
-	 *
-	 * Since this function is NOKPROBE, and that also applies to
-	 * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
-	 * HW_BREAKPOINT_W on our stack)
-	 *
-	 * Entry text is excluded for HW_BP_X and cpu_entry_area, which
-	 * includes the entry stack is excluded for everything.
-	 */
-	unsigned long dr7 = local_db_save();
 	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
 	instrumentation_begin();
 
@@ -1062,7 +1051,8 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 	 * Catch SYSENTER with TF set and clear DR_STEP. If this hit a
 	 * watchpoint at the same time then that will still be handled.
 	 */
-	if ((dr6 & DR_STEP) && is_sysenter_singlestep(regs))
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    (dr6 & DR_STEP) && is_sysenter_singlestep(regs))
 		dr6 &= ~DR_STEP;
 
 	/*
@@ -1090,7 +1080,25 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 out:
 	instrumentation_end();
 	irqentry_nmi_exit(regs, irq_state);
+}
 
+static __always_inline void exc_debug_kernel(struct pt_regs *regs,
+					     unsigned long dr6)
+{
+	/*
+	 * Disable breakpoints during exception handling; recursive exceptions
+	 * are exceedingly 'fun'.
+	 *
+	 * Since this function is NOKPROBE, and that also applies to
+	 * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
+	 * HW_BREAKPOINT_W on our stack)
+	 *
+	 * Entry text is excluded for HW_BP_X and cpu_entry_area, which
+	 * includes the entry stack is excluded for everything.
+	 */
+	unsigned long dr7 = local_db_save();
+
+	debug_kernel_common(regs, dr6);
 	local_db_restore(dr7);
 }
 
@@ -1179,6 +1187,24 @@ DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
 {
 	exc_debug_user(regs, debug_read_clear_dr6());
 }
+
+# ifdef CONFIG_X86_FRED
+DEFINE_FRED_HANDLER(fred_exc_debug)
+{
+	/*
+	 * The FRED debug information saved onto stack differs from
+	 * DR6 in both stickiness and polarity; it is exactly what
+	 * debug_read_clear_dr6() returns.
+	 */
+	unsigned long dr6 = fred_event_data(regs);
+
+	if (user_mode(regs))
+		exc_debug_user(regs, dr6);
+	else
+		debug_kernel_common(regs, dr6);
+}
+# endif /* CONFIG_X86_FRED */
+
 #else
 /* 32 bit does not have separate entry points. */
 DEFINE_IDTENTRY_RAW(exc_debug)
-- 
2.34.1



* [PATCH v5 19/34] x86/fred: add a NMI entry stub for FRED
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (17 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 18/34] x86/fred: add a debug " Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 20/34] x86/fred: add a machine check " Xin Li
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

On a FRED system, NMIs nest both with themselves and with faults,
transient information is saved into the stack frame, and NMI unblocking
only happens when the stack frame indicates that it should.

Thus, the NMI entry stub for FRED is really quite small...

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/fred.h |  1 +
 arch/x86/kernel/nmi.c       | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 633dd9e6a68e..f928a03082af 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -94,6 +94,7 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
 #define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
 typedef DECLARE_FRED_HANDLER((*fred_handler));
 
+DECLARE_FRED_HANDLER(fred_exc_nmi);
 DECLARE_FRED_HANDLER(fred_exc_debug);
 DECLARE_FRED_HANDLER(fred_exc_page_fault);
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 776f4b1e395b..1deedfd6de69 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -34,6 +34,7 @@
 #include <asm/cache.h>
 #include <asm/nospec-branch.h>
 #include <asm/sev.h>
+#include <asm/fred.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -643,6 +644,33 @@ void nmi_backtrace_stall_check(const struct cpumask *btp)
 
 #endif
 
+#ifdef CONFIG_X86_FRED
+DEFINE_FRED_HANDLER(fred_exc_nmi)
+{
+	/*
+	 * With FRED, CR2 and DR6 are pushed atomically on faults,
+	 * so we don't have to worry about saving and restoring them.
+	 * Breakpoint faults nest, so assume it is OK to leave DR7
+	 * enabled.
+	 */
+	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
+	/*
+	 * VM exits induced by NMIs keep NMI blocked, and we do
+	 * "int $2" to reinject the NMI w/ NMI kept being blocked.
+	 * However "int $2" doesn't set the nmi bit in the FRED
+	 * stack frame, so we explicitly set it to make sure a
+	 * later ERETS will unblock NMI immediately.
+	 */
+	regs->nmi = 1;
+
+	inc_irq_stat(__nmi_count);
+	default_do_nmi(regs);
+
+	irqentry_nmi_exit(regs, irq_state);
+}
+#endif
+
 void stop_nmi(void)
 {
 	ignore_nmis++;
-- 
2.34.1



* [PATCH v5 20/34] x86/fred: add a machine check entry stub for FRED
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (18 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 19/34] x86/fred: add a NMI " Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-20 16:00   ` Peter Zijlstra
  2023-03-07  2:39 ` [PATCH v5 21/34] x86/fred: FRED entry/exit and dispatch code Xin Li
                   ` (14 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

Add a machine check entry stub for FRED.

Unlike with IDT, there is no need to save/restore DR7 in the FRED
machine check handler.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/fred.h    |  1 +
 arch/x86/kernel/cpu/mce/core.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index f928a03082af..54746e8c0a17 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -97,6 +97,7 @@ typedef DECLARE_FRED_HANDLER((*fred_handler));
 DECLARE_FRED_HANDLER(fred_exc_nmi);
 DECLARE_FRED_HANDLER(fred_exc_debug);
 DECLARE_FRED_HANDLER(fred_exc_page_fault);
+DECLARE_FRED_HANDLER(fred_exc_machine_check);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7832a69d170e..26fa7fa44f30 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -52,6 +52,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/fred.h>
 
 #include "internal.h"
 
@@ -2111,6 +2112,16 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 	exc_machine_check_user(regs);
 	local_db_restore(dr7);
 }
+
+#ifdef CONFIG_X86_FRED
+DEFINE_FRED_HANDLER(fred_exc_machine_check)
+{
+	if (user_mode(regs))
+		exc_machine_check_user(regs);
+	else
+		exc_machine_check_kernel(regs);
+}
+#endif
 #else
 /* 32bit unified entry point */
 DEFINE_IDTENTRY_RAW(exc_machine_check)
-- 
2.34.1



* [PATCH v5 21/34] x86/fred: FRED entry/exit and dispatch code
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (19 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 20/34] x86/fred: add a machine check " Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 22/34] x86/fred: FRED initialization code Xin Li
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the code to actually handle kernel and event entry/exit using
FRED. It is split up into two files thus:

- entry_64_fred.S contains the actual entrypoints and exit code, and
  saves and restores registers.
- entry_fred.c contains the two-level event dispatch code for FRED.
  The first-level dispatch is on the event type, and the second-level
  is on the event vector.
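
The two-level dispatch can be sketched in plain userspace C as below.
This is only an illustration of the table-of-tables idea, not the code
in entry_fred.c: every sk_* name and every numeric type/vector value is
hypothetical, and a simple range check stands in for
array_index_nospec().

```c
#include <stddef.h>

enum { SK_H_NONE, SK_H_BAD, SK_H_DIVIDE, SK_H_PAGE_FAULT, SK_H_HW_INTERRUPT };
enum { SK_TYPE_HWINT = 0, SK_TYPE_HWFAULT = 3, SK_TYPE_COUNT = 8 };
enum { SK_VEC_DE = 0, SK_VEC_PF = 14, SK_VEC_COUNT = 32 };

struct sk_event {
	unsigned int type;	/* first-level key */
	unsigned int vector;	/* second-level key */
	int handled_by;		/* records which handler ran */
};

typedef void (*sk_handler)(struct sk_event *ev);

static void sk_bad_event(struct sk_event *ev)    { ev->handled_by = SK_H_BAD; }
static void sk_divide_error(struct sk_event *ev) { ev->handled_by = SK_H_DIVIDE; }
static void sk_page_fault(struct sk_event *ev)   { ev->handled_by = SK_H_PAGE_FAULT; }
static void sk_hw_interrupt(struct sk_event *ev) { ev->handled_by = SK_H_HW_INTERRUPT; }

/* Second level: pick a handler by vector; unset slots are bad events. */
static void sk_exception(struct sk_event *ev)
{
	static const sk_handler by_vector[SK_VEC_COUNT] = {
		[SK_VEC_DE] = sk_divide_error,
		[SK_VEC_PF] = sk_page_fault,
	};
	sk_handler h = ev->vector < SK_VEC_COUNT ? by_vector[ev->vector] : NULL;

	(h ? h : sk_bad_event)(ev);
}

/* First level: pick a handler by event type. */
static void sk_dispatch(struct sk_event *ev)
{
	static const sk_handler by_type[SK_TYPE_COUNT] = {
		[SK_TYPE_HWINT]   = sk_hw_interrupt,
		[SK_TYPE_HWFAULT] = sk_exception,
	};
	sk_handler h = ev->type < SK_TYPE_COUNT ? by_type[ev->type] : NULL;

	(h ? h : sk_bad_event)(ev);
}
```

The sketch falls back to a bad-event handler for empty slots at lookup
time; the patch instead fills unused table entries with
fred_bad_event() directly.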

Originally-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v1:
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
  if no FRED handler defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
  down into individual FRED exception handlers, instead of in the dispatch
  framework (Peter Zijlstra).
---
 arch/x86/entry/Makefile         |   5 +-
 arch/x86/entry/entry_64_fred.S  |  55 ++++++++
 arch/x86/entry/entry_fred.c     | 232 ++++++++++++++++++++++++++++++++
 arch/x86/include/asm/idtentry.h |   8 ++
 4 files changed, 299 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/entry/entry_64_fred.S
 create mode 100644 arch/x86/entry/entry_fred.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index ca2fe186994b..c93e7f5c2a06 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -18,6 +18,9 @@ obj-y				+= vdso/
 obj-y				+= vsyscall/
 
 obj-$(CONFIG_PREEMPTION)	+= thunk_$(BITS).o
+CFLAGS_entry_fred.o		+= -fno-stack-protector
+CFLAGS_REMOVE_entry_fred.o	+= -pg $(CC_FLAGS_FTRACE)
+obj-$(CONFIG_X86_FRED)		+= entry_64_fred.o entry_fred.o
+
 obj-$(CONFIG_IA32_EMULATION)	+= entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI)	+= syscall_x32.o
-
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
new file mode 100644
index 000000000000..1fb765fd3871
--- /dev/null
+++ b/arch/x86/entry/entry_64_fred.S
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  arch/x86/entry/entry_64_fred.S
+ *
+ * The actual FRED entry points.
+ */
+#include <linux/linkage.h>
+#include <asm/errno.h>
+#include <asm/asm-offsets.h>
+#include <asm/fred.h>
+
+#include "calling.h"
+
+	.code64
+	.section ".noinstr.text", "ax"
+
+.macro FRED_ENTER
+	UNWIND_HINT_EMPTY
+	PUSH_AND_CLEAR_REGS
+	movq	%rsp, %rdi	/* %rdi -> pt_regs */
+.endm
+
+.macro FRED_EXIT
+	UNWIND_HINT_REGS
+	POP_REGS
+	addq $8,%rsp		/* Drop error code */
+.endm
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
+ * Thus the FRED ring 3 entry point must be 4K page aligned.
+ */
+	.align 4096
+
+SYM_CODE_START_NOALIGN(fred_entrypoint_user)
+	FRED_ENTER
+	call	fred_entry_from_user
+SYM_INNER_LABEL(fred_exit_user, SYM_L_GLOBAL)
+	FRED_EXIT
+	ERETU
+SYM_CODE_END(fred_entrypoint_user)
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * (IA32_FRED_CONFIG & ~FFFH) + 256 for events that occur in
+ * ring 0, i.e., fred_entrypoint_user + 256.
+ */
+	.org fred_entrypoint_user+256
+SYM_CODE_START_NOALIGN(fred_entrypoint_kernel)
+	FRED_ENTER
+	call	fred_entry_from_kernel
+	FRED_EXIT
+	ERETS
+SYM_CODE_END(fred_entrypoint_kernel)
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
new file mode 100644
index 000000000000..8d3e144670d6
--- /dev/null
+++ b/arch/x86/entry/entry_fred.c
@@ -0,0 +1,232 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * arch/x86/entry/entry_fred.c
+ *
+ * This contains the dispatch functions called from the entry point
+ * assembly.
+ */
+
+#include <linux/kernel.h>
+#include <linux/kdebug.h>		/* oops_begin/end, ...		*/
+#include <linux/nospec.h>
+#include <asm/event-type.h>
+#include <asm/fred.h>
+#include <asm/idtentry.h>
+#include <asm/syscall.h>
+#include <asm/trapnr.h>
+#include <asm/traps.h>
+#include <asm/kdebug.h>
+
+/*
+ * Badness...
+ */
+static DEFINE_FRED_HANDLER(fred_bad_event)
+{
+	irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
+	instrumentation_begin();
+
+	/* Panic on events from a high stack level */
+	if (regs->current_stack_level > 0) {
+		pr_emerg("PANIC: invalid or fatal FRED event; event type %u "
+			 "vector %u error 0x%lx aux 0x%lx at %04x:%016lx\n",
+			 regs->type, regs->vector, regs->orig_ax,
+			 fred_event_data(regs), regs->cs, regs->ip);
+		die("invalid or fatal FRED event", regs, regs->orig_ax);
+		panic("invalid or fatal FRED event");
+	} else {
+		unsigned long flags = oops_begin();
+		int sig = SIGKILL;
+
+		pr_alert("BUG: invalid or fatal FRED event; event type %u "
+			 "vector %u error 0x%lx aux 0x%lx at %04x:%016lx\n",
+			 regs->type, regs->vector, regs->orig_ax,
+			 fred_event_data(regs), regs->cs, regs->ip);
+
+		if (__die("Invalid or fatal FRED event", regs, regs->orig_ax))
+			sig = 0;
+
+		oops_end(flags, regs, sig);
+	}
+
+	instrumentation_end();
+	irqentry_nmi_exit(regs, irq_state);
+}
+
+noinstr void fred_exc_double_fault(struct pt_regs *regs)
+{
+	exc_double_fault(regs, regs->orig_ax);
+}
+
+/*
+ * Exception entry
+ */
+static DEFINE_FRED_HANDLER(fred_exception)
+{
+	/*
+	 * Exceptions that cannot happen on FRED h/w are set to fred_bad_event().
+	 */
+	static const fred_handler exception_handlers[NUM_EXCEPTION_VECTORS] = {
+		[X86_TRAP_DE] = exc_divide_error,
+		[X86_TRAP_DB] = fred_exc_debug,
+		[X86_TRAP_NMI] = fred_bad_event, /* A separate event type, not handled here */
+		[X86_TRAP_BP] = exc_int3,
+		[X86_TRAP_OF] = exc_overflow,
+		[X86_TRAP_BR] = exc_bounds,
+		[X86_TRAP_UD] = exc_invalid_op,
+		[X86_TRAP_NM] = exc_device_not_available,
+		[X86_TRAP_DF] = fred_exc_double_fault,
+		[X86_TRAP_OLD_MF] = fred_bad_event, /* 387 only! */
+		[X86_TRAP_TS] = fred_exc_invalid_tss,
+		[X86_TRAP_NP] = fred_exc_segment_not_present,
+		[X86_TRAP_SS] = fred_exc_stack_segment,
+		[X86_TRAP_GP] = fred_exc_general_protection,
+		[X86_TRAP_PF] = fred_exc_page_fault,
+		[X86_TRAP_SPURIOUS] = fred_bad_event, /* Interrupts are their own event type */
+		[X86_TRAP_MF] = exc_coprocessor_error,
+		[X86_TRAP_AC] = fred_exc_alignment_check,
+		[X86_TRAP_MC] = fred_exc_machine_check,
+		[X86_TRAP_XF] = exc_simd_coprocessor_error,
+		[X86_TRAP_VE...NUM_EXCEPTION_VECTORS-1] = fred_bad_event
+	};
+	u8 vector = array_index_nospec((u8)regs->vector, NUM_EXCEPTION_VECTORS);
+
+	exception_handlers[vector](regs);
+}
+
+static __always_inline void fred_emulate_trap(struct pt_regs *regs)
+{
+	regs->type = EVENT_TYPE_SWFAULT;
+	regs->orig_ax = 0;
+	fred_exception(regs);
+}
+
+static __always_inline void fred_emulate_fault(struct pt_regs *regs)
+{
+	regs->ip -= regs->instr_len;
+	fred_emulate_trap(regs);
+}
+
+/*
+ * Emulate SYSENTER if applicable. This is not the preferred system
+ * call in 32-bit mode under FRED, rather int $0x80 is preferred and
+ * exported in the vdso. SYSCALL proper has a hard-coded early out in
+ * fred_entry_from_user().
+ */
+static DEFINE_FRED_HANDLER(fred_syscall_slow)
+{
+	if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+	    likely(regs->vector == FRED_SYSENTER)) {
+		/* Convert frame to a syscall frame */
+		regs->orig_ax = regs->ax;
+		regs->ax = -ENOSYS;
+		do_fast_syscall_32(regs);
+	} else {
+		regs->vector = X86_TRAP_UD;
+		fred_emulate_fault(regs);
+	}
+}
+
+/*
+ * Some software exceptions can also be triggered as int instructions,
+ * for historical reasons. Implement those here. The performance-critical
+ * int $0x80 (32-bit system call) has a hard-coded early out.
+ */
+static DEFINE_FRED_HANDLER(fred_sw_interrupt_user)
+{
+	if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+	    likely(regs->vector == IA32_SYSCALL_VECTOR)) {
+		/* Convert frame to a syscall frame */
+		regs->orig_ax = regs->ax;
+		regs->ax = -ENOSYS;
+		return do_int80_syscall_32(regs);
+	}
+
+	switch (regs->vector) {
+	case X86_TRAP_BP:
+	case X86_TRAP_OF:
+		fred_emulate_trap(regs);
+		break;
+	default:
+		regs->vector = X86_TRAP_GP;
+		fred_emulate_fault(regs);
+		break;
+	}
+}
+
+static DEFINE_FRED_HANDLER(fred_hw_interrupt)
+{
+	irqentry_state_t state = irqentry_enter(regs);
+
+	instrumentation_begin();
+	external_interrupt(regs, regs->vector);
+	instrumentation_end();
+	irqentry_exit(regs, state);
+}
+
+__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
+{
+	static const fred_handler user_handlers[FRED_EVENT_TYPE_COUNT] =
+	{
+		[EVENT_TYPE_HWINT]	= fred_hw_interrupt,
+		[EVENT_TYPE_RESERVED]	= fred_bad_event,
+		[EVENT_TYPE_NMI]	= fred_exc_nmi,
+		[EVENT_TYPE_SWINT]	= fred_sw_interrupt_user,
+		[EVENT_TYPE_HWFAULT]	= fred_exception,
+		[EVENT_TYPE_SWFAULT]	= fred_exception,
+		[EVENT_TYPE_PRIVSW]	= fred_exception,
+		[EVENT_TYPE_OTHER]	= fred_syscall_slow
+	};
+
+	/*
+	 * FRED employs a two-level event dispatch mechanism, with
+	 * the first-level on the type of an event and the second-level
+	 * on its vector. Thus a dispatch typically induces 2 calls.
+	 * We optimize it by using early outs for the most frequent
+	 * events, and syscalls are the first. We may also need early
+	 * outs for page faults.
+	 */
+	if (likely(regs->type == EVENT_TYPE_OTHER &&
+		   regs->vector == FRED_SYSCALL)) {
+		/* Convert frame to a syscall frame */
+		regs->orig_ax = regs->ax;
+		regs->ax = -ENOSYS;
+		do_syscall_64(regs, regs->orig_ax);
+	} else {
+		/* Not a system call */
+		u8 type = array_index_nospec((u8)regs->type, FRED_EVENT_TYPE_COUNT);
+
+		user_handlers[type](regs);
+	}
+}
+
+static DEFINE_FRED_HANDLER(fred_sw_interrupt_kernel)
+{
+	switch (regs->vector) {
+	case X86_TRAP_NMI:
+		fred_exc_nmi(regs);
+		break;
+	default:
+		fred_bad_event(regs);
+		break;
+	}
+}
+
+__visible noinstr void fred_entry_from_kernel(struct pt_regs *regs)
+{
+	static const fred_handler kernel_handlers[FRED_EVENT_TYPE_COUNT] =
+	{
+		[EVENT_TYPE_HWINT]	= fred_hw_interrupt,
+		[EVENT_TYPE_RESERVED]	= fred_bad_event,
+		[EVENT_TYPE_NMI]	= fred_exc_nmi,
+		[EVENT_TYPE_SWINT]	= fred_sw_interrupt_kernel,
+		[EVENT_TYPE_HWFAULT]	= fred_exception,
+		[EVENT_TYPE_SWFAULT]	= fred_exception,
+		[EVENT_TYPE_PRIVSW]	= fred_exception,
+		[EVENT_TYPE_OTHER]	= fred_bad_event
+	};
+	u8 type = array_index_nospec((u8)regs->type, FRED_EVENT_TYPE_COUNT);
+
+	/* The pt_regs frame on entry here is an exception frame */
+	kernel_handlers[type](regs);
+}
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 2876ddae02bc..bd43866f9c3e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -82,6 +82,7 @@ static __always_inline void __##func(struct pt_regs *regs)
 #define DECLARE_IDTENTRY_ERRORCODE(vector, func)			\
 	asmlinkage void asm_##func(void);				\
 	asmlinkage void xen_asm_##func(void);				\
+	__visible void fred_##func(struct pt_regs *regs);		\
 	__visible void func(struct pt_regs *regs, unsigned long error_code)
 
 /**
@@ -106,6 +107,11 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	irqentry_exit(regs, state);					\
 }									\
 									\
+__visible noinstr void fred_##func(struct pt_regs *regs)		\
+{									\
+	func (regs, regs->orig_ax);					\
+}									\
+									\
 static __always_inline void __##func(struct pt_regs *regs,		\
 				     unsigned long error_code)
 
@@ -622,6 +628,8 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC,	exc_machine_check);
 #ifdef CONFIG_XEN_PV
 DECLARE_IDTENTRY_RAW(X86_TRAP_MC,	xenpv_exc_machine_check);
 #endif
+#else
+#define fred_exc_machine_check		fred_bad_event
 #endif
 
 /* NMI */
-- 
2.34.1



* [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (20 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 21/34] x86/fred: FRED entry/exit and dispatch code Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-17 13:35   ` Lai Jiangshan
  2023-03-07  2:39 ` [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch Xin Li
                   ` (12 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Add the code to initialize FRED when it's available and _not_ disabled.

cpu_init_fred_exceptions() is the core function to initialize FRED,
which
  1. Sets up FRED entrypoints for events happening in ring 0 and 3.
  2. Sets up a default stack for event handling.
  3. Sets up dedicated event stacks for DB/NMI/MC/DF, equivalent to
     the IDT IST stacks.
  4. Forces 32-bit system calls to use "int $0x80" only.
  5. Enables FRED and invalidates IDT.
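
For step 1, the comments in entry_64_fred.S from the previous patch
state that ring 3 events enter at IA32_FRED_CONFIG & ~FFFH and ring 0
events enter 256 bytes above that. A minimal userspace sketch of that
arithmetic follows; it is not part of this patch, and the helper names
are hypothetical.

```c
#include <stdint.h>

/* Ring 3 entry point: the config value with its low 12 bits cleared. */
static inline uint64_t sk_fred_user_entry(uint64_t fred_config)
{
	return fred_config & ~UINT64_C(0xFFF);
}

/* Ring 0 entry point: 256 bytes above the ring 3 entry point. */
static inline uint64_t sk_fred_kernel_entry(uint64_t fred_config)
{
	return sk_fred_user_entry(fred_config) + 256;
}
```

This is why the ring 3 entry point has to be 4K aligned and the ring 0
entry point is placed with .org at a fixed offset from it.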

When FRED is used, cpu_init_exception_handling() initializes FRED
by calling cpu_init_fred_exceptions(); otherwise it sets up the TSS
IST and loads the IDT.

As FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER,
it skips setting up SYSCALL/SYSENTER related MSRs, e.g., MSR_LSTAR.
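
As an illustration of item 3 in the list above, here is a minimal
user-space sketch of how the per-vector stack levels might be packed into
the value written to MSR_IA32_FRED_STKLVLS. It assumes two bits per
exception vector (vector v in bits 2v+1:2v); the helper names stklvl(),
stklvl_of() and fred_stklvls_value() are hypothetical stand-ins, since the
definition of FRED_STKLVL() is not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86 exception vector numbers. */
enum { TRAP_DB = 1, TRAP_NMI = 2, TRAP_DF = 8, TRAP_MC = 18 };

/* Hypothetical stand-in for FRED_STKLVL(): assumes 2 bits per vector. */
uint64_t stklvl(unsigned int vector, uint64_t level)
{
	return level << (2 * vector);
}

/* Same stack-level assignment as cpu_init_fred_exceptions() in this patch. */
uint64_t fred_stklvls_value(void)
{
	return stklvl(TRAP_DB, 1) | stklvl(TRAP_NMI, 2) |
	       stklvl(TRAP_MC, 2) | stklvl(TRAP_DF, 3);
}

/* Read back the stack level configured for one vector. */
unsigned int stklvl_of(uint64_t msr, unsigned int vector)
{
	return (unsigned int)((msr >> (2 * vector)) & 3);
}

void stklvls_selftest(void)
{
	uint64_t msr = fred_stklvls_value();

	assert(stklvl_of(msr, TRAP_NMI) == stklvl_of(msr, TRAP_MC));
	assert(stklvl_of(msr, 14) == 0);	/* #PF stays on level 0 */
}
```

Under that assumption NMI and MC share stack level 2, which matches the
patch programming only three RSP MSRs (RSP1 to RSP3) for four vectors.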

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/fred.h  | 14 +++++++
 arch/x86/include/asm/traps.h |  2 +
 arch/x86/kernel/Makefile     |  1 +
 arch/x86/kernel/cpu/common.c | 74 +++++++++++++++++++++++-------------
 arch/x86/kernel/fred.c       | 73 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/irqinit.c    |  7 +++-
 arch/x86/kernel/traps.c      | 16 +++++++-
 7 files changed, 157 insertions(+), 30 deletions(-)
 create mode 100644 arch/x86/kernel/fred.c

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 54746e8c0a17..cd974edc8e8a 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -99,8 +99,22 @@ DECLARE_FRED_HANDLER(fred_exc_debug);
 DECLARE_FRED_HANDLER(fred_exc_page_fault);
 DECLARE_FRED_HANDLER(fred_exc_machine_check);
 
+/*
+ * The actual assembly entry and exit points
+ */
+extern __visible void fred_entrypoint_user(void);
+
+/*
+ * Initialization
+ */
+void cpu_init_fred_exceptions(void);
+void fred_setup_apic(void);
+
 #endif /* __ASSEMBLY__ */
 
+#else
+#define cpu_init_fred_exceptions() BUG()
+#define fred_setup_apic() BUG()
 #endif /* CONFIG_X86_FRED */
 
 #endif /* ASM_X86_FRED_H */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index da4c21ed68b4..69fafef1136e 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -56,6 +56,8 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 	void f (struct pt_regs *regs)
 typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
 
+system_interrupt_handler get_system_interrupt_handler(unsigned int i);
+
 int external_interrupt(struct pt_regs *regs, unsigned int vector);
 
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..08d9c0a0bfbe 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -47,6 +47,7 @@ obj-y			+= platform-quirks.o
 obj-y			+= process_$(BITS).o signal.o signal_$(BITS).o
 obj-y			+= traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_X86_FRED)	+= fred.o
 obj-$(CONFIG_MODIFY_LDT_SYSCALL)	+= ldt.o
 obj-y			+= setup.o x86_init.o i8259.o irqinit.o
 obj-$(CONFIG_JUMP_LABEL)	+= jump_label.o
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e8cf6f4cfb52..eea41cb8722e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -58,6 +58,7 @@
 #include <asm/microcode_intel.h>
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
+#include <asm/fred.h>
 #include <asm/uv/uv.h>
 #include <asm/sigframe.h>
 #include <asm/traps.h>
@@ -2054,28 +2055,6 @@ static void wrmsrl_cstar(unsigned long val)
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
-
-#ifdef CONFIG_IA32_EMULATION
-	wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
-	/*
-	 * This only works on Intel CPUs.
-	 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
-	 * This does not cause SYSENTER to jump to the wrong location, because
-	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
-	 */
-	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
-	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
-#else
-	wrmsrl_cstar((unsigned long)ignore_sysret);
-	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
-	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
-#endif
-
 	/*
 	 * Flags to clear on syscall; clear as much as possible
 	 * to minimize user space-kernel interference.
@@ -2086,6 +2065,41 @@ void syscall_init(void)
 	       X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
 	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
 	       X86_EFLAGS_AC|X86_EFLAGS_ID);
+
+	/*
+	 * The default user and kernel segments
+	 */
+	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		/* Both sysexit and sysret cause #UD when FRED is enabled */
+		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+		wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+	} else {
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+
+#ifdef CONFIG_IA32_EMULATION
+		wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
+		/*
+		 * This only works on Intel CPUs.
+		 * On AMD CPUs these MSRs are 32-bit, CPU truncates
+		 * MSR_IA32_SYSENTER_EIP.
+		 * This does not cause SYSENTER to jump to the wrong
+		 * location, because AMD doesn't allow SYSENTER in
+		 * long mode (either 32- or 64-bit).
+		 */
+		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
+		wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+			    (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
+		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
+#else
+		wrmsrl_cstar((unsigned long)ignore_sysret);
+		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+		wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+#endif
+	}
 }
 
 #else	/* CONFIG_X86_64 */
@@ -2218,18 +2232,24 @@ void cpu_init_exception_handling(void)
 	/* paranoid_entry() gets the CPU number from the GDT */
 	setup_getcpu(cpu);
 
-	/* IST vectors need TSS to be set up. */
-	tss_setup_ist(tss);
+	/* Set up the TSS */
 	tss_setup_io_bitmap(tss);
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
-
 	load_TR_desc();
 
 	/* GHCB needs to be setup to handle #VC. */
 	setup_ghcb();
 
-	/* Finally load the IDT */
-	load_current_idt();
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		/* Set up FRED exception handling */
+		cpu_init_fred_exceptions();
+	} else {
+		/* IST vectors need TSS to be set up. */
+		tss_setup_ist(tss);
+
+		/* Finally load the IDT */
+		load_current_idt();
+	}
 }
 
 /*
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
new file mode 100644
index 000000000000..827b58fd98d4
--- /dev/null
+++ b/arch/x86/kernel/fred.c
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/kernel.h>
+#include <asm/desc.h>
+#include <asm/fred.h>
+#include <asm/tlbflush.h>	/* For cr4_set_bits() */
+#include <asm/traps.h>
+
+/*
+ * Initialize FRED on this CPU. This cannot be __init as it is called
+ * during CPU hotplug.
+ */
+void cpu_init_fred_exceptions(void)
+{
+	wrmsrl(MSR_IA32_FRED_CONFIG,
+	       FRED_CONFIG_ENTRYPOINT(fred_entrypoint_user) |
+	       FRED_CONFIG_REDZONE(8) | /* Reserve for CALL emulation */
+	       FRED_CONFIG_INT_STKLVL(0));
+
+	wrmsrl(MSR_IA32_FRED_STKLVLS,
+	       FRED_STKLVL(X86_TRAP_DB,  1) |
+	       FRED_STKLVL(X86_TRAP_NMI, 2) |
+	       FRED_STKLVL(X86_TRAP_MC,  2) |
+	       FRED_STKLVL(X86_TRAP_DF,  3));
+
+	/* The FRED equivalents to IST stacks... */
+	wrmsrl(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
+	wrmsrl(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
+	wrmsrl(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+
+	/* Not used with FRED */
+	wrmsrl(MSR_LSTAR, 0ULL);
+	wrmsrl(MSR_CSTAR, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_CS,  (u64)GDT_ENTRY_INVALID_SEG);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+
+	/* Enable FRED */
+	cr4_set_bits(X86_CR4_FRED);
+	idt_invalidate();	/* Any further IDT use is a bug */
+
+	/* Use int $0x80 for 32-bit system calls in FRED mode */
+	setup_clear_cpu_cap(X86_FEATURE_SYSENTER32);
+	setup_clear_cpu_cap(X86_FEATURE_SYSCALL32);
+}
+
+/*
+ * Initialize system vectors from a FRED perspective, so
+ * lapic_assign_system_vectors() can do its job.
+ */
+void __init fred_setup_apic(void)
+{
+	int i;
+
+	for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
+		set_bit(i, system_vectors);
+
+	/*
+	 * Don't set the non assigned system vectors in the
+	 * system_vectors bitmap. Otherwise they show up in
+	 * /proc/interrupts.
+	 */
+#ifdef CONFIG_SMP
+	set_bit(IRQ_MOVE_CLEANUP_VECTOR, system_vectors);
+#endif
+
+	for (i = 0; i < NR_SYSTEM_VECTORS; i++) {
+		if (get_system_interrupt_handler(i) != NULL) {
+			set_bit(i + FIRST_SYSTEM_VECTOR, system_vectors);
+		}
+	}
+
+	/* The rest are fair game... */
+}
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index c683666876f1..2a510f72dd11 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -28,6 +28,7 @@
 #include <asm/setup.h>
 #include <asm/i8259.h>
 #include <asm/traps.h>
+#include <asm/fred.h>
 #include <asm/prom.h>
 
 /*
@@ -96,7 +97,11 @@ void __init native_init_IRQ(void)
 	/* Execute any quirks before the call gates are initialised: */
 	x86_init.irqs.pre_vector_init();
 
-	idt_setup_apic_and_irq_gates();
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		fred_setup_apic();
+	else
+		idt_setup_apic_and_irq_gates();
+
 	lapic_assign_system_vectors();
 
 	if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4b0f63344526..c7253b4901f0 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1517,12 +1517,21 @@ static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {
 
 #undef SYSV
 
+system_interrupt_handler get_system_interrupt_handler(unsigned int i)
+{
+	if (i >= NR_SYSTEM_VECTORS)
+		return NULL;
+
+	return system_interrupt_handlers[i];
+}
+
 void __init install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr)
 {
 	BUG_ON(n < FIRST_SYSTEM_VECTOR);
 
 	system_interrupt_handlers[n - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)addr;
-	alloc_intr_gate(n, asm_addr);
+	if (!cpu_feature_enabled(X86_FEATURE_FRED))
+		alloc_intr_gate(n, asm_addr);
 }
 
 #ifndef CONFIG_X86_LOCAL_APIC
@@ -1590,7 +1599,10 @@ void __init trap_init(void)
 
 	/* Initialize TSS before setting up traps so ISTs work */
 	cpu_init_exception_handling();
+
 	/* Setup traps as cpu_init() might #GP */
-	idt_setup_traps();
+	if (!cpu_feature_enabled(X86_FEATURE_FRED))
+		idt_setup_traps();
+
 	cpu_init();
 }
-- 
2.34.1



* [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (21 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 22/34] x86/fred: FRED initialization code Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-20 16:52   ` Peter Zijlstra
  2023-03-07  2:39 ` [PATCH v5 24/34] x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is enabled Xin Li
                   ` (11 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

MSR_IA32_FRED_RSP0 is used during ring 3 event delivery, and needs to
be updated to point to the top of the next task's stack during a task
switch.

Update MSR_IA32_FRED_RSP0 with the WRMSR instruction for now; switch to
WRMSRNS/WRMSRLIST for performance once support for them is upstreamed.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/switch_to.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 5c91305d09d2..00fd85abc1d2 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -68,9 +68,16 @@ static inline void update_task_stack(struct task_struct *task)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
 #else
-	/* Xen PV enters the kernel on the thread stack. */
-	if (cpu_feature_enabled(X86_FEATURE_XENPV))
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		/*
+		 * Will use WRMSRNS/WRMSRLIST for performance once it's upstreamed.
+		 */
+		wrmsrl(MSR_IA32_FRED_RSP0,
+		       task_top_of_stack(task) + TOP_OF_KERNEL_STACK_PADDING);
+	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+		/* Xen PV enters the kernel on the thread stack. */
 		load_sp0(task_top_of_stack(task));
+	}
 #endif
 }
 
-- 
2.34.1



* [PATCH v5 24/34] x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is enabled
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (22 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 25/34] x86/fred: disallow the swapgs instruction " Xin Li
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Let ret_from_fork() jmp to fred_exit_user when FRED is enabled,
otherwise the existing IDT code is chosen.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/entry/entry_64.S | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index eccc3431e515..5b595a9b2ffb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -299,7 +299,12 @@ SYM_CODE_START_NOALIGN(ret_from_fork)
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
 	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
+#ifdef CONFIG_X86_FRED
+	ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
+		    "jmp fred_exit_user", X86_FEATURE_FRED
+#else
 	jmp	swapgs_restore_regs_and_return_to_usermode
+#endif
 
 1:
 	/* kernel thread */
-- 
2.34.1



* [PATCH v5 25/34] x86/fred: disallow the swapgs instruction when FRED is enabled
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (23 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 24/34] x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is enabled Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-20 16:54   ` Peter Zijlstra
  2023-03-07  2:39 ` [PATCH v5 26/34] x86/fred: no ESPFIX needed " Xin Li
                   ` (9 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

The FRED architecture establishes the full supervisor/user context through:
1) FRED event delivery swaps the value of the GS base address and
   that of the IA32_KERNEL_GS_BASE MSR.
2) ERETU swaps the value of the GS base address and that of the
   IA32_KERNEL_GS_BASE MSR.
Thus the swapgs instruction is disallowed when FRED is enabled;
executing it causes #UD.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/process_64.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 57de166dc61c..ff6594dbea4a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -165,7 +165,8 @@ static noinstr unsigned long __rdgsbase_inactive(void)
 
 	lockdep_assert_irqs_disabled();
 
-	if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    !cpu_feature_enabled(X86_FEATURE_XENPV)) {
 		native_swapgs();
 		gsbase = rdgsbase();
 		native_swapgs();
@@ -190,7 +191,8 @@ static noinstr void __wrgsbase_inactive(unsigned long gsbase)
 {
 	lockdep_assert_irqs_disabled();
 
-	if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    !cpu_feature_enabled(X86_FEATURE_XENPV)) {
 		native_swapgs();
 		wrgsbase(gsbase);
 		native_swapgs();
-- 
2.34.1



* [PATCH v5 26/34] x86/fred: no ESPFIX needed when FRED is enabled
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (24 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 25/34] x86/fred: disallow the swapgs instruction " Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 27/34] x86/fred: allow single-step trap and NMI when starting a new thread Xin Li
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Because FRED always restores the full value of %rsp, ESPFIX is
no longer needed when it's enabled.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/espfix_64.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 16f9814c9be0..48d133a54f45 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -106,6 +106,10 @@ void __init init_espfix_bsp(void)
 	pgd_t *pgd;
 	p4d_t *p4d;
 
+	/* FRED systems don't need ESPFIX */
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		return;
+
 	/* Install the espfix pud into the kernel page directory */
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
@@ -129,6 +133,10 @@ void init_espfix_ap(int cpu)
 	void *stack_page;
 	pteval_t ptemask;
 
+	/* FRED systems don't need ESPFIX */
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		return;
+
 	/* We only have to do this once... */
 	if (likely(per_cpu(espfix_stack, cpu)))
 		return;		/* Already initialized */
-- 
2.34.1



* [PATCH v5 27/34] x86/fred: allow single-step trap and NMI when starting a new thread
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (25 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 26/34] x86/fred: no ESPFIX needed " Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Allow single-step trap and NMI when starting a new thread, thus once
the new thread returns to ring3, single-step trap and NMI are both
enabled immediately.

High-order 48 bits above the lowest 16 bit CS are discarded by the
legacy IRET instruction, thus can be set unconditionally, even when
FRED is not enabled.
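
To make the argument above concrete, the following user-space sketch
models the extended CS slot as a 64-bit value whose low 16 bits hold the
selector. The single-step bit position (25) is taken from the context
lines of this patch; the NMI bit position (28) is an assumption, as the
definition of FRED_CSL_ENABLE_NMI is not shown here. make_csx() and
iret_visible_cs() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define CSL_ALLOW_SINGLE_STEP	(UINT64_C(1) << 25)	/* as in this patch */
#define CSL_ENABLE_NMI		(UINT64_C(1) << 28)	/* assumed position */
#define CSL_PROCESS_START	(CSL_ENABLE_NMI | CSL_ALLOW_SINGLE_STEP)

/* Build the 64-bit CS slot the way start_thread_common() does. */
uint64_t make_csx(uint16_t cs)
{
	return (uint64_t)cs | CSL_PROCESS_START;
}

/* Legacy IRET only consumes the low 16 bits of the slot. */
uint16_t iret_visible_cs(uint64_t csx)
{
	return (uint16_t)csx;
}

void csx_selftest(void)
{
	/* The flag bits never overlap the selector itself. */
	assert((CSL_PROCESS_START & UINT64_C(0xffff)) == 0);
	assert(iret_visible_cs(make_csx(0x33)) == 0x33);
}
```

Because the flags live entirely above bit 15, the selector that IRET sees
is unchanged, which is why the patch can set them unconditionally.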

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/fred.h  | 11 +++++++++++
 arch/x86/kernel/process_64.c | 13 +++++++------
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index cd974edc8e8a..12449448e9bf 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -52,6 +52,14 @@
 #define FRED_CSL_ALLOW_SINGLE_STEP	_BITUL(25)
 #define FRED_CSL_INTERRUPT_SHADOW	_BITUL(24)
 
+/*
+ * High-order 48 bits above the lowest 16 bit CS are discarded by the
+ * legacy IRET instruction, thus can be set unconditionally, even when
+ * FRED is not enabled.
+ */
+#define CSL_PROCESS_START \
+	(FRED_CSL_ENABLE_NMI | FRED_CSL_ALLOW_SINGLE_STEP)
+
 #ifndef __ASSEMBLY__
 
 #include <linux/kernel.h>
@@ -115,6 +123,9 @@ void fred_setup_apic(void);
 #else
 #define cpu_init_fred_exceptions() BUG()
 #define fred_setup_apic() BUG()
+
+#define CSL_PROCESS_START 0
+
 #endif /* CONFIG_X86_FRED */
 
 #endif /* ASM_X86_FRED_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ff6594dbea4a..b23850352e7d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -55,6 +55,7 @@
 #include <asm/resctrl.h>
 #include <asm/unistd.h>
 #include <asm/fsgsbase.h>
+#include <asm/fred.h>
 #ifdef CONFIG_IA32_EMULATION
 /* Not included via unistd.h */
 #include <asm/unistd_32_ia32.h>
@@ -506,7 +507,7 @@ void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase)
 static void
 start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		    unsigned long new_sp,
-		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
+		    u16 _cs, u16 _ss, u16 _ds)
 {
 	WARN_ON_ONCE(regs != current_pt_regs());
 
@@ -521,11 +522,11 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 	loadsegment(ds, _ds);
 	load_gs_index(0);
 
-	regs->ip		= new_ip;
-	regs->sp		= new_sp;
-	regs->cs		= _cs;
-	regs->ss		= _ss;
-	regs->flags		= X86_EFLAGS_IF;
+	regs->ip	= new_ip;
+	regs->sp	= new_sp;
+	regs->csx	= _cs | CSL_PROCESS_START;
+	regs->ssx	= _ss;
+	regs->flags	= X86_EFLAGS_IF | X86_EFLAGS_FIXED;
 }
 
 void
-- 
2.34.1



* [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (26 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 27/34] x86/fred: allow single-step trap and NMI when starting a new thread Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-17  9:39   ` Lai Jiangshan
  2023-03-07  2:39 ` [PATCH v5 29/34] x86/ia32: do not modify the DPL bits for a null selector Xin Li
                   ` (6 subsequent siblings)
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

If the stack frame contains an invalid user context (e.g. due to invalid SS,
a non-canonical RIP, etc.) the ERETU instruction will fault (#SS or #GP).

From a Linux point of view, this really should be considered a user space
failure, so use the standard fault fixup mechanism to intercept the fault,
fix up the exception frame, and redirect execution to fred_entrypoint_user.
The end result is that it appears just as if the hardware had taken the
exception immediately after completing the transition to user space.
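
The frame fixup relies on simple pointer arithmetic: when ERETU faults,
the interrupted stack pointer is taken to point at the ip slot of the
user frame, so the enclosing register frame starts offsetof(ip) bytes
lower. The sketch below reproduces only that arithmetic in user space,
assuming a 64-bit build; struct mock_pt_regs is a hypothetical
simplification of the x86-64 pt_regs layout (15 general-purpose
registers, orig_ax, then the hardware return frame), not the kernel's
definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of the x86-64 register frame. */
struct mock_pt_regs {
	uint64_t gprs[15];	/* r15 ... di */
	uint64_t orig_ax;	/* error code slot */
	uint64_t ip, cs, flags, sp, ss;
};

/* Recover the user frame from the stack pointer seen at the fault. */
struct mock_pt_regs *user_regs_from_fault_sp(uint64_t fault_sp)
{
	return (struct mock_pt_regs *)(uintptr_t)
		(fault_sp - offsetof(struct mock_pt_regs, ip));
}

void eretu_fixup_selftest(void)
{
	struct mock_pt_regs frame = { .orig_ax = 0 };
	uint64_t sp = (uint64_t)(uintptr_t)&frame.ip;

	assert(user_regs_from_fault_sp(sp) == &frame);
	/* Moving sp down by 8 makes it cover the error code slot. */
	assert(sp - 8 == (uint64_t)(uintptr_t)&frame.orig_ax);
}
```

This mirrors why the handler in the patch subtracts 8 from regs->sp after
storing the error code in orig_ax.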

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/entry/entry_64_fred.S             |  8 +++++--
 arch/x86/include/asm/extable_fixup_types.h |  4 +++-
 arch/x86/mm/extable.c                      | 28 ++++++++++++++++++++++
 3 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 1fb765fd3871..027ef8f1e600 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -5,8 +5,10 @@
  * The actual FRED entry points.
  */
 #include <linux/linkage.h>
-#include <asm/errno.h>
+#include <asm/asm.h>
 #include <asm/asm-offsets.h>
+#include <asm/errno.h>
+#include <asm/export.h>
 #include <asm/fred.h>
 
 #include "calling.h"
@@ -38,7 +40,9 @@ SYM_CODE_START_NOALIGN(fred_entrypoint_user)
 	call	fred_entry_from_user
 SYM_INNER_LABEL(fred_exit_user, SYM_L_GLOBAL)
 	FRED_EXIT
-	ERETU
+1:	ERETU
+
+	_ASM_EXTABLE_TYPE(1b, fred_entrypoint_user, EX_TYPE_ERETU)
 SYM_CODE_END(fred_entrypoint_user)
 
 /*
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 991e31cfde94..1585c798a02f 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -64,6 +64,8 @@
 #define	EX_TYPE_UCOPY_LEN4		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
 #define	EX_TYPE_UCOPY_LEN8		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))
 
-#define EX_TYPE_ZEROPAD			20 /* longword load with zeropad on fault */
+#define	EX_TYPE_ZEROPAD			20 /* longword load with zeropad on fault */
+
+#define	EX_TYPE_ERETU			21
 
 #endif
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 60814e110a54..88a2c419ce8b 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -6,6 +6,7 @@
 #include <xen/xen.h>
 
 #include <asm/fpu/api.h>
+#include <asm/fred.h>
 #include <asm/sev.h>
 #include <asm/traps.h>
 #include <asm/kdebug.h>
@@ -195,6 +196,29 @@ static bool ex_handler_ucopy_len(const struct exception_table_entry *fixup,
 	return ex_handler_uaccess(fixup, regs, trapnr);
 }
 
+#ifdef CONFIG_X86_FRED
+static bool ex_handler_eretu(const struct exception_table_entry *fixup,
+			     struct pt_regs *regs, unsigned long error_code)
+{
+	struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
+	unsigned short ss = uregs->ss;
+	unsigned short cs = uregs->cs;
+
+	fred_info(uregs)->edata = fred_event_data(regs);
+	uregs->ssx = regs->ssx;
+	uregs->ss = ss;
+	uregs->csx = regs->csx;
+	uregs->current_stack_level = 0;
+	uregs->cs = cs;
+
+	/* Copy error code to uregs and adjust stack pointer accordingly */
+	uregs->orig_ax = error_code;
+	regs->sp -= 8;
+
+	return ex_handler_default(fixup, regs);
+}
+#endif
+
 int ex_get_fixup_type(unsigned long ip)
 {
 	const struct exception_table_entry *e = search_exception_tables(ip);
@@ -272,6 +296,10 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 		return ex_handler_ucopy_len(e, regs, trapnr, reg, imm);
 	case EX_TYPE_ZEROPAD:
 		return ex_handler_zeropad(e, regs, fault_addr);
+#ifdef CONFIG_X86_FRED
+	case EX_TYPE_ERETU:
+		return ex_handler_eretu(e, regs, error_code);
+#endif
 	}
 	BUG();
 }
-- 
2.34.1



* [PATCH v5 29/34] x86/ia32: do not modify the DPL bits for a null selector
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (27 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 30/34] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f Xin Li
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

When a null selector is to be loaded into a segment register,
reload_segments() sets its DPL bits to 3. Later when the IRET
instruction loads it, it zeros the segment register. The two
operations offset each other to actually effect a nop.

Unlike IRET, ERETU does not make any of DS, ES, FS, or GS null
if it is found to have DPL < 3. It is expected that a FRED-enabled
operating system will return to ring 3 (in compatibility mode)
only when those segments all have DPL = 3.

Thus when FRED is enabled, we end up with having 3 in a segment
register even when it is initially set to 0.

Fix it by not modifying the DPL bits for a null selector.
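
A small user-space sketch of the difference, for illustration only:
usrseg() is copied from this patch, while old_usrseg() is a hypothetical
name for the previous unconditional "| 0x03" behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Previous behaviour: always force the low two selector bits to 3. */
uint16_t old_usrseg(uint16_t sel)
{
	return (uint16_t)(sel | 3);
}

/* New behaviour: leave null selectors (0..3) untouched. */
uint16_t usrseg(uint16_t sel)
{
	return sel <= 3 ? sel : (uint16_t)(sel | 3);
}

void usrseg_selftest(void)
{
	assert(old_usrseg(0) == 3);	/* null selector gets modified */
	assert(usrseg(0) == 0);		/* null selector is preserved */
	assert(usrseg(0x28) == old_usrseg(0x28));
}
```

For any non-null selector the two helpers agree, so only the null case
changes behaviour.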

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kernel/signal_32.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..7796cf84fca2 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -36,22 +36,27 @@
 #ifdef CONFIG_IA32_EMULATION
 #include <asm/ia32_unistd.h>
 
+static inline u16 usrseg(u16 sel)
+{
+	return sel <= 3 ? sel : sel | 3;
+}
+
 static inline void reload_segments(struct sigcontext_32 *sc)
 {
 	unsigned int cur;
 
 	savesegment(gs, cur);
-	if ((sc->gs | 0x03) != cur)
-		load_gs_index(sc->gs | 0x03);
+	if (usrseg(sc->gs) != cur)
+		load_gs_index(usrseg(sc->gs));
 	savesegment(fs, cur);
-	if ((sc->fs | 0x03) != cur)
-		loadsegment(fs, sc->fs | 0x03);
+	if (usrseg(sc->fs) != cur)
+		loadsegment(fs, usrseg(sc->fs));
 	savesegment(ds, cur);
-	if ((sc->ds | 0x03) != cur)
-		loadsegment(ds, sc->ds | 0x03);
+	if (usrseg(sc->ds) != cur)
+		loadsegment(ds, usrseg(sc->ds));
 	savesegment(es, cur);
-	if ((sc->es | 0x03) != cur)
-		loadsegment(es, sc->es | 0x03);
+	if (usrseg(sc->es) != cur)
+		loadsegment(es, usrseg(sc->es));
 }
 
 #define sigset32_t			compat_sigset_t
-- 
2.34.1



* [PATCH v5 30/34] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (28 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 29/34] x86/ia32: do not modify the DPL bits for a null selector Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 31/34] x86/fred: allow dynamic stack frame size Xin Li
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

FRED inherits the Intel VT-x enhancement of classified events with
a two-level event dispatch logic. The first-level dispatch is on
the event type, and the second-level is on the event vector. This
also means that vectors in different event types are orthogonal,
thus, vectors 0x10-0x1f become available as hardware interrupts.

Enable interrupt vectors 0x10-0x1f on FRED systems (interrupt 0x80 is
already enabled.) Most of these changes are about removing the
assumption that the lowest-priority vector is hard-wired to 0x20.
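
As a rough sketch of the resulting vector layout: the two constants and
the ISA_IRQ_VECTOR() arithmetic below follow the irq_vectors.h hunk in
this patch, while pick_first_external_vector() is a hypothetical helper
standing in for however first_external_vector ends up being initialized.

```c
#include <assert.h>
#include <stdbool.h>

#define FIRST_EXTERNAL_VECTOR_IDT	0x20
#define FIRST_EXTERNAL_VECTOR_FRED	0x10

/* ISA IRQs stay anchored to the IDT base, i.e. 0x30-0x3f. */
#define ISA_IRQ_VECTOR(irq) \
	(((FIRST_EXTERNAL_VECTOR_IDT + 16) & ~15) + (irq))

/* Hypothetical helper: choose the lowest usable external vector. */
unsigned int pick_first_external_vector(bool fred_enabled)
{
	return fred_enabled ? FIRST_EXTERNAL_VECTOR_FRED
			    : FIRST_EXTERNAL_VECTOR_IDT;
}

void vectors_selftest(void)
{
	/* FRED gains 16 extra vectors below the IDT starting point. */
	assert(pick_first_external_vector(false) -
	       pick_first_external_vector(true) == 16);
	assert(ISA_IRQ_VECTOR(0) == 0x30);
}
```

Keeping ISA_IRQ_VECTOR() tied to the IDT constant means the ISA range does
not move when the first external vector becomes a runtime value.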

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/idtentry.h    |  4 ++--
 arch/x86/include/asm/irq.h         |  5 +++++
 arch/x86/include/asm/irq_vectors.h | 15 +++++++++++----
 arch/x86/kernel/apic/apic.c        | 11 ++++++++---
 arch/x86/kernel/apic/vector.c      |  8 +++++++-
 arch/x86/kernel/fred.c             |  4 ++--
 arch/x86/kernel/idt.c              |  6 +++---
 arch/x86/kernel/irq.c              |  2 +-
 arch/x86/kernel/traps.c            |  2 ++
 9 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index bd43866f9c3e..57c891148b59 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -546,8 +546,8 @@ __visible noinstr void func(struct pt_regs *regs,			\
  */
 	.align IDT_ALIGN
 SYM_CODE_START(irq_entries_start)
-    vector=FIRST_EXTERNAL_VECTOR
-    .rept NR_EXTERNAL_VECTORS
+    vector=FIRST_EXTERNAL_VECTOR_IDT
+    .rept FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR_IDT
 	UNWIND_HINT_IRET_REGS
 0 :
 	ENDBR
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 768aa234cbb4..e4be6f8409ad 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -11,6 +11,11 @@
 #include <asm/apicdef.h>
 #include <asm/irq_vectors.h>
 
+/*
+ * The first available IRQ vector
+ */
+extern unsigned int __ro_after_init first_external_vector;
+
 /*
  * The irq entry code is in the noinstr section and the start/end of
  * __irqentry_text is emitted via labels. Make the build fail if
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..cb3670a7c18f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -31,15 +31,23 @@
 
 /*
  * IDT vectors usable for external interrupt sources start at 0x20.
- * (0x80 is the syscall vector, 0x30-0x3f are for ISA)
+ * (0x80 is the syscall vector, 0x30-0x3f are for ISA).
+ *
+ * With FRED we can also use 0x10-0x1f even though those overlap
+ * exception vectors as FRED distinguishes exceptions and interrupts.
+ * Therefore, FIRST_EXTERNAL_VECTOR is no longer a constant.
  */
-#define FIRST_EXTERNAL_VECTOR		0x20
+#define FIRST_EXTERNAL_VECTOR_IDT	0x20
+#define FIRST_EXTERNAL_VECTOR_FRED	0x10
+#define FIRST_EXTERNAL_VECTOR		first_external_vector
 
 /*
  * Reserve the lowest usable vector (and hence lowest priority)  0x20 for
  * triggering cleanup after irq migration. 0x21-0x2f will still be used
  * for device interrupts.
  */
+#define IRQ_MOVE_CLEANUP_VECTOR_IDT	FIRST_EXTERNAL_VECTOR_IDT
+#define IRQ_MOVE_CLEANUP_VECTOR_FRED	FIRST_EXTERNAL_VECTOR_FRED
 #define IRQ_MOVE_CLEANUP_VECTOR		FIRST_EXTERNAL_VECTOR
 
 #define IA32_SYSCALL_VECTOR		0x80
@@ -48,7 +56,7 @@
  * Vectors 0x30-0x3f are used for ISA interrupts.
  *   round up to the next 16-vector boundary
  */
-#define ISA_IRQ_VECTOR(irq)		(((FIRST_EXTERNAL_VECTOR + 16) & ~15) + irq)
+#define ISA_IRQ_VECTOR(irq)		(((FIRST_EXTERNAL_VECTOR_IDT + 16) & ~15) + irq)
 
 /*
  * Special IRQ vectors used by the SMP architecture, 0xf0-0xff
@@ -114,7 +122,6 @@
 #define FIRST_SYSTEM_VECTOR		NR_VECTORS
 #endif
 
-#define NR_EXTERNAL_VECTORS		(FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
 #define NR_SYSTEM_VECTORS		(NR_VECTORS - FIRST_SYSTEM_VECTOR)
 
 /*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 20d9a604da7c..eef67f64aa81 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1621,12 +1621,17 @@ static void setup_local_APIC(void)
 	/*
 	 * Set Task Priority to 'accept all except vectors 0-31'.  An APIC
 	 * vector in the 16-31 range could be delivered if TPR == 0, but we
-	 * would think it's an exception and terrible things will happen.  We
-	 * never change this later on.
+	 * would think it's an exception and terrible things will happen,
+	 * unless we are using FRED in which case interrupts and
+	 * exceptions are distinguished by type code.
+	 *
+	 * We never change this later on.
 	 */
+	BUG_ON(!first_external_vector);
+
 	value = apic_read(APIC_TASKPRI);
 	value &= ~APIC_TPRI_MASK;
-	value |= 0x10;
+	value |= (first_external_vector - 0x10) & APIC_TPRI_MASK;
 	apic_write(APIC_TASKPRI, value);
 
 	/* Clear eventually stale ISR/IRR bits */
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index c1efebd27e6c..f4325445fd78 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -46,6 +46,7 @@ static struct irq_matrix *vector_matrix;
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
 #endif
+unsigned int first_external_vector = FIRST_EXTERNAL_VECTOR_IDT;
 
 void lock_vector_lock(void)
 {
@@ -796,7 +797,12 @@ int __init arch_early_irq_init(void)
 	 * Allocate the vector matrix allocator data structure and limit the
 	 * search area.
 	 */
-	vector_matrix = irq_alloc_matrix(NR_VECTORS, FIRST_EXTERNAL_VECTOR,
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		first_external_vector = FIRST_EXTERNAL_VECTOR_FRED;
+	else
+		first_external_vector = FIRST_EXTERNAL_VECTOR_IDT;
+
+	vector_matrix = irq_alloc_matrix(NR_VECTORS, first_external_vector,
 					 FIRST_SYSTEM_VECTOR);
 	BUG_ON(!vector_matrix);
 
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index 827b58fd98d4..04f057219c6e 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -51,7 +51,7 @@ void __init fred_setup_apic(void)
 {
 	int i;
 
-	for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
+	for (i = 0; i < FIRST_EXTERNAL_VECTOR_FRED; i++)
 		set_bit(i, system_vectors);
 
 	/*
@@ -60,7 +60,7 @@ void __init fred_setup_apic(void)
 	 * /proc/interrupts.
 	 */
 #ifdef CONFIG_SMP
-	set_bit(IRQ_MOVE_CLEANUP_VECTOR, system_vectors);
+	set_bit(IRQ_MOVE_CLEANUP_VECTOR_FRED, system_vectors);
 #endif
 
 	for (i = 0; i < NR_SYSTEM_VECTORS; i++) {
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..d3fd86f85de9 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -131,7 +131,7 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(RESCHEDULE_VECTOR,			asm_sysvec_reschedule_ipi),
 	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
 	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
-	INTG(IRQ_MOVE_CLEANUP_VECTOR,		asm_sysvec_irq_move_cleanup),
+	INTG(IRQ_MOVE_CLEANUP_VECTOR_IDT,	asm_sysvec_irq_move_cleanup),
 	INTG(REBOOT_VECTOR,			asm_sysvec_reboot),
 #endif
 
@@ -274,13 +274,13 @@ static void __init idt_map_in_cea(void)
  */
 void __init idt_setup_apic_and_irq_gates(void)
 {
-	int i = FIRST_EXTERNAL_VECTOR;
+	int i = FIRST_EXTERNAL_VECTOR_IDT;
 	void *entry;
 
 	idt_setup_from_table(idt_table, apic_idts, ARRAY_SIZE(apic_idts), true);
 
 	for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
-		entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
+		entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR_IDT);
 		set_intr_gate(i, entry);
 	}
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 7e125fff45ab..b7511e02959c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -359,7 +359,7 @@ void fixup_irqs(void)
 	 * vector_lock because the cpu is already marked !online, so
 	 * nothing else will touch it.
 	 */
-	for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
+	for (vector = first_external_vector; vector < NR_VECTORS; vector++) {
 		if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
 			continue;
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c7253b4901f0..c46eba091728 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1544,6 +1544,8 @@ DEFINE_IDTENTRY_IRQ(spurious_interrupt)
 	pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
 		vector, smp_processor_id());
 }
+
+unsigned int first_external_vector = FIRST_EXTERNAL_VECTOR_IDT;
 #endif
 
 /*
-- 
2.34.1



* [PATCH v5 31/34] x86/fred: allow dynamic stack frame size
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (29 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 30/34] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 32/34] x86/fred: disable FRED by default in its early stage Xin Li
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

A FRED stack frame could contain different amounts of information for
different event types, or perhaps even for different instances of the
same event type. Thus the code must not depend on knowing the stack
frame size in advance, so that the stack frame size can be dynamic.

Implement this as follows:
  1) add a new field user_pt_regs to thread_info, and initialize it
     with a pointer to a virtual pt_regs structure at the top of a
     thread stack.
  2) save a pointer to the user-space pt_regs structure created by
     fred_entrypoint_user() to user_pt_regs in fred_entry_from_user().
  3) initialize the init_thread_info's user_pt_regs with a pointer to
     a virtual pt_regs structure at the top of init stack.

This approach also works for IDT, thus we unify the code.

Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
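
Note (illustrative only, not part of the patch): the default value that
arch_init_user_pt_regs() stores is "one pt_regs below the top of the stack",
with no padding subtracted. A userspace sketch of that pointer arithmetic,
assuming a 16 KiB stack and a stand-in structure (pt_regs_sketch,
THREAD_SIZE_SKETCH, top_of_stack() and default_user_pt_regs() are all
hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct pt_regs; the layout is illustrative. */
struct pt_regs_sketch {
	unsigned long regs[21];
};

/* Assumption: a 16 KiB thread stack, standing in for THREAD_SIZE. */
#define THREAD_SIZE_SKETCH	(16u * 1024u)

/* Stand-in for task_top_of_stack(): no padding is subtracted. */
static uintptr_t top_of_stack(const void *stack_page)
{
	return (uintptr_t)stack_page + THREAD_SIZE_SKETCH;
}

/* Default user_pt_regs: one pt_regs below the top of the stack. */
static struct pt_regs_sketch *default_user_pt_regs(void *stack_page)
{
	assert(stack_page != NULL);

	/* The cast applies to the address first; "- 1" then steps back one struct. */
	return (struct pt_regs_sketch *)top_of_stack(stack_page) - 1;
}
```

On entry from user mode the pointer is then overwritten with the address of
the frame that was actually created, so its size no longer needs to be known
in advance.
---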
 arch/x86/entry/entry_32.S           |  2 +-
 arch/x86/entry/entry_fred.c         |  2 ++
 arch/x86/include/asm/entry-common.h |  3 +++
 arch/x86/include/asm/processor.h    | 12 +++------
 arch/x86/include/asm/switch_to.h    |  3 +--
 arch/x86/include/asm/thread_info.h  | 41 ++++-------------------------
 arch/x86/kernel/head_32.S           |  3 +--
 arch/x86/kernel/process.c           |  5 ++++
 kernel/fork.c                       |  6 +++++
 9 files changed, 27 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 91397f58ac30..5adc4cf33d92 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1244,7 +1244,7 @@ SYM_CODE_START(rewind_stack_and_make_dead)
 	xorl	%ebp, %ebp
 
 	movl	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %esi
-	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
+	leal	-PTREGS_SIZE(%esi), %esp
 
 	call	make_task_dead
 1:	jmp 1b
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index 8d3e144670d6..a72167c83923 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -178,6 +178,8 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs)
 		[EVENT_TYPE_OTHER]	= fred_syscall_slow
 	};
 
+	current->thread_info.user_pt_regs = regs;
+
 	/*
 	 * FRED employs a two-level event dispatch mechanism, with
 	 * the first-level on the type of an event and the second-level
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 117903881fe4..5b7d0f47f188 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -12,6 +12,9 @@
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 {
+	if (!cpu_feature_enabled(X86_FEATURE_FRED))
+		current->thread_info.user_pt_regs = regs;
+
 	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
 		/*
 		 * Make sure that the entry code gave us a sensible EFLAGS
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 8d73004e4cac..4a50d2a2c14b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -626,17 +626,11 @@ static inline void spin_lock_prefetch(const void *x)
 	prefetchw(x);
 }
 
-#define TOP_OF_INIT_STACK ((unsigned long)&init_stack + sizeof(init_stack) - \
-			   TOP_OF_KERNEL_STACK_PADDING)
+#define TOP_OF_INIT_STACK ((unsigned long)&init_stack + sizeof(init_stack))
 
-#define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
+#define task_top_of_stack(task) ((unsigned long)task_stack_page(task) + THREAD_SIZE)
 
-#define task_pt_regs(task) \
-({									\
-	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
-	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
-	((struct pt_regs *)__ptr) - 1;					\
-})
+#define task_pt_regs(task) ((task)->thread_info.user_pt_regs)
 
 #ifdef CONFIG_X86_32
 #define INIT_THREAD  {							  \
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 00fd85abc1d2..0a31da150808 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -72,8 +72,7 @@ static inline void update_task_stack(struct task_struct *task)
 		/*
 		 * Will use WRMSRNS/WRMSRLIST for performance once it's upstreamed.
 		 */
-		wrmsrl(MSR_IA32_FRED_RSP0,
-		       task_top_of_stack(task) + TOP_OF_KERNEL_STACK_PADDING);
+		wrmsrl(MSR_IA32_FRED_RSP0, task_top_of_stack(task));
 	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
 		/* Xen PV enters the kernel on the thread stack. */
 		load_sp0(task_top_of_stack(task));
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 998483078d5f..ced0a01e0a3e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -13,42 +13,6 @@
 #include <asm/percpu.h>
 #include <asm/types.h>
 
-/*
- * TOP_OF_KERNEL_STACK_PADDING is a number of unused bytes that we
- * reserve at the top of the kernel stack.  We do it because of a nasty
- * 32-bit corner case.  On x86_32, the hardware stack frame is
- * variable-length.  Except for vm86 mode, struct pt_regs assumes a
- * maximum-length frame.  If we enter from CPL 0, the top 8 bytes of
- * pt_regs don't actually exist.  Ordinarily this doesn't matter, but it
- * does in at least one case:
- *
- * If we take an NMI early enough in SYSENTER, then we can end up with
- * pt_regs that extends above sp0.  On the way out, in the espfix code,
- * we can read the saved SS value, but that value will be above sp0.
- * Without this offset, that can result in a page fault.  (We are
- * careful that, in this case, the value we read doesn't matter.)
- *
- * In vm86 mode, the hardware frame is much longer still, so add 16
- * bytes to make room for the real-mode segments.
- *
- * x86-64 has a fixed-length stack frame, but it depends on whether
- * or not FRED is enabled. Future versions of FRED might make this
- * dynamic, but for now it is always 2 words longer.
- */
-#ifdef CONFIG_X86_32
-# ifdef CONFIG_VM86
-#  define TOP_OF_KERNEL_STACK_PADDING 16
-# else
-#  define TOP_OF_KERNEL_STACK_PADDING 8
-# endif
-#else /* x86-64 */
-# ifdef CONFIG_X86_FRED
-#  define TOP_OF_KERNEL_STACK_PADDING (2*8)
-# else
-#  define TOP_OF_KERNEL_STACK_PADDING 0
-# endif
-#endif
-
 /*
  * low level task data that entry.S needs immediate access to
  * - this struct should fit entirely inside of one cache line
@@ -56,6 +20,7 @@
  */
 #ifndef __ASSEMBLY__
 struct task_struct;
+struct pt_regs;
 #include <asm/cpufeature.h>
 #include <linux/atomic.h>
 
@@ -66,11 +31,14 @@ struct thread_info {
 #ifdef CONFIG_SMP
 	u32			cpu;		/* current CPU */
 #endif
+	struct pt_regs		*user_pt_regs;
 };
 
+#define INIT_TASK_PT_REGS ((struct pt_regs *)TOP_OF_INIT_STACK - 1)
 #define INIT_THREAD_INFO(tsk)			\
 {						\
 	.flags		= 0,			\
+	.user_pt_regs   = INIT_TASK_PT_REGS,	\
 }
 
 #else /* !__ASSEMBLY__ */
@@ -240,6 +208,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 
 extern void arch_task_cache_init(void);
 extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
+extern void arch_init_user_pt_regs(struct task_struct *tsk);
 extern void arch_release_task_struct(struct task_struct *tsk);
 extern void arch_setup_new_exec(void);
 #define arch_setup_new_exec arch_setup_new_exec
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 67c8ed99144b..0201ddcd7576 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -517,8 +517,7 @@ SYM_DATA_END(initial_page_table)
  * reliably detect the end of the stack.
  */
 SYM_DATA(initial_stack,
-		.long init_thread_union + THREAD_SIZE -
-		SIZEOF_PTREGS - TOP_OF_KERNEL_STACK_PADDING)
+		.long init_thread_union + THREAD_SIZE - SIZEOF_PTREGS)
 
 __INITRODATA
 int_msg:
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b650cde3f64d..e1c6350290ae 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -98,6 +98,11 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return 0;
 }
 
+void arch_init_user_pt_regs(struct task_struct *tsk)
+{
+	tsk->thread_info.user_pt_regs = (struct pt_regs *)task_top_of_stack(tsk)- 1;
+}
+
 #ifdef CONFIG_X86_64
 void arch_release_task_struct(struct task_struct *tsk)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index f68954d05e89..85c4216bdcd8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -958,6 +958,10 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
 	return 0;
 }
 
+void __weak arch_init_user_pt_regs(struct task_struct *tsk)
+{
+}
+
 void set_task_stack_end_magic(struct task_struct *tsk)
 {
 	unsigned long *stackend;
@@ -985,6 +989,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (err)
 		goto free_tsk;
 
+	arch_init_user_pt_regs(tsk);
+
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	refcount_set(&tsk->stack_refcount, 1);
 #endif
-- 
2.34.1



* [PATCH v5 32/34] x86/fred: disable FRED by default in its early stage
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (30 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 31/34] x86/fred: allow dynamic stack frame size Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07  2:39 ` [PATCH v5 33/34] KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM exits Xin Li
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

Disable FRED by default in its early stage.

To enable FRED, add the new option "fred" to the kernel command line.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
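
Note (illustrative only, not part of the patch): the resulting policy is
"FRED stays off unless the CPU supports it and "fred" is on the command
line". A userspace sketch of that policy, assuming the option is matched as
a whole space-separated word (has_bool_option() and fred_enabled() are
hypothetical helpers, not kernel functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for a boolean option lookup: returns true if
 * 'opt' appears in 'cmdline' as a whole space-separated word.
 */
static bool has_bool_option(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	assert(len > 0);

	while ((p = strstr(p, opt)) != NULL) {
		bool starts_word = (p == cmdline) || (p[-1] == ' ');
		bool ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return true;
		p++;	/* keep scanning after a partial match */
	}
	return false;
}

/* Mirrors the patch's policy: the feature is cleared unless "fred" is given. */
static bool fred_enabled(bool cpu_has_fred, const char *cmdline)
{
	return cpu_has_fred && has_bool_option(cmdline, "fred");
}
```

For example, "root=/dev/sda1 fred quiet" enables FRED on a capable CPU,
while "nofred" or "fredx" does not.
---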
 Documentation/admin-guide/kernel-parameters.txt | 4 ++++
 arch/x86/kernel/cpu/common.c                    | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6221a1d057dd..c55ea60e1a0c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1498,6 +1498,10 @@
 			Warning: use of this parameter will taint the kernel
 			and may cause unknown problems.
 
+	fred
+			Forcefully enable flexible return and event delivery,
+			which is otherwise disabled by default.
+
 	ftrace=[tracer]
 			[FTRACE] will set and start the specified tracer
 			as early as possible in order to facilitate early
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eea41cb8722e..4db5e619fc97 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1467,6 +1467,9 @@ static void __init cpu_parse_early_param(void)
 	char *argptr = arg, *opt;
 	int arglen, taint = 0;
 
+	if (!cmdline_find_option_bool(boot_command_line, "fred"))
+		setup_clear_cpu_cap(X86_FEATURE_FRED);
+
 #ifdef CONFIG_X86_32
 	if (cmdline_find_option_bool(boot_command_line, "no387"))
 #ifdef CONFIG_MATH_EMULATION
-- 
2.34.1



* [PATCH v5 33/34] KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM exits
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (31 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 32/34] x86/fred: disable FRED by default in its early stage Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-22 17:57   ` Sean Christopherson
  2023-03-07  2:39 ` [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled Xin Li
  2023-03-11  9:58 ` [PATCH v5 00/34] x86: enable FRED for x86-64 Kang, Shan
  34 siblings, 1 reply; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

When FRED is enabled, the IDT is no longer used, so call external_interrupt()
to handle the IRQ in IRQ-caused VM exits.

Create an event return stack frame with the host context immediately after
a VM exit for calling external_interrupt(). All other fields of the pt_regs
structure are cleared to 0. Refer to the discussion about the register values
in the pt_regs structure at:

  https://lore.kernel.org/kvm/ef2c54f7-14b9-dcbb-c3c4-1533455e7a18@redhat.com/

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v4:
*) Do NOT use the term "injection", which in the KVM context means to
   reinject an event into the guest (Sean Christopherson).
*) Use cs/ss instead of csx/ssx when initializing the pt_regs structure
   for calling external_interrupt(), otherwise it breaks i386 build.
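
Note (illustrative only, not part of the patch): the frame construction in
the hunk below can be sketched in userspace. The structure is a reduced
stand-in for pt_regs, and the constants are assumed x86-64 values for
__KERNEL_CS, __KERNEL_DS and X86_EFLAGS_FIXED; make_host_frame() is a
hypothetical helper.

```c
#include <assert.h>

/* Illustrative subset of pt_regs; the real structure has more fields. */
struct pt_regs_sketch {
	unsigned long ax, bx, cx;	/* stand-ins for the cleared registers */
	unsigned long ip, cs, flags, sp, ss;
};

/* Assumed x86-64 values, named after the kernel constants they stand for. */
#define KERNEL_CS_SKETCH	0x10ul
#define KERNEL_DS_SKETCH	0x18ul
#define EFLAGS_FIXED_SKETCH	0x2ul	/* bit 1 of RFLAGS is always set */

/* Hypothetical helper: fill in the return frame, leave the rest zeroed. */
static struct pt_regs_sketch make_host_frame(unsigned long host_rsp,
					     unsigned long resume_ip)
{
	struct pt_regs_sketch regs = {0};	/* all other fields stay 0 */

	/* Sketch-only sanity check: the stack pointer should be word aligned. */
	assert(host_rsp % sizeof(unsigned long) == 0);

	regs.ss = KERNEL_DS_SKETCH;
	regs.sp = host_rsp;
	regs.flags = EFLAGS_FIXED_SKETCH;
	regs.cs = KERNEL_CS_SKETCH;
	regs.ip = resume_ip;

	return regs;
}
```

Zero-initializing the whole structure first means no stale stack contents
reach the interrupt handler through the fields that are not set explicitly.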
---
 arch/x86/kvm/vmx/vmx.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bcac3efcde41..3ebeaab34b2e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -47,6 +47,7 @@
 #include <asm/mshyperv.h>
 #include <asm/mwait.h>
 #include <asm/spec-ctrl.h>
+#include <asm/traps.h>
 #include <asm/virtext.h>
 #include <asm/vmx.h>
 
@@ -6923,7 +6924,26 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-	vmx_do_interrupt_irqoff(gate_offset(desc));
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		struct vcpu_vmx *vmx = to_vmx(vcpu);
+		struct pt_regs regs = {};
+
+		/*
+		 * Create an event return stack frame with the
+		 * host context immediately after a VM exit.
+		 *
+		 * All other fields of the pt_regs structure are
+		 * cleared to 0.
+		 */
+		regs.ss		= __KERNEL_DS;
+		regs.sp		= vmx->loaded_vmcs->host_state.rsp;
+		regs.flags	= X86_EFLAGS_FIXED;
+		regs.cs		= __KERNEL_CS;
+		regs.ip		= (unsigned long)vmx_vmexit;
+
+		external_interrupt(&regs, vector);
+	} else
+		vmx_do_interrupt_irqoff(gate_offset(desc));
 	kvm_after_interrupt(vcpu);
 
 	vcpu->arch.at_instruction_boundary = true;
-- 
2.34.1



* [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (32 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 33/34] KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM exits Xin Li
@ 2023-03-07  2:39 ` Xin Li
  2023-03-07 22:00   ` Li, Xin3
  2023-03-22 17:49   ` Sean Christopherson
  2023-03-11  9:58 ` [PATCH v5 00/34] x86: enable FRED for x86-64 Kang, Shan
  34 siblings, 2 replies; 80+ messages in thread
From: Xin Li @ 2023-03-07  2:39 UTC (permalink / raw)
  To: linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	seanjc, pbonzini, ravi.v.shankar

Execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled.

Like IRET for IDT, ERETS/ERETU are required to end the NMI handler for FRED
to unblock NMI ASAP (w/ bit 28 of CS set). And there are 2 approaches to
invoke the FRED NMI handler:
1) execute "int $2", let the h/w do the job.
2) create a FRED NMI stack frame on the current kernel stack with ASM,
   and then jump to fred_entrypoint_kernel in arch/x86/entry/entry_64_fred.S.

1) is preferred as we want less ASM.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v4:
*) Do NOT use the term "injection", which in the KVM context means to
   reinject an event into the guest (Sean Christopherson).
*) Add the explanation of why to execute "int $2" to invoke the NMI handler
   in NMI caused VM exits (Sean Christopherson).
---
 arch/x86/kvm/vmx/vmx.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3ebeaab34b2e..4f12ead2266b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7229,7 +7229,16 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
 	    is_nmi(vmx_get_intr_info(vcpu))) {
 		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-		vmx_do_nmi_irqoff();
+		/*
+		 * Like IRET for IDT, ERETS/ERETU are required to end the NMI
+		 * handler for FRED to unblock NMI ASAP (w/ bit 28 of CS set).
+		 *
+		 * Invoke the FRED NMI handler through executing "int $2".
+		 */
+		if (cpu_feature_enabled(X86_FEATURE_FRED))
+			asm volatile("int $2");
+		else
+			vmx_do_nmi_irqoff();
 		kvm_after_interrupt(vcpu);
 	}
 
-- 
2.34.1



* RE: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-07  2:39 ` [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled Xin Li
@ 2023-03-07 22:00   ` Li, Xin3
  2023-03-22 17:49   ` Sean Christopherson
  1 sibling, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-07 22:00 UTC (permalink / raw)
  To: Li, Xin3, linux-kernel, x86, kvm
  Cc: tglx, mingo, bp, dave.hansen, hpa, peterz, andrew.cooper3,
	Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> Execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled.
> 
> Like IRET for IDT, ERETS/ERETU are required to end the NMI handler for FRED
> to unblock NMI ASAP (w/ bit 28 of CS set). And there are 2 approaches to
> invoke the FRED NMI handler:
> 1) execute "int $2", let the h/w do the job.
> 2) create a FRED NMI stack frame on the current kernel stack with ASM,
>    and then jump to fred_entrypoint_kernel in arch/x86/entry/entry_64_fred.S.
> 
> 1) is preferred as we want less ASM.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
> 
> Changes since v4:
> *) Do NOT use the term "injection", which in the KVM context means to
>    reinject an event into the guest (Sean Christopherson).
> *) Add the explanation of why to execute "int $2" to invoke the NMI handler
>    in NMI caused VM exits (Sean Christopherson).

Sean,

Do you have any further issues with the last 2 VMX patches?

If not, would you ack them?

Thanks!
  Xin


> ---
>  arch/x86/kvm/vmx/vmx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3ebeaab34b2e..4f12ead2266b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7229,7 +7229,16 @@ static noinstr void vmx_vcpu_enter_exit(struct
> kvm_vcpu *vcpu,
>  	if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
>  	    is_nmi(vmx_get_intr_info(vcpu))) {
>  		kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
> -		vmx_do_nmi_irqoff();
> +		/*
> +		 * Like IRET for IDT, ERETS/ERETU are required to end the NMI
> +		 * handler for FRED to unblock NMI ASAP (w/ bit 28 of CS set).
> +		 *
> +		 * Invoke the FRED NMI handler through executing "int $2".
> +		 */
> +		if (cpu_feature_enabled(X86_FEATURE_FRED))
> +			asm volatile("int $2");
> +		else
> +			vmx_do_nmi_irqoff();
>  		kvm_after_interrupt(vcpu);
>  	}
> 
> --
> 2.34.1



* Re: [PATCH v5 00/34] x86: enable FRED for x86-64
  2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
                   ` (33 preceding siblings ...)
  2023-03-07  2:39 ` [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled Xin Li
@ 2023-03-11  9:58 ` Kang, Shan
  2023-03-11 21:29   ` Li, Xin3
  2023-03-20  7:40   ` Kang, Shan
  34 siblings, 2 replies; 80+ messages in thread
From: Kang, Shan @ 2023-03-11  9:58 UTC (permalink / raw)
  To: Li, Xin3, kvm, linux-kernel, x86
  Cc: Christopherson,,
	Sean, bp, dave.hansen, peterz, hpa, mingo, tglx, andrew.cooper3,
	pbonzini, Shankar, Ravi V

We tested the v5 FRED patch set on the Intel Simics® Simulator and on a
machine with a 7th-generation Intel(R) Core(TM) CPU.

The Kselftest results on x86-64 are as follows.
+--------------------------------------------+-------+-------+-------+-------+
|                  Config                    |  Pass |  Fail |  Skip |  Hang |
+--------------------------------------------+-------+-------+-------+-------+
|       the 7th Intel(R) Core(TM) CPU        |  3078 |  458  |  734  |   5   |
|                 6.3.0-rc1+                 |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+
|       the 7th Intel(R) Core(TM) CPU        |  3078 |  458  |  734  |   5   |
|        6.3.0-rc1+ w/ FRED patch set        |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+
|   Intel Simics® Simulator w/o FRED model   |  1888 |  271  |  2105 |   11  |
|                 6.3.0-rc1+                 |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+
|   Intel Simics® Simulator w/o FRED model   |  1888 |  271  |  2105 |   11  |
|        6.3.0-rc1+ w/ FRED patch set        |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+
|   Intel Simics® Simulator w/ FRED model    |  1889 |  270  |  2105 |   11  |
|                 6.3.0-rc1+                 |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+
|   Intel Simics® Simulator w/ FRED model    |  1889 |  270  |  2105 |   11  |
| 6.3.0-rc1+ w/ FRED patch set FRED disabled |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+
|   Intel Simics® Simulator w/ FRED model    |  1888 |  270  |  2105 |   12  |
|        6.3.0-rc1+ w/ FRED patch set        |       |       |       |       |
+--------------------------------------------+-------+-------+-------+-------+

The following issues were seen in this round of testing.
+----------------+----------------+----------------+----------------+
|                | x86:test_      | bpf:test_progs | x86:sysret     |
|                | vsyscall_32    |                |    _rip_64     |
+----------------+----------------+----------------+----------------+
|    the 7th     |                |                |                |
|    Intel(R)    |      FAIL      |      FAIL      |      PASS      |
|  Core(TM) CPU  |                |                |                |
|   6.3.0-rc1+   |                |                |                |
+----------------+----------------+----------------+----------------+
|    the 7th     |                |                |                |
|    Intel(R)    |                |                |                |
|  Core(TM) CPU  |      FAIL      |      FAIL      |      PASS      |
| 6.3.0-rc1+ w/  |                |                |                |
| FRED patch set |                |                |                |
+----------------+----------------+----------------+----------------+
| Intel Simics®  |                |                |                |
| Simulator w/o  |      FAIL      |      FAIL      |      PASS      |
|   FRED model   |                |                |                |
|   6.3.0-rc1+   |                |                |                |
+----------------+----------------+----------------+----------------+
| Intel Simics®  |                |                |                |
| Simulator w/o  |                |                |                |
|   FRED model   |      FAIL      |      FAIL      |      PASS      |
| 6.3.0-rc1+ w/  |                |                |                |
| FRED patch set |                |                |                |
+----------------+----------------+----------------+----------------+
| Intel Simics®  |                |                |                |
|  Simulator w/  |      PASS      |      FAIL      |      PASS      |
|   FRED model   |                |                |                |
|   6.3.0-rc1+   |                |                |                |
+----------------+----------------+----------------+----------------+
| Intel Simics®  |                |                |                |
|  Simulator w/  |                |                |                |
|   FRED model   |      PASS      |      FAIL      |      PASS      |
| 6.3.0-rc1+ w/  |                |                |                |
| FRED patch set |                |                |                |
| FRED disabled  |                |                |                |
+----------------+----------------+----------------+----------------+
| Intel Simics®  |                |                |                |
|  Simulator w/  |                |                |                |
|   FRED model   |      PASS      |      HANG      |      FAIL      |
| 6.3.0-rc1+ w/  |                |                |                |
| FRED patch set |                |                |                |
+----------------+----------------+----------------+----------------+

The test "x86:sysret_rip_64" is NOT a valid test on FRED; Ammar Faizi posted a
fix after we discussed it on LKML.

The test "bpf:test_progs" is still under investigation.

The "x86:test_vsyscall_32" failure is a regression since the v3 FRED patch set.

Thanks
   --Shan

On Mon, 2023-03-06 at 18:39 -0800, Xin Li wrote:
> This patch set enables FRED for x86-64.
> 
> The Intel flexible return and event delivery (FRED) architecture defines
> simple new transitions that change privilege level (ring transitions). The
> FRED architecture was designed with the following goals:
> 1) Improve overall performance and response time by replacing event delivery
> through the interrupt descriptor table (IDT event delivery) and event return
> by the IRET instruction with lower latency transitions.
> 2) Improve software robustness by ensuring that event delivery establishes
> the full supervisor context and that event return establishes the full user
> context.
> 
> The new transitions defined by the FRED architecture are FRED event delivery
> and, for returning from events, two FRED return instructions. FRED event
> delivery can effect a transition from ring 3 to ring 0, but it is used also
> to deliver events incident to ring 0. One FRED instruction (ERETU) effects a
> return from ring 0 to ring 3, while the other (ERETS) returns while remaining
> in ring 0.
> 
> Search for the latest FRED spec in most search engines with this search
> pattern:
> 
>   site:intel.com FRED (flexible return and event delivery) specification
> 
> As of now there is no publicly available CPU supporting FRED, thus the Intel
> Simics® Simulator is used as the software development and testing vehicle. It
> can be downloaded from:
>   https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html
> 
> To enable FRED, the Simics package 8112 QSP-CPU needs to be installed with CPU
> model configured as:
> 	$cpu_comp_class = "x86-experimental-fred"
> 
> Longer term, we should refactor common code shared by FRED and IDT into common
> shared files, and contain IDT code using a new config CONFIG_X86_IDT.
> 
> Changes since v4:
> * Rebased against v6.3-rc1.
> * Do NOT use the term "injection", which in the KVM context means to
>   reinject an event into the guest (Sean Christopherson).
> * Add the explanation of why to execute "int $2" to invoke the NMI handler
>   in NMI caused VM exits (Sean Christopherson).
> * Use cs/ss instead of csx/ssx when initializing the pt_regs structure
>   for calling external_interrupt(), otherwise it breaks i386 build.
> 
> Changes since v3:
> * Call external_interrupt() to handle IRQ in IRQ caused VM exits.
> * Execute "int $2" to handle NMI in NMI caused VM exits.
> * Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
>   (Andrew Cooper).
> 
> Changes since v2:
> * Improve comments for changes in arch/x86/include/asm/idtentry.h.
> 
> Changes since v1:
> * call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
>   handler (Peter Zijlstra).
> * Initialize a FRED exception handler to fred_bad_event() instead of NULL
>   if no FRED handler defined for an exception vector (Peter Zijlstra).
> * Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
>   down into individual FRED exception handlers, instead of in the dispatch
>   framework (Peter Zijlstra).
> 
> 
> H. Peter Anvin (Intel) (24):
>   x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR
>   x86/traps: add a system interrupt table for system interrupt dispatch
>   x86/traps: add external_interrupt() to dispatch external interrupts
>   x86/cpufeature: add the cpu feature bit for FRED
>   x86/opcode: add ERETU, ERETS instructions to x86-opcode-map
>   x86/objtool: teach objtool about ERETU and ERETS
>   x86/cpu: add X86_CR4_FRED macro
>   x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED)
>   x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support
>   x86/cpu: add MSR numbers for FRED configuration
>   x86/fred: header file with FRED definitions
>   x86/fred: make unions for the cs and ss fields in struct pt_regs
>   x86/fred: reserve space for the FRED stack frame
>   x86/fred: add a page fault entry stub for FRED
>   x86/fred: add a debug fault entry stub for FRED
>   x86/fred: add a NMI entry stub for FRED
>   x86/fred: FRED entry/exit and dispatch code
>   x86/fred: FRED initialization code
>   x86/fred: update MSR_IA32_FRED_RSP0 during task switch
>   x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is
>     enabled
>   x86/fred: disallow the swapgs instruction when FRED is enabled
>   x86/fred: no ESPFIX needed when FRED is enabled
>   x86/fred: allow single-step trap and NMI when starting a new thread
>   x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f
> 
> Xin Li (10):
>   x86/traps: add install_system_interrupt_handler()
>   x86/traps: export external_interrupt() for VMX IRQ reinjection
>   x86/fred: header file for event types
>   x86/fred: add a machine check entry stub for FRED
>   x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
>   x86/ia32: do not modify the DPL bits for a null selector
>   x86/fred: allow dynamic stack frame size
>   x86/fred: disable FRED by default in its early stage
>   KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM
>     exits
>   KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits
>     when FRED is enabled
> 
>  .../admin-guide/kernel-parameters.txt         |   4 +
>  arch/x86/Kconfig                              |   9 +
>  arch/x86/entry/Makefile                       |   5 +-
>  arch/x86/entry/entry_32.S                     |   2 +-
>  arch/x86/entry/entry_64.S                     |   5 +
>  arch/x86/entry/entry_64_fred.S                |  59 +++++
>  arch/x86/entry/entry_fred.c                   | 234 ++++++++++++++++++
>  arch/x86/entry/vsyscall/vsyscall_64.c         |   2 +-
>  arch/x86/include/asm/cpufeatures.h            |   1 +
>  arch/x86/include/asm/disabled-features.h      |   8 +-
>  arch/x86/include/asm/entry-common.h           |   3 +
>  arch/x86/include/asm/event-type.h             |  17 ++
>  arch/x86/include/asm/extable_fixup_types.h    |   4 +-
>  arch/x86/include/asm/fred.h                   | 131 ++++++++++
>  arch/x86/include/asm/idtentry.h               |  76 +++++-
>  arch/x86/include/asm/irq.h                    |   5 +
>  arch/x86/include/asm/irq_vectors.h            |  15 +-
>  arch/x86/include/asm/msr-index.h              |  13 +-
>  arch/x86/include/asm/processor.h              |  12 +-
>  arch/x86/include/asm/ptrace.h                 |  36 ++-
>  arch/x86/include/asm/switch_to.h              |  10 +-
>  arch/x86/include/asm/thread_info.h            |  35 +--
>  arch/x86/include/asm/traps.h                  |  13 +
>  arch/x86/include/asm/vmx.h                    |  17 +-
>  arch/x86/include/uapi/asm/processor-flags.h   |   2 +
>  arch/x86/kernel/Makefile                      |   1 +
>  arch/x86/kernel/apic/apic.c                   |  11 +-
>  arch/x86/kernel/apic/vector.c                 |   8 +-
>  arch/x86/kernel/cpu/acrn.c                    |   7 +-
>  arch/x86/kernel/cpu/common.c                  |  88 ++++---
>  arch/x86/kernel/cpu/mce/core.c                |  11 +
>  arch/x86/kernel/cpu/mshyperv.c                |  22 +-
>  arch/x86/kernel/espfix_64.c                   |   8 +
>  arch/x86/kernel/fred.c                        |  73 ++++++
>  arch/x86/kernel/head_32.S                     |   3 +-
>  arch/x86/kernel/idt.c                         |   6 +-
>  arch/x86/kernel/irq.c                         |   6 +-
>  arch/x86/kernel/irqinit.c                     |   7 +-
>  arch/x86/kernel/kvm.c                         |   4 +-
>  arch/x86/kernel/nmi.c                         |  28 +++
>  arch/x86/kernel/process.c                     |   5 +
>  arch/x86/kernel/process_64.c                  |  21 +-
>  arch/x86/kernel/signal_32.c                   |  21 +-
>  arch/x86/kernel/traps.c                       | 175 +++++++++++--
>  arch/x86/kvm/vmx/vmx.c                        |  33 ++-
>  arch/x86/lib/x86-opcode-map.txt               |   2 +-
>  arch/x86/mm/extable.c                         |  28 +++
>  arch/x86/mm/fault.c                           |  20 +-
>  drivers/xen/events/events_base.c              |   5 +-
>  kernel/fork.c                                 |   6 +
>  tools/arch/x86/include/asm/cpufeatures.h      |   1 +
>  .../arch/x86/include/asm/disabled-features.h  |   8 +-
>  tools/arch/x86/include/asm/msr-index.h        |  13 +-
>  tools/arch/x86/lib/x86-opcode-map.txt         |   2 +-
>  tools/objtool/arch/x86/decode.c               |  19 +-
>  55 files changed, 1185 insertions(+), 175 deletions(-)
>  create mode 100644 arch/x86/entry/entry_64_fred.S
>  create mode 100644 arch/x86/entry/entry_fred.c
>  create mode 100644 arch/x86/include/asm/event-type.h
>  create mode 100644 arch/x86/include/asm/fred.h
>  create mode 100644 arch/x86/kernel/fred.c
> 


* RE: [PATCH v5 00/34] x86: enable FRED for x86-64
  2023-03-11  9:58 ` [PATCH v5 00/34] x86: enable FRED for x86-64 Kang, Shan
@ 2023-03-11 21:29   ` Li, Xin3
  2023-03-20  7:40   ` Kang, Shan
  1 sibling, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-11 21:29 UTC (permalink / raw)
  To: Kang, Shan, kvm, linux-kernel, x86
  Cc: Christopherson, Sean, bp, dave.hansen, peterz, hpa, mingo, tglx,
	andrew.cooper3, pbonzini, Shankar, Ravi V

> The following issues are seen in this round of test.
> +----------------+----------------+----------------+----------------+
> |                | x86:test_      | bpf:test_progs | x86:sysret     |
> |                | vsyscall_32    |                |    _rip_64     |
> +----------------+----------------+----------------+----------------+
> |    the 7th     |                |                |                |
> |    Intel(R)    |      FAIL      |      FAIL      |      PASS      |
> |  Core(TM) CPU  |                |                |                |
> |   6.3.0-rc1+   |                |                |                |
> +----------------+----------------+----------------+----------------+
> |    the 7th     |                |                |                |
> |    Intel(R)    |                |                |                |
> |  Core(TM) CPU  |      FAIL      |      FAIL      |      PASS      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> | Simulator w/o  |      FAIL      |      FAIL      |      PASS      |
> |   FRED model   |                |                |                |
> |   6.3.0-rc1+   |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> | Simulator w/o  |                |                |                |
> |   FRED model   |      FAIL      |      FAIL      |      PASS      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> |  Simulator w/  |      PASS      |      FAIL      |      PASS      |
> |   FRED model   |                |                |                |
> |   6.3.0-rc1+   |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> |  Simulator w/  |                |                |                |
> |   FRED model   |      PASS      |      FAIL      |      PASS      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> | FRED disabled  |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> |  Simulator w/  |                |                |                |
> |   FRED model   |      PASS      |      HANG      |      FAIL      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> +----------------+----------------+----------------+----------------+
> 
> The "x86:test_vsyscall_32" is a regression since the v3 FRED patch set.

The "x86:test_vsyscall_32" test passes on the Simics FRED model, whether or
not FRED is enabled, because the Simics FRED model supports the RDPID
instruction.

On a Simics non-FRED model or the bare-metal test machine, neither of which
supports RDPID, the test reads the CPU ID from the GDT_ENTRY_CPUNODE entry,
and it fails for the reason described in:
https://lore.kernel.org/lkml/20230311084824.2340-1-xin3.li@intel.com/

Thanks!
  Xin


* Re: [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-07  2:39 ` [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
@ 2023-03-17  9:39   ` Lai Jiangshan
  2023-03-17  9:55     ` andrew.cooper3
  2023-03-18  7:55     ` Li, Xin3
  0 siblings, 2 replies; 80+ messages in thread
From: Lai Jiangshan @ 2023-03-17  9:39 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

> +#ifdef CONFIG_X86_FRED
> +static bool ex_handler_eretu(const struct exception_table_entry *fixup,
> +                            struct pt_regs *regs, unsigned long error_code)
> +{
> +       struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
> +       unsigned short ss = uregs->ss;
> +       unsigned short cs = uregs->cs;
> +
> +       fred_info(uregs)->edata = fred_event_data(regs);
> +       uregs->ssx = regs->ssx;
> +       uregs->ss = ss;
> +       uregs->csx = regs->csx;
> +       uregs->current_stack_level = 0;
> +       uregs->cs = cs;

Hello

If the ERETU instruction tried to return from an NMI to ring 3 and faulted,
are NMIs still blocked?

We know that IRET unconditionally unblocks NMIs, but I can't find any statement
about this in the FRED manual.

In the pseudocode of ERETU in the manual, it seems that NMIs are only unblocked
when ERETU succeeds with bit 28 in csx set.  If so, this code will fail to
unblock NMIs if bit 28 is not explicitly set again in csx.

Thanks,
Lai

> +
> +       /* Copy error code to uregs and adjust stack pointer accordingly */
> +       uregs->orig_ax = error_code;
> +       regs->sp -= 8;
> +
> +       return ex_handler_default(fixup, regs);
> +}


* Re: [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-17  9:39   ` Lai Jiangshan
@ 2023-03-17  9:55     ` andrew.cooper3
  2023-03-17 13:02       ` Lai Jiangshan
  2023-03-17 21:00       ` H. Peter Anvin
  2023-03-18  7:55     ` Li, Xin3
  1 sibling, 2 replies; 80+ messages in thread
From: andrew.cooper3 @ 2023-03-17  9:55 UTC (permalink / raw)
  To: Lai Jiangshan, Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, seanjc, pbonzini, ravi.v.shankar

On 17/03/2023 9:39 am, Lai Jiangshan wrote:
>> +#ifdef CONFIG_X86_FRED
>> +static bool ex_handler_eretu(const struct exception_table_entry *fixup,
>> +                            struct pt_regs *regs, unsigned long error_code)
>> +{
>> +       struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
>> +       unsigned short ss = uregs->ss;
>> +       unsigned short cs = uregs->cs;
>> +
>> +       fred_info(uregs)->edata = fred_event_data(regs);
>> +       uregs->ssx = regs->ssx;
>> +       uregs->ss = ss;
>> +       uregs->csx = regs->csx;
>> +       uregs->current_stack_level = 0;
>> +       uregs->cs = cs;
> Hello
>
> If the ERETU instruction had tried to return from NMI to ring3 and just faulted,
> is NMI still blocked?
>
> We know that IRET unconditionally enables NMI, but I can't find any clue in the
> FRED's manual.
>
> In the pseudocode of ERETU in the manual, it seems that NMI is only enabled when
> ERETU succeeds with bit28 in csx set.  If so, this code will fail to reenable
> NMI if bit28 is not explicitly re-set in csx.

IRET clearing NMI blocking is the source of an immense amount of grief,
and ultimately the reason why Linux and others can't use supervisor
shadow stacks at the moment.

Changing this property, so NMIs only get unblocked on successful
execution of an ERET{S,U}, was a key demand of the FRED spec.

i.e. until you have successfully ERET*'d, you're still logically in the
NMI handler and NMIs need to remain blocked even when handling the #GP
from a bad ERET.

~Andrew


* Re: [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-17  9:55     ` andrew.cooper3
@ 2023-03-17 13:02       ` Lai Jiangshan
  2023-03-17 21:23         ` H. Peter Anvin
  2023-03-17 21:00       ` H. Peter Anvin
  1 sibling, 1 reply; 80+ messages in thread
From: Lai Jiangshan @ 2023-03-17 13:02 UTC (permalink / raw)
  To: andrew.cooper3
  Cc: Xin Li, linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen,
	hpa, peterz, seanjc, pbonzini, ravi.v.shankar

On Fri, Mar 17, 2023 at 5:56 PM <andrew.cooper3@citrix.com> wrote:
>
> On 17/03/2023 9:39 am, Lai Jiangshan wrote:
> >> +#ifdef CONFIG_X86_FRED
> >> +static bool ex_handler_eretu(const struct exception_table_entry *fixup,
> >> +                            struct pt_regs *regs, unsigned long error_code)
> >> +{
> >> +       struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
> >> +       unsigned short ss = uregs->ss;
> >> +       unsigned short cs = uregs->cs;
> >> +
> >> +       fred_info(uregs)->edata = fred_event_data(regs);
> >> +       uregs->ssx = regs->ssx;
> >> +       uregs->ss = ss;
> >> +       uregs->csx = regs->csx;
> >> +       uregs->current_stack_level = 0;
> >> +       uregs->cs = cs;
> > Hello
> >
> > If the ERETU instruction had tried to return from NMI to ring3 and just faulted,
> > is NMI still blocked?
> >
> > We know that IRET unconditionally enables NMI, but I can't find any clue in the
> > FRED's manual.
> >
> > In the pseudocode of ERETU in the manual, it seems that NMI is only enabled when
> > ERETU succeeds with bit28 in csx set.  If so, this code will fail to reenable
> > NMI if bit28 is not explicitly re-set in csx.
>
> IRET clearing NMI blocking is the source of an immense amount of grief,
> and ultimately the reason why Linux and others can't use supervisor
> shadow stacks at the moment.
>
> Changing this property, so NMIs only get unblocked on successful
> execution of an ERET{S,U}, was a key demand of the FRED spec.
>
> i.e. until you have successfully ERET*'d, you're still logically in the
> NMI handler and NMIs need to remain blocked even when handling the #GP
> from a bad ERET.
>

Handling the #GP for a bad ERETU can lead to rescheduling, and it is not
OK to reschedule with NMIs blocked.

I think "regs->nmi = 1;" (not uregs->nmi) can fix the problem.


* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-07  2:39 ` [PATCH v5 22/34] x86/fred: FRED initialization code Xin Li
@ 2023-03-17 13:35   ` Lai Jiangshan
  2023-03-17 21:32     ` H. Peter Anvin
  0 siblings, 1 reply; 80+ messages in thread
From: Lai Jiangshan @ 2023-03-17 13:35 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

Hello


The comments in cpu_init_fred_exceptions() seem too sparse to follow the code.

On Tue, Mar 7, 2023 at 11:07 AM Xin Li <xin3.li@intel.com> wrote:

> +/*
> + * Initialize FRED on this CPU. This cannot be __init as it is called
> + * during CPU hotplug.
> + */
> +void cpu_init_fred_exceptions(void)
> +{
> +       wrmsrl(MSR_IA32_FRED_CONFIG,
> +              FRED_CONFIG_ENTRYPOINT(fred_entrypoint_user) |
> +              FRED_CONFIG_REDZONE(8) | /* Reserve for CALL emulation */
> +              FRED_CONFIG_INT_STKLVL(0));

What does "Reserve for CALL emulation" refer to?

I guess it relates to X86_TRAP_BP. In entry_64.S:

        .if \vector == X86_TRAP_BP
                /*
                 * If coming from kernel space, create a 6-word gap to allow the
                 * int3 handler to emulate a call instruction.
                 */

> +
> +       wrmsrl(MSR_IA32_FRED_STKLVLS,
> +              FRED_STKLVL(X86_TRAP_DB,  1) |
> +              FRED_STKLVL(X86_TRAP_NMI, 2) |
> +              FRED_STKLVL(X86_TRAP_MC,  2) |
> +              FRED_STKLVL(X86_TRAP_DF,  3));

Why does each exception here need a stack level > 0?
Especially X86_TRAP_DB and X86_TRAP_NMI.

Should X86_TRAP_VE have a stack level > 0, and why or why not?

X86_TRAP_DF has the highest stack level; is that accidental
or deliberate?

Thanks
Lai


* Re: [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-17  9:55     ` andrew.cooper3
  2023-03-17 13:02       ` Lai Jiangshan
@ 2023-03-17 21:00       ` H. Peter Anvin
  1 sibling, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2023-03-17 21:00 UTC (permalink / raw)
  To: andrew.cooper3, Lai Jiangshan, Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, peterz,
	seanjc, pbonzini, ravi.v.shankar

On March 17, 2023 2:55:44 AM PDT, andrew.cooper3@citrix.com wrote:
>On 17/03/2023 9:39 am, Lai Jiangshan wrote:
>>> +#ifdef CONFIG_X86_FRED
>>> +static bool ex_handler_eretu(const struct exception_table_entry *fixup,
>>> +                            struct pt_regs *regs, unsigned long error_code)
>>> +{
>>> +       struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
>>> +       unsigned short ss = uregs->ss;
>>> +       unsigned short cs = uregs->cs;
>>> +
>>> +       fred_info(uregs)->edata = fred_event_data(regs);
>>> +       uregs->ssx = regs->ssx;
>>> +       uregs->ss = ss;
>>> +       uregs->csx = regs->csx;
>>> +       uregs->current_stack_level = 0;
>>> +       uregs->cs = cs;
>> Hello
>>
>> If the ERETU instruction had tried to return from NMI to ring3 and just faulted,
>> is NMI still blocked?
>>
>> We know that IRET unconditionally enables NMI, but I can't find any clue in the
>> FRED's manual.
>>
>> In the pseudocode of ERETU in the manual, it seems that NMI is only enabled when
>> ERETU succeeds with bit28 in csx set.  If so, this code will fail to reenable
>> NMI if bit28 is not explicitly re-set in csx.
>
>IRET clearing NMI blocking is the source of an immense amount of grief,
>and ultimately the reason why Linux and others can't use supervisor
>shadow stacks at the moment.
>
>Changing this property, so NMIs only get unblocked on successful
>execution of an ERET{S,U}, was a key demand of the FRED spec.
>
>i.e. until you have successfully ERET*'d, you're still logically in the
>NMI handler and NMIs need to remain blocked even when handling the #GP
>from a bad ERET.
>
>~Andrew

This is correct.


* Re: [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-17 13:02       ` Lai Jiangshan
@ 2023-03-17 21:23         ` H. Peter Anvin
  0 siblings, 0 replies; 80+ messages in thread
From: H. Peter Anvin @ 2023-03-17 21:23 UTC (permalink / raw)
  To: Lai Jiangshan, andrew.cooper3
  Cc: Xin Li, linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen,
	peterz, seanjc, pbonzini, ravi.v.shankar

On March 17, 2023 6:02:52 AM PDT, Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>On Fri, Mar 17, 2023 at 5:56 PM <andrew.cooper3@citrix.com> wrote:
>>
>> On 17/03/2023 9:39 am, Lai Jiangshan wrote:
>> >> +#ifdef CONFIG_X86_FRED
>> >> +static bool ex_handler_eretu(const struct exception_table_entry *fixup,
>> >> +                            struct pt_regs *regs, unsigned long error_code)
>> >> +{
>> >> +       struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
>> >> +       unsigned short ss = uregs->ss;
>> >> +       unsigned short cs = uregs->cs;
>> >> +
>> >> +       fred_info(uregs)->edata = fred_event_data(regs);
>> >> +       uregs->ssx = regs->ssx;
>> >> +       uregs->ss = ss;
>> >> +       uregs->csx = regs->csx;
>> >> +       uregs->current_stack_level = 0;
>> >> +       uregs->cs = cs;
>> > Hello
>> >
>> > If the ERETU instruction had tried to return from NMI to ring3 and just faulted,
>> > is NMI still blocked?
>> >
>> > We know that IRET unconditionally enables NMI, but I can't find any clue in the
>> > FRED's manual.
>> >
>> > In the pseudocode of ERETU in the manual, it seems that NMI is only enabled when
>> > ERETU succeeds with bit28 in csx set.  If so, this code will fail to reenable
>> > NMI if bit28 is not explicitly re-set in csx.
>>
>> IRET clearing NMI blocking is the source of an immense amount of grief,
>> and ultimately the reason why Linux and others can't use supervisor
>> shadow stacks at the moment.
>>
>> Changing this property, so NMIs only get unblocked on successful
>> execution of an ERET{S,U}, was a key demand of the FRED spec.
>>
>> i.e. until you have successfully ERET*'d, you're still logically in the
>> NMI handler and NMIs need to remain blocked even when handling the #GP
>> from a bad ERET.
>>
>
>Handling of the #GP for a bad ERETU can be rescheduled. It is not
>OK to reschedule with NMI blocked.
>
>I think "regs->nmi = 1;" (not uregs->nmi) can fix the problem.
>

You are quite correct, since what we want here is to emulate having taken the fault in user space, which means that NMI would have been re-enabled by the never-executed return.

I think the "best" solution is:

regs->nmi = uregs->nmi;
uregs->nmi = 0;

... as enabling NMI is expected to have a performance penalty (being the less common case, an implementation which has a performance difference at all would want to optimize the non-NMI case), and I believe the compiler should be able to at least mostly fold those operations into ones it is doing anyway.




* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-17 13:35   ` Lai Jiangshan
@ 2023-03-17 21:32     ` H. Peter Anvin
  2023-03-18  6:33       ` Lai Jiangshan
  2023-03-20 16:44       ` Peter Zijlstra
  0 siblings, 2 replies; 80+ messages in thread
From: H. Peter Anvin @ 2023-03-17 21:32 UTC (permalink / raw)
  To: Lai Jiangshan, Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, peterz,
	andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On March 17, 2023 6:35:57 AM PDT, Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>Hello
>
>
>Comments in cpu_init_fred_exceptions() seem scarce for understanding.
>
>On Tue, Mar 7, 2023 at 11:07 AM Xin Li <xin3.li@intel.com> wrote:
>
>> +/*
>> + * Initialize FRED on this CPU. This cannot be __init as it is called
>> + * during CPU hotplug.
>> + */
>> +void cpu_init_fred_exceptions(void)
>> +{
>> +       wrmsrl(MSR_IA32_FRED_CONFIG,
>> +              FRED_CONFIG_ENTRYPOINT(fred_entrypoint_user) |
>> +              FRED_CONFIG_REDZONE(8) | /* Reserve for CALL emulation */
>> +              FRED_CONFIG_INT_STKLVL(0));
>
>What is it about "Reserve for CALL emulation"?
>
>I guess it relates to X86_TRAP_BP. In entry_64.S:
>
>        .if \vector == X86_TRAP_BP
>                /*
>                 * If coming from kernel space, create a 6-word gap to allow the
>                 * int3 handler to emulate a call instruction.
>                 */
>
>> +
>> +       wrmsrl(MSR_IA32_FRED_STKLVLS,
>> +              FRED_STKLVL(X86_TRAP_DB,  1) |
>> +              FRED_STKLVL(X86_TRAP_NMI, 2) |
>> +              FRED_STKLVL(X86_TRAP_MC,  2) |
>> +              FRED_STKLVL(X86_TRAP_DF,  3));
>
>Why each exception here needs a stack level > 0?
>Especially for X86_TRAP_DB and X86_TRAP_NMI.
>
>Why does or why does not X86_TRAP_VE have a stack level > 0?
>
>X86_TRAP_DF is the highest stack level, is it accidental
>or deliberate?
>
>Thanks
>Lai
>

Yes, the extra redzone space is there to allow for the call emulation without having to adjust the stack frame "manually".

In theory we could enable it only while code patching is in progress, but that would probably just result in stack overflows becoming utterly impossible to debug as we have to consider the worst case.

The purpose of separate stacks for NMI, #DB and #MC *in the kernel* (remember that user space faults are always taken on stack level 0) is to avoid overflowing the kernel stack. #DB in the kernel would imply the use of a kernel debugger.

#DF is the highest level because a #DF means "something went wrong *while delivering an exception*." The number of cases in which that can happen with FRED is drastically reduced, and they basically amount to "the stack you pointed me to is broken."

Thus, you basically always want to change stacks on #DF, which means it should be at the highest level.


* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-17 21:32     ` H. Peter Anvin
@ 2023-03-18  6:33       ` Lai Jiangshan
  2023-03-20 16:49         ` Peter Zijlstra
  2023-03-22  2:22         ` Li, Xin3
  2023-03-20 16:44       ` Peter Zijlstra
  1 sibling, 2 replies; 80+ messages in thread
From: Lai Jiangshan @ 2023-03-18  6:33 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Xin Li, linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen,
	peterz, andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Sat, Mar 18, 2023 at 5:32 AM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On March 17, 2023 6:35:57 AM PDT, Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> >Hello
> >
> >
> >Comments in cpu_init_fred_exceptions() seem scarce for understanding.
> >
> >On Tue, Mar 7, 2023 at 11:07 AM Xin Li <xin3.li@intel.com> wrote:
> >
> >> +/*
> >> + * Initialize FRED on this CPU. This cannot be __init as it is called
> >> + * during CPU hotplug.
> >> + */
> >> +void cpu_init_fred_exceptions(void)
> >> +{
> >> +       wrmsrl(MSR_IA32_FRED_CONFIG,
> >> +              FRED_CONFIG_ENTRYPOINT(fred_entrypoint_user) |
> >> +              FRED_CONFIG_REDZONE(8) | /* Reserve for CALL emulation */
> >> +              FRED_CONFIG_INT_STKLVL(0));
> >
> >What is it about "Reserve for CALL emulation"?
> >
> >I guess it relates to X86_TRAP_BP. In entry_64.S:
> >
> >        .if \vector == X86_TRAP_BP
> >                /*
> >                 * If coming from kernel space, create a 6-word gap to allow the
> >                 * int3 handler to emulate a call instruction.
> >                 */
> >
> >> +
> >> +       wrmsrl(MSR_IA32_FRED_STKLVLS,
> >> +              FRED_STKLVL(X86_TRAP_DB,  1) |
> >> +              FRED_STKLVL(X86_TRAP_NMI, 2) |
> >> +              FRED_STKLVL(X86_TRAP_MC,  2) |
> >> +              FRED_STKLVL(X86_TRAP_DF,  3));
> >
> >Why each exception here needs a stack level > 0?
> >Especially for X86_TRAP_DB and X86_TRAP_NMI.
> >
> >Why does or why does not X86_TRAP_VE have a stack level > 0?
> >
> >X86_TRAP_DF is the highest stack level, is it accidental
> >or deliberate?
> >
> >Thanks
> >Lai
> >
>
> Yes, the extra redzone space is there to allow for the call emulation without having to adjust the stack frame "manually".
>
> In theory we could enable it only while code patching is in progress, but that would probably just result in stack overflows becoming utterly impossible to debug as we have to consider the worst case.
>
> The purpose of separate stacks for NMI, #DB and #MC *in the kernel* (remember that user space faults are always taken on stack level 0) is to avoid overflowing the kernel stack. #DB in the kernel would imply the use of a kernel debugger.

Could you add it to the code, please? I think it can help other reviewers.

If there is no other concrete reason other than overflowing for
assigning NMI and #DB with a stack level > 0, #VE should also
be assigned with a stack level > 0, and #BP too. #VE can happen
anytime and anywhere, so it is subject to overflowing too.

>
>
> #DF is the highest level because a #DF means "something went wrong *while delivering an exception*." The number of cases for which that can happen with FRED is drastically reduced and basically amount to "the stack you pointed me to is broken."
>
> Thus, you basically always want to change stacks on #DF, which means it should be at the highest level.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
  2023-03-17  9:39   ` Lai Jiangshan
  2023-03-17  9:55     ` andrew.cooper3
@ 2023-03-18  7:55     ` Li, Xin3
  1 sibling, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-18  7:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, Christopherson, Sean, pbonzini,
	Shankar, Ravi V

> > +#ifdef CONFIG_X86_FRED
> > +static bool ex_handler_eretu(const struct exception_table_entry *fixup,
> > +                            struct pt_regs *regs, unsigned long error_code)
> > +{
> > +       struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
> > +       unsigned short ss = uregs->ss;
> > +       unsigned short cs = uregs->cs;
> > +
> > +       fred_info(uregs)->edata = fred_event_data(regs);
> > +       uregs->ssx = regs->ssx;
> > +       uregs->ss = ss;
> > +       uregs->csx = regs->csx;
> > +       uregs->current_stack_level = 0;
> > +       uregs->cs = cs;
> 
> Hello
> 
> If the ERETU instruction had tried to return from NMI to ring3 and just faulted, is
> NMI still blocked?

Do you mean the NMI FRED stack frame contains an invalid ring3 context?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 00/34] x86: enable FRED for x86-64
  2023-03-11  9:58 ` [PATCH v5 00/34] x86: enable FRED for x86-64 Kang, Shan
  2023-03-11 21:29   ` Li, Xin3
@ 2023-03-20  7:40   ` Kang, Shan
  1 sibling, 0 replies; 80+ messages in thread
From: Kang, Shan @ 2023-03-20  7:40 UTC (permalink / raw)
  To: Li, Xin3, kvm, linux-kernel, x86
  Cc: Christopherson, Sean, bp, dave.hansen, peterz, hpa, mingo, tglx,
	andrew.cooper3, pbonzini, Shankar, Ravi V
	pbonzini, Shankar, Ravi V

To check whether the v5 FRED patch set introduces any KVM regressions, we ran a
round of Kselftest in KVM guests. The results on x86-64 are as follows.

+----------------------------------------------------+-------+-------+-------+-------+
|                       Config                       |  Pass |  Fail |  Skip |  Hang |
+----------------------------------------------------+-------+-------+-------+-------+
| the 7th Intel(R) Core(TM) CPU                      |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set, FRED disabled |  2720 |  403  |  689  |   9   |
| guest: 6.3.0-rc1+                                  |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| the 7th Intel(R) Core(TM) CPU                      |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set                |  2720 |  403  |  689  |   9   |
| guest: 6.3.0-rc1+                                  |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/o FRED model             |       |       |       |       |
| host:  6.3.0-rc1+                                  |  1403 |  277  |  2127 |   14  |
| guest: 6.3.0-rc1+                                  |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/o FRED model             |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set                |  1403 |  277  |  2127 |   14  |
| guest: 6.3.0-rc1+ w/ FRED patch set, FRED disabled |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/o FRED model             |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set                |  1403 |  277  |  2127 |   14  |
| guest: 6.3.0-rc1+ w/ FRED patch set                |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/ FRED model              |       |       |       |       |
| host:  6.3.0-rc1+                                  |  1404 |  276  |  2127 |   14  |
| guest: 6.3.0-rc1+ w/ FRED patch set                |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/ FRED model              |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set, FRED disabled |  1404 |  276  |  2127 |   14  |
| guest: 6.3.0-rc1+                                  |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/ FRED model              |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set, FRED disabled |  1404 |  276  |  2127 |   14  |
| guest: 6.3.0-rc1+ w/ FRED patch set                |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/ FRED model              |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set                |  1404 |  276  |  2127 |   14  |
| guest: 6.3.0-rc1+                                  |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/ FRED model              |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set                |  1404 |  276  |  2127 |   14  |
| guest: 6.3.0-rc1+ w/ FRED patch set, FRED disabled |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+
| Intel Simics® Simulator w/ FRED model              |       |       |       |       |
| host:  6.3.0-rc1+ w/ FRED patch set                |  1404 |  276  |  2127 |   14  |
| guest: 6.3.0-rc1+ w/ FRED patch set                |       |       |       |       |
+----------------------------------------------------+-------+-------+-------+-------+

With the Simics FRED model one more test case passes, x86:test_vsyscall_32,
due to https://lore.kernel.org/lkml/20230311084824.2340-1-xin3.li@intel.com/.

Thanks
   --Shan

On Sat, 2023-03-11 at 09:58 +0000, Kang, Shan wrote:
> We tested the v5 FRED patch set on the Intel Simics® Simulator and a machine
> with a 7th Intel(R) Core(TM) CPU.
> 
> Following are the Kselftest results on X86-64.
> +--------------------------------------------+-------+-------+-------+-------+
> |                  Config                    |  Pass |  Fail |  Skip |  Hang |
> +--------------------------------------------+-------+-------+-------+-------+
> |       the 7th Intel(R) Core(TM) CPU        |  3078 |  458  |  734  |   5   |
> |                 6.3.0-rc1+                 |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> |       the 7th Intel(R) Core(TM) CPU        |  3078 |  458  |  734  |   5   |
> |        6.3.0-rc1+ w/ FRED patch set        |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> |   Intel Simics® Simulator w/o FRED model   |  1888 |  271  |  2105 |   11  |
> |                 6.3.0-rc1+                 |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> |   Intel Simics® Simulator w/o FRED model   |  1888 |  271  |  2105 |   11  |
> |        6.3.0-rc1+ w/ FRED patch set        |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> |   Intel Simics® Simulator w/ FRED model    |  1889 |  270  |  2105 |   11  |
> |                 6.3.0-rc1+                 |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> |   Intel Simics® Simulator w/ FRED model    |  1889 |  270  |  2105 |   11  |
> | 6.3.0-rc1+ w/ FRED patch set FRED disabled |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> |   Intel Simics® Simulator w/ FRED model    |  1888 |  270  |  2105 |   12  |
> |        6.3.0-rc1+ w/ FRED patch set        |       |       |       |       |
> +--------------------------------------------+-------+-------+-------+-------+
> 
> The following issues are seen in this round of test.
> +----------------+----------------+----------------+----------------+
> |                | x86:test_      | bpf:test_progs | x86:sysret     |
> |                | vsyscall_32    |                |    _rip_64     |
> +----------------+----------------+----------------+----------------+
> |    the 7th     |                |                |                |
> |    Intel(R)    |      FAIL      |      FAIL      |      PASS      |
> |  Core(TM) CPU  |                |                |                |
> |   6.3.0-rc1+   |                |                |                |
> +----------------+----------------+----------------+----------------+
> |    the 7th     |                |                |                |
> |    Intel(R)    |                |                |                |
> |  Core(TM) CPU  |      FAIL      |      FAIL      |      PASS      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> | Simulator w/o  |      FAIL      |      FAIL      |      PASS      |
> |   FRED model   |                |                |                |
> |   6.3.0-rc1+   |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> | Simulator w/o  |                |                |                |
> |   FRED model   |      FAIL      |      FAIL      |      PASS      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> |  Simulator w/  |      PASS      |      FAIL      |      PASS      |
> |   FRED model   |                |                |                |
> |   6.3.0-rc1+   |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> |  Simulator w/  |                |                |                |
> |   FRED model   |      PASS      |      FAIL      |      PASS      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> | FRED disabled  |                |                |                |
> +----------------+----------------+----------------+----------------+
> | Intel Simics®  |                |                |                |
> |  Simulator w/  |                |                |                |
> |   FRED model   |      PASS      |      HANG      |      FAIL      |
> | 6.3.0-rc1+ w/  |                |                |                |
> | FRED patch set |                |                |                |
> +----------------+----------------+----------------+----------------+
> 
> The test "x86:sysret_rip_64" is NOT a valid test on FRED, and there is a fix
> from Ammar Faizi after we discussed it in the LKML.
> 
> The test "bpf:test_progs" is still in investigation.
> 
> The "x86:test_vsyscall_32" is a regression since the v3 FRED patch set.
> 
> Thanks
>    --Shan
> 
> On Mon, 2023-03-06 at 18:39 -0800, Xin Li wrote:
> > This patch set enables FRED for x86-64.
> > 
> > The Intel flexible return and event delivery (FRED) architecture defines simple
> > new transitions that change privilege level (ring transitions). The FRED
> > architecture was designed with the following goals:
> > 1) Improve overall performance and response time by replacing event delivery
> > through the interrupt descriptor table (IDT event delivery) and event return by
> > the IRET instruction with lower latency transitions.
> > 2) Improve software robustness by ensuring that event delivery establishes the
> > full supervisor context and that event return establishes the full user context.
> > 
> > The new transitions defined by the FRED architecture are FRED event delivery and,
> > for returning from events, two FRED return instructions. FRED event delivery can
> > effect a transition from ring 3 to ring 0, but it is used also to deliver events
> > incident to ring 0. One FRED instruction (ERETU) effects a return from ring 0 to
> > ring 3, while the other (ERETS) returns while remaining in ring 0.
> > 
> > Search for the latest FRED spec in most search engines with this search pattern:
> > 
> >   site:intel.com FRED (flexible return and event delivery) specification
> > 
> > As of now there is no publicly available CPU supporting FRED, thus the Intel
> > Simics® Simulator is used as software development and testing vehicles. And
> > it can be downloaded from:
> >   https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html
> > 
> > To enable FRED, the Simics package 8112 QSP-CPU needs to be installed with CPU
> > model configured as:
> > 	$cpu_comp_class = "x86-experimental-fred"
> > 
> > Longer term, we should refactor common code shared by FRED and IDT into common
> > shared files, and contain IDT code using a new config CONFIG_X86_IDT.
> > 
> > Changes since v4:
> > * Rebased against v6.3-rc1.
> > * Do NOT use the term "injection", which in the KVM context means to
> >   reinject an event into the guest (Sean Christopherson).
> > * Add the explanation of why to execute "int $2" to invoke the NMI handler
> >   in NMI caused VM exits (Sean Christopherson).
> > * Use cs/ss instead of csx/ssx when initializing the pt_regs structure
> >   for calling external_interrupt(), otherwise it breaks i386 build.
> > 
> > Changes since v3:
> > * Call external_interrupt() to handle IRQ in IRQ caused VM exits.
> > * Execute "int $2" to handle NMI in NMI caused VM exits.
> > * Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
> >   (Andrew Cooper).
> > 
> > Changes since v2:
> > * Improve comments for changes in arch/x86/include/asm/idtentry.h.
> > 
> > Changes since v1:
> > * call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
> >   handler (Peter Zijlstra).
> > * Initialize a FRED exception handler to fred_bad_event() instead of NULL
> >   if no FRED handler defined for an exception vector (Peter Zijlstra).
> > * Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
> >   down into individual FRED exception handlers, instead of in the dispatch
> >   framework (Peter Zijlstra).
> > 
> > 
> > H. Peter Anvin (Intel) (24):
> >   x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR
> >   x86/traps: add a system interrupt table for system interrupt dispatch
> >   x86/traps: add external_interrupt() to dispatch external interrupts
> >   x86/cpufeature: add the cpu feature bit for FRED
> >   x86/opcode: add ERETU, ERETS instructions to x86-opcode-map
> >   x86/objtool: teach objtool about ERETU and ERETS
> >   x86/cpu: add X86_CR4_FRED macro
> >   x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED)
> >   x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support
> >   x86/cpu: add MSR numbers for FRED configuration
> >   x86/fred: header file with FRED definitions
> >   x86/fred: make unions for the cs and ss fields in struct pt_regs
> >   x86/fred: reserve space for the FRED stack frame
> >   x86/fred: add a page fault entry stub for FRED
> >   x86/fred: add a debug fault entry stub for FRED
> >   x86/fred: add a NMI entry stub for FRED
> >   x86/fred: FRED entry/exit and dispatch code
> >   x86/fred: FRED initialization code
> >   x86/fred: update MSR_IA32_FRED_RSP0 during task switch
> >   x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is
> >     enabled
> >   x86/fred: disallow the swapgs instruction when FRED is enabled
> >   x86/fred: no ESPFIX needed when FRED is enabled
> >   x86/fred: allow single-step trap and NMI when starting a new thread
> >   x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f
> > 
> > Xin Li (10):
> >   x86/traps: add install_system_interrupt_handler()
> >   x86/traps: export external_interrupt() for VMX IRQ reinjection
> >   x86/fred: header file for event types
> >   x86/fred: add a machine check entry stub for FRED
> >   x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
> >   x86/ia32: do not modify the DPL bits for a null selector
> >   x86/fred: allow dynamic stack frame size
> >   x86/fred: disable FRED by default in its early stage
> >   KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM
> >     exits
> >   KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits
> >     when FRED is enabled
> > 
> >  .../admin-guide/kernel-parameters.txt         |   4 +
> >  arch/x86/Kconfig                              |   9 +
> >  arch/x86/entry/Makefile                       |   5 +-
> >  arch/x86/entry/entry_32.S                     |   2 +-
> >  arch/x86/entry/entry_64.S                     |   5 +
> >  arch/x86/entry/entry_64_fred.S                |  59 +++++
> >  arch/x86/entry/entry_fred.c                   | 234 ++++++++++++++++++
> >  arch/x86/entry/vsyscall/vsyscall_64.c         |   2 +-
> >  arch/x86/include/asm/cpufeatures.h            |   1 +
> >  arch/x86/include/asm/disabled-features.h      |   8 +-
> >  arch/x86/include/asm/entry-common.h           |   3 +
> >  arch/x86/include/asm/event-type.h             |  17 ++
> >  arch/x86/include/asm/extable_fixup_types.h    |   4 +-
> >  arch/x86/include/asm/fred.h                   | 131 ++++++++++
> >  arch/x86/include/asm/idtentry.h               |  76 +++++-
> >  arch/x86/include/asm/irq.h                    |   5 +
> >  arch/x86/include/asm/irq_vectors.h            |  15 +-
> >  arch/x86/include/asm/msr-index.h              |  13 +-
> >  arch/x86/include/asm/processor.h              |  12 +-
> >  arch/x86/include/asm/ptrace.h                 |  36 ++-
> >  arch/x86/include/asm/switch_to.h              |  10 +-
> >  arch/x86/include/asm/thread_info.h            |  35 +--
> >  arch/x86/include/asm/traps.h                  |  13 +
> >  arch/x86/include/asm/vmx.h                    |  17 +-
> >  arch/x86/include/uapi/asm/processor-flags.h   |   2 +
> >  arch/x86/kernel/Makefile                      |   1 +
> >  arch/x86/kernel/apic/apic.c                   |  11 +-
> >  arch/x86/kernel/apic/vector.c                 |   8 +-
> >  arch/x86/kernel/cpu/acrn.c                    |   7 +-
> >  arch/x86/kernel/cpu/common.c                  |  88 ++++---
> >  arch/x86/kernel/cpu/mce/core.c                |  11 +
> >  arch/x86/kernel/cpu/mshyperv.c                |  22 +-
> >  arch/x86/kernel/espfix_64.c                   |   8 +
> >  arch/x86/kernel/fred.c                        |  73 ++++++
> >  arch/x86/kernel/head_32.S                     |   3 +-
> >  arch/x86/kernel/idt.c                         |   6 +-
> >  arch/x86/kernel/irq.c                         |   6 +-
> >  arch/x86/kernel/irqinit.c                     |   7 +-
> >  arch/x86/kernel/kvm.c                         |   4 +-
> >  arch/x86/kernel/nmi.c                         |  28 +++
> >  arch/x86/kernel/process.c                     |   5 +
> >  arch/x86/kernel/process_64.c                  |  21 +-
> >  arch/x86/kernel/signal_32.c                   |  21 +-
> >  arch/x86/kernel/traps.c                       | 175 +++++++++++--
> >  arch/x86/kvm/vmx/vmx.c                        |  33 ++-
> >  arch/x86/lib/x86-opcode-map.txt               |   2 +-
> >  arch/x86/mm/extable.c                         |  28 +++
> >  arch/x86/mm/fault.c                           |  20 +-
> >  drivers/xen/events/events_base.c              |   5 +-
> >  kernel/fork.c                                 |   6 +
> >  tools/arch/x86/include/asm/cpufeatures.h      |   1 +
> >  .../arch/x86/include/asm/disabled-features.h  |   8 +-
> >  tools/arch/x86/include/asm/msr-index.h        |  13 +-
> >  tools/arch/x86/lib/x86-opcode-map.txt         |   2 +-
> >  tools/objtool/arch/x86/decode.c               |  19 +-
> >  55 files changed, 1185 insertions(+), 175 deletions(-)
> >  create mode 100644 arch/x86/entry/entry_64_fred.S
> >  create mode 100644 arch/x86/entry/entry_fred.c
> >  create mode 100644 arch/x86/include/asm/event-type.h
> >  create mode 100644 arch/x86/include/asm/fred.h
> >  create mode 100644 arch/x86/kernel/fred.c
> > 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts
  2023-03-07  2:39 ` [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts Xin Li
@ 2023-03-20 15:36   ` Peter Zijlstra
  2023-03-20 17:42     ` Peter Zijlstra
  2023-03-20 17:53     ` Li, Xin3
  0 siblings, 2 replies; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 15:36 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023 at 06:39:16PM -0800, Xin Li wrote:

> +#ifndef CONFIG_X86_LOCAL_APIC
> +/*
> + * Used when local APIC is not compiled into the kernel, but
> + * external_interrupt() needs dispatch_spurious_interrupt().
> + */
> +DEFINE_IDTENTRY_IRQ(spurious_interrupt)
> +{
> +	pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
> +		vector, smp_processor_id());
> +}
> +#endif
> +
> +/*
> + * External interrupt dispatch function.
> + *
> + * Until/unless dispatch_common_interrupt() can be taught to deal with the
> + * special system vectors, split the dispatch.
> + *
> + * Note: dispatch_common_interrupt() already deals with IRQ_MOVE_CLEANUP_VECTOR.
> + */
> +int external_interrupt(struct pt_regs *regs, unsigned int vector)
> +{
> +	unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;
> +
> +	if (vector < FIRST_EXTERNAL_VECTOR) {
> +		pr_err("invalid external interrupt vector %d\n", vector);
> +		return -EINVAL;
> +	}
> +
> +	if (sysvec < NR_SYSTEM_VECTORS) {
> +		if (system_interrupt_handlers[sysvec])
> +			system_interrupt_handlers[sysvec](regs);
> +		else
> +			dispatch_spurious_interrupt(regs, vector);

ISTR suggesting you can get rid of this branch if you stuff
system_interrupt_handlers[] with dispatch_spurious_interrupt instead of
NULL.

> +	} else {
> +		dispatch_common_interrupt(regs, vector);
> +	}
> +
> +	return 0;
> +}
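
The branch-removal idea suggested above can be sketched in stand-alone C. This is only an illustration, not the patch's code: the table size, the handler signature and all names (NR_HANDLERS, init_handlers, install_handler, dispatch) are hypothetical. Every slot starts out pointing at a default "spurious" handler, so the dispatch path can call through the table unconditionally.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HANDLERS 4	/* hypothetical table size for the sketch */

typedef void (*handler_fn)(unsigned int vector);

static unsigned int spurious_count;
static unsigned int real_count;

/* Default handler: stands in for the spurious-interrupt path. */
static void spurious_handler(unsigned int vector)
{
	(void)vector;
	spurious_count++;
}

/* Example of a handler installed for a specific vector. */
static void real_handler(unsigned int vector)
{
	(void)vector;
	real_count++;
}

static handler_fn handlers[NR_HANDLERS];

/* Fill every slot with the default instead of leaving it NULL. */
static void init_handlers(void)
{
	size_t i;

	for (i = 0; i < NR_HANDLERS; i++)
		handlers[i] = spurious_handler;
}

static void install_handler(unsigned int vector, handler_fn fn)
{
	assert(vector < NR_HANDLERS && fn != NULL);
	handlers[vector] = fn;
}

/* No NULL check needed: every slot always holds a valid handler. */
static void dispatch(unsigned int vector)
{
	assert(vector < NR_HANDLERS);
	handlers[vector](vector);
}
```

The cost is one extra initialization pass; in exchange the hot path loses a conditional branch and can never call through a NULL pointer.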

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 20/34] x86/fred: add a machine check entry stub for FRED
  2023-03-07  2:39 ` [PATCH v5 20/34] x86/fred: add a machine check " Xin Li
@ 2023-03-20 16:00   ` Peter Zijlstra
  2023-03-21  0:04     ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 16:00 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023 at 06:39:32PM -0800, Xin Li wrote:
> Add a machine check entry stub for FRED.
> 
> Unlike IDT, no need to save/restore dr7 in FRED machine check handler.

Given how fragile MCE is, the question should be, do we ever want hw
breakpoints to happen while it is running?

If the hw-breakpoint handler trips on the same memory fail that got us
into the mce the first time, we're dead.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-17 21:32     ` H. Peter Anvin
  2023-03-18  6:33       ` Lai Jiangshan
@ 2023-03-20 16:44       ` Peter Zijlstra
  2023-03-21  0:13         ` Li, Xin3
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 16:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Lai Jiangshan, Xin Li, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Fri, Mar 17, 2023 at 02:32:28PM -0700, H. Peter Anvin wrote:
> The purpose of separate stacks for NMI, #DB and #MC *in the kernel*
> (remember that user space faults are always taken on stack level 0) is
> to avoid overflowing the kernel stack. #DB in the kernel would imply
> the use of a kernel debugger.

Perf (and through it bpf) also has access to #DB. They can set
breakpoints on kernel instructions/memory just fine provided permission
etc.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-18  6:33       ` Lai Jiangshan
@ 2023-03-20 16:49         ` Peter Zijlstra
  2023-03-21  0:12           ` Li, Xin3
  2023-03-22  2:22         ` Li, Xin3
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 16:49 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: H. Peter Anvin, Xin Li, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Sat, Mar 18, 2023 at 02:33:30PM +0800, Lai Jiangshan wrote:
> If there is no other concrete reason other than overflowing for
> assigning NMI and #DB with a stack level > 0, #VE should also
> be assigned with a stack level > 0, and #BP too. #VE can happen
> anytime and anywhere, so it is subject to overflowing too.

So #BP needs the stack-gap (redzone) for text_poke_bp().

#BP can end up in kprobes which can then end up in ftrace/perf,
depending on how it's all wired up.

#VE is currently a trainwreck vs NMI/MCE, but I think FRED solves the
worst of that. I'm not exactly sure how deep the #VE handler goes.
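
To illustrate why the breakpoint path needs a gap below the interrupted stack pointer, here is a user-space model of emulating a CALL from a breakpoint handler. It is a sketch under assumptions: struct fake_regs and the emulate_* helpers are hypothetical stand-ins, and the instruction sizes assume a one-byte int3 patched over a five-byte near CALL. The emulated push writes the return address just below the interrupted stack pointer, which is exactly the area that the exception frame would otherwise occupy if no gap were reserved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INT3_INSN_SIZE 1	/* one-byte breakpoint opcode */
#define CALL_INSN_SIZE 5	/* near CALL with a 32-bit displacement */

/* Hypothetical, minimal stand-in for the saved register state. */
struct fake_regs {
	uint64_t ip;	/* address just past the int3 byte */
	uint64_t sp;	/* interrupted stack pointer */
};

/* Push one word onto the interrupted stack, as the CALL would have. */
static void emulate_push(struct fake_regs *regs, uint64_t val)
{
	assert(regs != NULL);
	regs->sp -= sizeof(val);
	memcpy((void *)(uintptr_t)regs->sp, &val, sizeof(val));
}

/*
 * Emulate the CALL that the int3 byte temporarily replaced: push the
 * address of the instruction following the CALL, then jump to target.
 */
static void emulate_call(struct fake_regs *regs, uint64_t target)
{
	emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	regs->ip = target;
}
```

With the gap in place the handler can perform this push without first moving its own exception frame out of the way.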



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch
  2023-03-07  2:39 ` [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch Xin Li
@ 2023-03-20 16:52   ` Peter Zijlstra
  2023-03-20 23:54     ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 16:52 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023 at 06:39:35PM -0800, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> MSR_IA32_FRED_RSP0 is used during ring 3 event delivery, and needs to
> be updated to point to the top of next task stack during task switch.
> 
> Update MSR_IA32_FRED_RSP0 with WRMSR instruction for now, and will use
> WRMSRNS/WRMSRLIST for performance once it gets upstreamed.
> 
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  arch/x86/include/asm/switch_to.h | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 5c91305d09d2..00fd85abc1d2 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -68,9 +68,16 @@ static inline void update_task_stack(struct task_struct *task)
>  #ifdef CONFIG_X86_32
>  	this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
>  #else
> -	/* Xen PV enters the kernel on the thread stack. */
> -	if (cpu_feature_enabled(X86_FEATURE_XENPV))
> +	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> +		/*
> +		 * Will use WRMSRNS/WRMSRLIST for performance once it's upstreamed.
> +		 */
> +		wrmsrl(MSR_IA32_FRED_RSP0,
> +		       task_top_of_stack(task) + TOP_OF_KERNEL_STACK_PADDING);
> +	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {

Whee, so hardware will really only ever look at this when RSP0? I don't
need to worry about exceptions during context switch?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 25/34] x86/fred: disallow the swapgs instruction when FRED is enabled
  2023-03-07  2:39 ` [PATCH v5 25/34] x86/fred: disallow the swapgs instruction " Xin Li
@ 2023-03-20 16:54   ` Peter Zijlstra
  2023-03-20 17:58     ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 16:54 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023 at 06:39:37PM -0800, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> The FRED architecture establishes the full supervisor/user through:
> 1) FRED event delivery swaps the value of the GS base address and
>    that of the IA32_KERNEL_GS_BASE MSR.
> 2) ERETU swaps the value of the GS base address and that of the
>    IA32_KERNEL_GS_BASE MSR.
> Thus, the swapgs instruction is disallowed when FRED is enabled,
> otherwise it cauess #UD.
                 ^^^ --- new word :-)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts
  2023-03-20 15:36   ` Peter Zijlstra
@ 2023-03-20 17:42     ` Peter Zijlstra
  2023-03-20 23:47       ` Li, Xin3
  2023-03-20 17:53     ` Li, Xin3
  1 sibling, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-20 17:42 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, seanjc, pbonzini, ravi.v.shankar

On Mon, Mar 20, 2023 at 04:36:30PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 06, 2023 at 06:39:16PM -0800, Xin Li wrote:
> 
> > +#ifndef CONFIG_X86_LOCAL_APIC
> > +/*
> > + * Used when local APIC is not compiled into the kernel, but
> > + * external_interrupt() needs dispatch_spurious_interrupt().
> > + */
> > +DEFINE_IDTENTRY_IRQ(spurious_interrupt)
> > +{
> > +	pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
> > +		vector, smp_processor_id());
> > +}
> > +#endif
> > +
> > +/*
> > + * External interrupt dispatch function.
> > + *
> > + * Until/unless dispatch_common_interrupt() can be taught to deal with the
> > + * special system vectors, split the dispatch.
> > + *
> > + * Note: dispatch_common_interrupt() already deals with IRQ_MOVE_CLEANUP_VECTOR.
> > + */
> > +int external_interrupt(struct pt_regs *regs, unsigned int vector)
> > +{
> > +	unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;
> > +
> > +	if (vector < FIRST_EXTERNAL_VECTOR) {
> > +		pr_err("invalid external interrupt vector %d\n", vector);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (sysvec < NR_SYSTEM_VECTORS) {
> > +		if (system_interrupt_handlers[sysvec])
> > +			system_interrupt_handlers[sysvec](regs);
> > +		else
> > +			dispatch_spurious_interrupt(regs, vector);
> 
> ISTR suggesting you can get rid of this branch if you stuff
> system_interrupt_handlers[] with dispatch_spurious_interrupt instead of
> NULL.

Ah, I suggested that for another function vector, but it applies here
too I suppose :-)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts
  2023-03-20 15:36   ` Peter Zijlstra
  2023-03-20 17:42     ` Peter Zijlstra
@ 2023-03-20 17:53     ` Li, Xin3
  1 sibling, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-20 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > +	if (sysvec < NR_SYSTEM_VECTORS) {
> > +		if (system_interrupt_handlers[sysvec])
> > +			system_interrupt_handlers[sysvec](regs);
> > +		else
> > +			dispatch_spurious_interrupt(regs, vector);
> 
> ISTR suggesting you can get rid of this branch if you stuff
> system_interrupt_handlers[] with dispatch_spurious_interrupt instead of NULL.

You're right, however I only fixed one.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 25/34] x86/fred: disallow the swapgs instruction when FRED is enabled
  2023-03-20 16:54   ` Peter Zijlstra
@ 2023-03-20 17:58     ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-20 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > The FRED architecture establishes the full supervisor/user through:
> > 1) FRED event delivery swaps the value of the GS base address and
> >    that of the IA32_KERNEL_GS_BASE MSR.
> > 2) ERETU swaps the value of the GS base address and that of the
> >    IA32_KERNEL_GS_BASE MSR.
> > Thus, the swapgs instruction is disallowed when FRED is enabled,
> > otherwise it cauess #UD.
>                  ^^^ --- new word :-)

My stupid fingers...

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts
  2023-03-20 17:42     ` Peter Zijlstra
@ 2023-03-20 23:47       ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-20 23:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > > +	if (sysvec < NR_SYSTEM_VECTORS) {
> > > +		if (system_interrupt_handlers[sysvec])
> > > +			system_interrupt_handlers[sysvec](regs);
> > > +		else
> > > +			dispatch_spurious_interrupt(regs, vector);
> >
> > ISTR suggesting you can get rid of this branch if you stuff
> > system_interrupt_handlers[] with dispatch_spurious_interrupt instead
> > of NULL.
> 
> Ah, I suggested that for another function vector, but it applies here too I suppose :-)

Of course!

We just need to use a wrapper as dispatch_spurious_interrupt() takes an extra
parameter "vector".
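
For illustration only, here is a minimal user-space sketch of that idea:
fill every slot of the handler table with a spurious-interrupt wrapper so
the dispatcher never has to test for NULL. All names below (regs_model,
spurious_wrapper, dispatch_sysvec, the FIRST_SYSTEM_VECTOR value, and the
vector field the wrapper reads) are assumptions made for the example, not
the code in the patch.

```c
#include <stddef.h>

#define FIRST_SYSTEM_VECTOR 0xec  /* illustrative value, not from the patch */
#define NR_VECTORS          256
#define NR_SYSTEM_VECTORS   (NR_VECTORS - FIRST_SYSTEM_VECTOR)

/* Hypothetical stand-in for struct pt_regs; the vector field is assumed. */
struct regs_model {
	unsigned int vector;
	unsigned int handled_by;	/* 1 = real handler, 2 = spurious */
};

typedef void (*sysvec_handler_t)(struct regs_model *regs);

/* Models a spurious handler that needs the extra "vector" argument. */
static void spurious_model(struct regs_model *regs, unsigned int vector)
{
	(void)vector;
	regs->handled_by = 2;
}

/* Wrapper with the table's signature; it recovers the vector from regs. */
static void spurious_wrapper(struct regs_model *regs)
{
	spurious_model(regs, regs->vector);
}

/* Models one registered system vector handler. */
static void timer_model(struct regs_model *regs)
{
	regs->handled_by = 1;
}

static sysvec_handler_t handlers[NR_SYSTEM_VECTORS];

/* Default every slot to the wrapper, then install the real handlers. */
static void init_handlers(void)
{
	for (size_t i = 0; i < NR_SYSTEM_VECTORS; i++)
		handlers[i] = spurious_wrapper;
	handlers[0] = timer_model;
}

/* No NULL check: every slot is guaranteed to hold a callable handler. */
static int dispatch_sysvec(struct regs_model *regs, unsigned int vector)
{
	unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;

	if (sysvec >= NR_SYSTEM_VECTORS)
		return -1;

	regs->vector = vector;
	handlers[sysvec](regs);
	return 0;
}
```

In this model, calling init_handlers() once and then
dispatch_sysvec(&regs, FIRST_SYSTEM_VECTOR + 1) ends up in the spurious
path without any branch on the table entry.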

Thanks!
  Xin



^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch
  2023-03-20 16:52   ` Peter Zijlstra
@ 2023-03-20 23:54     ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-20 23:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > -	if (cpu_feature_enabled(X86_FEATURE_XENPV))
> > +	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> > +		/*
> > +		 * Will use WRMSRNS/WRMSRLIST for performance once it's upstreamed.
> > +		 */
> > +		wrmsrl(MSR_IA32_FRED_RSP0,
> > +		       task_top_of_stack(task) + TOP_OF_KERNEL_STACK_PADDING);
> > +	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
> 
> Whee, so hardware will really only ever look at this when RSP0? I don't need to
> worry about exceptions during context switch?

You're right, we don't.

RSP0 is only used in ring3. Exceptions from ring0 just keep using the current
kernel stack unless a higher stack level needs to be used, e.g., RSP3 for #DF.
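
As a rough model of the rule above (a sketch only, under simplifying
assumptions): events from ring 3 land on stack level 0 and load RSP0,
while events from ring 0 move to a new stack only when the event's stack
level is higher than the current one. The struct and function names are
made up for the example, and special cases such as nested exceptions
during delivery are deliberately ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRED_NR_STACK_LEVELS 4

/* Hypothetical model of the per-CPU state involved in stack selection. */
struct fred_cpu_model {
	unsigned int csl;				/* current stack level */
	uint64_t rsp;					/* current stack pointer */
	uint64_t fred_rsp[FRED_NR_STACK_LEVELS];	/* models FRED RSP0..RSP3 */
};

/*
 * Simplified stack selection on event delivery:
 *  - from ring 3: stack level 0, RSP loaded from RSP0
 *  - from ring 0: level is max(current, event); RSP is reloaded only
 *    when the level actually increases
 */
static void fred_pick_stack(struct fred_cpu_model *cpu, bool from_ring3,
			    unsigned int event_sl)
{
	unsigned int new_sl;

	if (from_ring3)
		new_sl = 0;
	else
		new_sl = event_sl > cpu->csl ? event_sl : cpu->csl;

	if (from_ring3 || new_sl > cpu->csl)
		cpu->rsp = cpu->fred_rsp[new_sl];

	cpu->csl = new_sl;
}
```

With this model, an exception taken in ring 0 at stack level 0 leaves
rsp untouched, which is why a stale RSP0 value during context switch is
harmless as long as the CPU stays in ring 0.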

Thanks!
  Xin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 20/34] x86/fred: add a machine check entry stub for FRED
  2023-03-20 16:00   ` Peter Zijlstra
@ 2023-03-21  0:04     ` Li, Xin3
  2023-03-21  8:59       ` Peter Zijlstra
  0 siblings, 1 reply; 80+ messages in thread
From: Li, Xin3 @ 2023-03-21  0:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > Unlike IDT, no need to save/restore dr7 in FRED machine check handler.
> 
> Given how fragile MCE is, the question should be, do we ever want hw
> breakpoints to happen while it is running?

HW breakpoints still work if they are properly configured.

> If the hw-breakpoint handler trips on the same memory fail that got us into the
> mce the first time, we're dead.

Right.

Unless the MCIP bit is turned off any subsequent #MC goes to shutdown
("machine is screwed").

It's the kernel debugger's responsibility to decide how to proceed in such
cases. But if the kernel debugger itself is in a screwed memory region, we
are soooooo dead.

Thanks!
  Xin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-20 16:49         ` Peter Zijlstra
@ 2023-03-21  0:12           ` Li, Xin3
  2023-03-21  1:02             ` andrew.cooper3
  0 siblings, 1 reply; 80+ messages in thread
From: Li, Xin3 @ 2023-03-21  0:12 UTC (permalink / raw)
  To: Peter Zijlstra, Lai Jiangshan
  Cc: H. Peter Anvin, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > If there is no other concrete reason other than overflowing for
> > assigning NMI and #DB with a stack level > 0, #VE should also be
> > assigned with a stack level > 0, and #BP too. #VE can happen anytime
> > and anywhere, so it is subject to overflowing too.
> 
> So #BP needs the stack-gap (redzone) for text_poke_bp().
> 
> #BP can end up in kprobes which can then end up in ftrace/perf, depending on
> how it's all wired up.
> 
> #VE is currently a trainwreck vs NMI/MCE, but I think FRED solves the worst of
> that. I'm not exactly sure how deep the #VE handler goes.
> 

#VE under IDT is *not* using an IST, so we need a solid rationale here.

Thanks!
  Xin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-20 16:44       ` Peter Zijlstra
@ 2023-03-21  0:13         ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-21  0:13 UTC (permalink / raw)
  To: Peter Zijlstra, H. Peter Anvin
  Cc: Lai Jiangshan, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> > The purpose of separate stacks for NMI, #DB and #MC *in the kernel*
> > (remember that user space faults are always taken on stack level 0) is
> > to avoid overflowing the kernel stack. #DB in the kernel would imply
> > the use of a kernel debugger.
> 
> Perf (and through it bpf) also has access to #DB. They can set
> breakpoints on kernel instructions/memory just fine provided permission
> etc.

So they are still *kernel* debuggers :)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-21  0:12           ` Li, Xin3
@ 2023-03-21  1:02             ` andrew.cooper3
  2023-03-21  7:49               ` Li, Xin3
  2023-03-22 16:29               ` Dave Hansen
  0 siblings, 2 replies; 80+ messages in thread
From: andrew.cooper3 @ 2023-03-21  1:02 UTC (permalink / raw)
  To: Li, Xin3, Peter Zijlstra, Lai Jiangshan
  Cc: H. Peter Anvin, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

On 21/03/2023 12:12 am, Li, Xin3 wrote:
>>> If there is no other concrete reason other than overflowing for
>>> assigning NMI and #DB with a stack level > 0, #VE should also be
>>> assigned with a stack level > 0, and #BP too. #VE can happen anytime
>>> and anywhere, so it is subject to overflowing too.
>> So #BP needs the stack-gap (redzone) for text_poke_bp().
>>
>> #BP can end up in kprobes which can then end up in ftrace/perf, depending on
>> how it's all wired up.
>>
>> #VE is currently a trainwreck vs NMI/MCE, but I think FRED solves the worst of
>> that. I'm not exactly sure how deep the #VE handler goes.
>>
> VE under IDT is *not* using an IST, we need some solid rationales here.

#VE, and #VC on AMD, are borderline unusable.  Both under IDT and FRED.

The reason #VE is not IST is because there are plenty of real cases
where a non-malicious outer hypervisor could create reentrant faults
that lose program state.  e.g. hitting an IO instruction, then hitting
an emulated MSR.

There are fewer cases where a non-IST #VE ends up in a re-entrant fault
(IIRC, you can still manage it by unmapping the entry stack), but you're
still trusting the outer hypervisor to not e.g. unmap the SYSCALL entry
point.

FRED gets rid of the "reentrant fault overwriting it on the stack" case,
and removes the syscall gap case, replacing them instead with a stack
overflow in the worst case because there is still no upper bound to how
many times #VE can actually be delivered in the course of servicing a
single #VE.

~Andrew

P.S. While I hate to cite myself, if you haven't read
https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
yet, do so.  It did feed into some of the FRED design.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-21  1:02             ` andrew.cooper3
@ 2023-03-21  7:49               ` Li, Xin3
  2023-03-22 16:29               ` Dave Hansen
  1 sibling, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-21  7:49 UTC (permalink / raw)
  To: andrew.cooper3, Peter Zijlstra, Lai Jiangshan
  Cc: H. Peter Anvin, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> >>> If there is no other concrete reason other than overflowing for
> >>> assigning NMI and #DB with a stack level > 0, #VE should also be
> >>> assigned with a stack level > 0, and #BP too. #VE can happen anytime
> >>> and anywhere, so it is subject to overflowing too.
> >> So #BP needs the stack-gap (redzone) for text_poke_bp().
> >>
> >> #BP can end up in kprobes which can then end up in ftrace/perf, depending on
> >> how it's all wired up.
> >>
> >> #VE is currently a trainwreck vs NMI/MCE, but I think FRED solves the worst of
> >> that. I'm not exactly sure how deep the #VE handler goes.
> >>
> > VE under IDT is *not* using an IST, we need some solid rationales here.
> 
> #VE, and #VC on AMD, are borderline unusable.  Both under IDT and FRED.

Oops!

> The reason #VE is not IST is because there are plenty of real cases
> where a non-malicious outer hypervisor could create reentrant faults
> that lose program state.  e.g. hitting an IO instruction, then hitting
> an emulated MSR.
>
> There are fewer cases where a non-IST #VE ends up in a re-entrant fault
> (IIRC, you can still manage it by unmapping the entry stack), but you're
> still trusting the outer hypervisor to not e.g. unmap the SYSCALL entry
> point.
> 
> FRED gets rid of the "reentrant fault overwriting it on the stack" case,
> and removes the syscall gap case, replacing them instead with a stack
> overflow in the worst case because there is still no upper bound to how
> many times #VE can actually be delivered in the course of servicing a
> single #VE.

Exactly, FRED stack levels can make use of the whole regular stack space.

I guess you don't support putting #VE on a higher stack level?

> ~Andrew
> 
> P.S. While I hate to cite myself, if you haven't read
> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> yet, do so.  It did feed into some of the FRED design.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 20/34] x86/fred: add a machine check entry stub for FRED
  2023-03-21  0:04     ` Li, Xin3
@ 2023-03-21  8:59       ` Peter Zijlstra
  2023-03-21 16:38         ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Peter Zijlstra @ 2023-03-21  8:59 UTC (permalink / raw)
  To: Li, Xin3
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

On Tue, Mar 21, 2023 at 12:04:47AM +0000, Li, Xin3 wrote:
> > > Unlike IDT, no need to save/restore dr7 in FRED machine check handler.
> > 
> > Given how fragile MCE is, the question should be, do we ever want hw
> > breakpoints to happen while it is running?
> 
> HW breakpoints still work if they are properly configured.
> 
> > If the hw-breakpoint handler trips on the same memory fail that got us into the
> > mce the first time, we're dead.
> 
> Right.
> 
> Unless the MCIP bit is turned off any subsequent #MC goes to shutdown
> ("machine is screwed").
> 
> It's the kernel debugger's responsibility to decide how to proceed in such
> cases. But if the kernel debugger itself is in a screwed memory region, we
> are soooooo dead.

Yeah, so I would much prefer, for robustness sake, to start out with not
allowing #DB in MCE -- much like today.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 20/34] x86/fred: add a machine check entry stub for FRED
  2023-03-21  8:59       ` Peter Zijlstra
@ 2023-03-21 16:38         ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-21 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> On Tue, Mar 21, 2023 at 12:04:47AM +0000, Li, Xin3 wrote:
> > > > Unlike IDT, no need to save/restore dr7 in FRED machine check handler.
> > >
> > > Given how fragile MCE is, the question should be, do we ever want hw
> > > breakpoints to happen while it is running?
> >
> > HW breakpoints still work if they are properly configured.
> >
> > > If the hw-breakpoint handler trips on the same memory fail that got
> > > us into the mce the first time, we're dead.
> >
> > Right.
> >
> > Unless the MCIP bit is turned off any subsequent #MC goes to shutdown
> > ("machine is screwed").
> >
> > It's the kernel debugger's responsibility to decide how to proceed in
> > such cases. But if the kernel debugger itself is in a screwed memory
> > region, we are soooooo dead.
> 
> Yeah, so I would much prefer, for robustness sake, to start out with not allowing
> #DB in MCE -- much like today.

Will disable #DB inside #MC then.
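
A minimal user-space sketch of what "no #DB inside the handler" amounts
to, assuming the usual save/clear/restore pattern on DR7. The model_*
names and the global standing in for DR7 are hypothetical; real code
would go through the kernel's debug register helpers instead.

```c
/* Models the DR7 register; nonzero means some breakpoint is enabled. */
static unsigned long model_dr7;
/* Counts #DB events that would have fired in this model. */
static int model_db_hits;

/* Save DR7 and clear it so no hardware breakpoint can trigger. */
static unsigned long model_db_save(void)
{
	unsigned long dr7 = model_dr7;

	if (dr7)
		model_dr7 = 0;
	return dr7;
}

/* Restore the saved DR7 value, re-arming breakpoints. */
static void model_db_restore(unsigned long dr7)
{
	if (dr7)
		model_dr7 = dr7;
}

/* A data access fires #DB only while DR7 enables a breakpoint. */
static void model_touch_memory(void)
{
	if (model_dr7)
		model_db_hits++;
}

/* Machine check handler body, bracketed by DR7 save and restore. */
static void model_mce_handler(void)
{
	unsigned long dr7 = model_db_save();

	model_touch_memory();	/* cannot raise #DB here */
	model_db_restore(dr7);
}
```

The point of the bracket is that whatever the handler touches while
recovering, it cannot recurse into the breakpoint path; breakpoints are
armed again only once the handler is done.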

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-18  6:33       ` Lai Jiangshan
  2023-03-20 16:49         ` Peter Zijlstra
@ 2023-03-22  2:22         ` Li, Xin3
  2023-03-22  4:01           ` Dave Hansen
  2023-03-22 18:25           ` andrew.cooper3
  1 sibling, 2 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-22  2:22 UTC (permalink / raw)
  To: Lai Jiangshan, H. Peter Anvin
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, peterz,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> If there is no other concrete reason other than overflowing for assigning NMI and
> #DB with a stack level > 0, #VE should also be assigned with a stack level > 0, and
> #BP too. #VE can happen anytime and anywhere, so it is subject to overflowing too.

With IDT, both #VE and #BP do not use IST, but NMI, #DB, #MC and #DF do.

Let's keep this "secret" logic for now, i.e., not change the stack levels
for #VE and #BP at this point. We can do "optimization", i.e., change them
later :).


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-22  2:22         ` Li, Xin3
@ 2023-03-22  4:01           ` Dave Hansen
  2023-03-22  5:40             ` Li, Xin3
  2023-03-22 18:25           ` andrew.cooper3
  1 sibling, 1 reply; 80+ messages in thread
From: Dave Hansen @ 2023-03-22  4:01 UTC (permalink / raw)
  To: Li, Xin3, Lai Jiangshan, H. Peter Anvin
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, peterz,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

On 3/21/23 19:22, Li, Xin3 wrote:
>> If there is no other concrete reason other than overflowing for assigning NMI and
>> #DB with a stack level > 0, #VE should also be assigned with a stack level > 0, and
>> #BP too. #VE can happen anytime and anywhere, so it is subject to overflowing too.
> With IDT, both #VE and #BP do not use IST, but NMI, #DB, #MC and #DF do.
> 
> Let's keep this "secret" logic for now, i.e., not change the stack levels
> for #VE and #BP at this point. We can do "optimization", i.e., change them
> later 😄.

#VE also can't happen anywhere.  There is some documentation about it in
here:

	https://docs.kernel.org/x86/tdx.html#linux-ve-handler

But, basically, the only halfway sane thing a guest might do to hit a
#VE is touch some "MMIO".  The host can *not* cause them in arbitrary
places because of the SEPT_VE_DISABLE attribute.

#VE's also can't nest until after the guest retrieves the "VE info".
That means that the #VE handler at _least_ reaches C code before it's
subject to another #VE and that second one would still need to be
induced by something the guest does explicitly.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-22  4:01           ` Dave Hansen
@ 2023-03-22  5:40             ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-22  5:40 UTC (permalink / raw)
  To: Hansen, Dave, Lai Jiangshan, H. Peter Anvin
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, peterz,
	andrew.cooper3, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

> >> If there is no other concrete reason other than overflowing for assigning NMI and
> >> #DB with a stack level > 0, #VE should also be assigned with a stack level > 0, and
> >> #BP too. #VE can happen anytime and anywhere, so it is subject to overflowing too.
> > With IDT, both #VE and #BP do not use IST, but NMI, #DB, #MC and #DF do.
> >
> > Let's keep this "secret" logic for now, i.e., not change the stack levels
> > for #VE and #BP at this point. We can do "optimization", i.e., change them
> > later 😄.
> 
> #VE also can't happen anywhere.  There is some documentation about it in
> here:
> 
> 	https://docs.kernel.org/x86/tdx.html#linux-ve-handler
> 
> But, basically, the only halfway sane thing a guest might do to hit a
> #VE is touch some "MMIO".  The host can *not* cause them in arbitrary
> places because of the SEPT_VE_DISABLE attribute.
> 
> #VE's also can't nest until after the guest retrieves the "VE info".
> That means that the #VE handler at _least_ reaches C code before it's
> subject to another #VE and that second one would still need to be
> induced by something the guest does explicitly.

Thanks a lot for the detailed background!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-21  1:02             ` andrew.cooper3
  2023-03-21  7:49               ` Li, Xin3
@ 2023-03-22 16:29               ` Dave Hansen
  1 sibling, 0 replies; 80+ messages in thread
From: Dave Hansen @ 2023-03-22 16:29 UTC (permalink / raw)
  To: andrew.cooper3, Li, Xin3, Peter Zijlstra, Lai Jiangshan
  Cc: H. Peter Anvin, linux-kernel, x86, kvm, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

On 3/20/23 18:02, andrew.cooper3@citrix.com wrote:
> There are fewer cases where a non-IST #VE ends up in a re-entrant fault
> (IIRC, you can still manage it by unmapping the entry stack), but you're
> still trusting the outer hypervisor to not e.g. unmap the SYSCALL entry
> point.

This is a general weakness of #VE.  But, the current Linux TDX guest
implementation is not vulnerable to it.  If the host unmaps something
unexpectedly, the guest will just die because of ATTR_SEPT_VE_DISABLE.
No #VE:

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/coco/tdx/tdx.c#n216



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-07  2:39 ` [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled Xin Li
  2023-03-07 22:00   ` Li, Xin3
@ 2023-03-22 17:49   ` Sean Christopherson
  2023-03-22 23:03     ` andrew.cooper3
  2023-03-22 23:43     ` Li, Xin3
  1 sibling, 2 replies; 80+ messages in thread
From: Sean Christopherson @ 2023-03-22 17:49 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023, Xin Li wrote:
> Execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled.
> 
> Like IRET for IDT, ERETS/ERETU are required to end the NMI handler for FRED
> to unblock NMI ASAP (w/ bit 28 of CS set).

That's "CS" on the stack correct?  Is bit 28 set manually by software, or is it
set automatically by hardware?  If it's set by hardware, does "int $2" actually
set the bit since it's not a real NMI?

> And there are 2 approaches to
> invoke the FRED NMI handler:
> 1) execute "int $2", let the h/w do the job.
> 2) create a FRED NMI stack frame on the current kernel stack with ASM,
>    and then jump to fred_entrypoint_kernel in arch/x86/entry/entry_64_fred.S.
> 
> 1) is preferred as we want less ASM.

Who is "we", and how much assembly are we talking about?  E.g. I personally don't
mind a trampoline in KVM if it's small and/or can share code with existing
assembly subroutines.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection
  2023-03-07  2:39 ` [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection Xin Li
@ 2023-03-22 17:52   ` Sean Christopherson
  2023-03-22 22:38     ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Sean Christopherson @ 2023-03-22 17:52 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023, Xin Li wrote:
> To eliminate dispatching IRQ through the IDT, export external_interrupt()
> for VMX IRQ reinjection.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  arch/x86/include/asm/traps.h |  2 ++
>  arch/x86/kernel/traps.c      | 14 ++++++++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 46f5e4e2a346..da4c21ed68b4 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -56,4 +56,6 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
>  	void f (struct pt_regs *regs)
>  typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
>  
> +int external_interrupt(struct pt_regs *regs, unsigned int vector);
> +
>  #endif /* _ASM_X86_TRAPS_H */
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 31ad645be2fb..cebba1f49e19 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1540,6 +1540,20 @@ int external_interrupt(struct pt_regs *regs, unsigned int vector)
>  	return 0;
>  }
>  
> +#if IS_ENABLED(CONFIG_KVM_INTEL)
> +/*
> + * KVM VMX reinjects IRQ on its current stack, it's a sync call
> + * thus the values in the pt_regs structure are not used in
> + * executing IRQ handlers, except cs.RPL and flags.IF, which
> + * are both always 0 in the VMX IRQ reinjection context.
> + *
> + * However, the pt_regs structure is sometimes used in stack
> + * dump, e.g., show_regs(). So let the caller, i.e., KVM VMX
> + * decide how to initialize the input pt_regs structure.
> + */
> +EXPORT_SYMBOL_GPL(external_interrupt);
> +#endif

If the x86 maintainers don't object, I would prefer this to be squashed with the
actual KVM usage, that way discussions on exactly what the exported API should be
can be contained in a single thread.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 33/34] KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM exits
  2023-03-07  2:39 ` [PATCH v5 33/34] KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM exits Xin Li
@ 2023-03-22 17:57   ` Sean Christopherson
  0 siblings, 0 replies; 80+ messages in thread
From: Sean Christopherson @ 2023-03-22 17:57 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, pbonzini, ravi.v.shankar

On Mon, Mar 06, 2023, Xin Li wrote:
> @@ -6923,7 +6924,26 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> -	vmx_do_interrupt_irqoff(gate_offset(desc));
> +	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> +		struct vcpu_vmx *vmx = to_vmx(vcpu);
> +		struct pt_regs regs = {};
> +
> +		/*
> +		 * Create an event return stack frame with the
> +		 * host context immediately after a VM exit.

Why snapshot the context immediately after VM-Exit?  It diverges from what is
done in the non-FRED path, and it seems quite misleading and maybe even dangerous.
The RSP and RIP values are long since gone, e.g. if something explodes, the stack
trace will be outright wrong.

> +		 *
> +		 * All other fields of the pt_regs structure are
> +		 * cleared to 0.
> +		 */
> +		regs.ss		= __KERNEL_DS;
> +		regs.sp		= vmx->loaded_vmcs->host_state.rsp;
> +		regs.flags	= X86_EFLAGS_FIXED;
> +		regs.cs		= __KERNEL_CS;
> +		regs.ip		= (unsigned long)vmx_vmexit;
> +
> +		external_interrupt(&regs, vector);

I assume FRED still uses the stack, so why not do something similar to
vmx_do_interrupt_irqoff() and build @regs after an explicit CALL?  Might even
be possible to share some/all of VMX_DO_EVENT_IRQOFF.

> +	} else

Curly braces needed since the first half has 'em.

> +		vmx_do_interrupt_irqoff(gate_offset(desc));
>  	kvm_after_interrupt(vcpu);
>  
>  	vcpu->arch.at_instruction_boundary = true;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 22/34] x86/fred: FRED initialization code
  2023-03-22  2:22         ` Li, Xin3
  2023-03-22  4:01           ` Dave Hansen
@ 2023-03-22 18:25           ` andrew.cooper3
  1 sibling, 0 replies; 80+ messages in thread
From: andrew.cooper3 @ 2023-03-22 18:25 UTC (permalink / raw)
  To: Li, Xin3, Lai Jiangshan, H. Peter Anvin
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, peterz,
	Christopherson,,
	Sean, pbonzini, Shankar, Ravi V

On 22/03/2023 2:22 am, Li, Xin3 wrote:
>> If there is no other concrete reason other than overflowing for assigning NMI and
>> #DB with a stack level > 0, #VE should also be assigned with a stack level > 0, and
>> #BP too. #VE can happen anytime and anywhere, so it is subject to overflowing too.
> With IDT, both #VE and #BP do not use IST, but NMI, #DB, #MC and #DF do.
>
> Let's keep this "secret" logic for now, i.e., not change the stack levels
> for #VE and #BP at this point. We can do "optimization", i.e., change them
> later :).

Fun fact.  #BP used to be IST, and used to share the same IST as #DF.

This was spoiled by CVE-2018-8897 and a MovSS-delayed breakpoint over
INT3, at which point hardware queued both a #BP and #DB on the same IST
stack and lost program state.

There's no specific need for #BP to be IST to begin with, hence why
making it not-IST was the security fix.

~Andrew

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection
  2023-03-22 17:52   ` Sean Christopherson
@ 2023-03-22 22:38     ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-22 22:38 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, pbonzini, Shankar, Ravi V

> > +#if IS_ENABLED(CONFIG_KVM_INTEL)
> > +/*
> > + * KVM VMX reinjects IRQ on its current stack, it's a sync call
> > + * thus the values in the pt_regs structure are not used in
> > + * executing IRQ handlers, except cs.RPL and flags.IF, which
> > + * are both always 0 in the VMX IRQ reinjection context.
> > + *
> > + * However, the pt_regs structure is sometimes used in stack
> > + * dump, e.g., show_regs(). So let the caller, i.e., KVM VMX
> > + * decide how to initialize the input pt_regs structure.
> > + */
> > +EXPORT_SYMBOL_GPL(external_interrupt);
> > +#endif
> 
> If the x86 maintainers don't object, I would prefer this to be squashed with the
> actual KVM usage, that way discussions on exactly what the exported API should be
> can be contained in a single thread.

The KVM usage is the only one now, thus it does make sense to squash into one.

I'm working on v6 and will merge this patch into the corresponding KVM patch.
BTW, I will stop using "reinject" as asked.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-22 17:49   ` Sean Christopherson
@ 2023-03-22 23:03     ` andrew.cooper3
  2023-03-22 23:42       ` Sean Christopherson
  2023-03-22 23:43     ` Li, Xin3
  1 sibling, 1 reply; 80+ messages in thread
From: andrew.cooper3 @ 2023-03-22 23:03 UTC (permalink / raw)
  To: Sean Christopherson, Xin Li
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, pbonzini, ravi.v.shankar

On 22/03/2023 5:49 pm, Sean Christopherson wrote:
> On Mon, Mar 06, 2023, Xin Li wrote:
>> Execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled.
>>
>> Like IRET for IDT, ERETS/ERETU are required to end the NMI handler for FRED
>> to unblock NMI ASAP (w/ bit 28 of CS set).
> That's "CS" on the stack correct?  Is bit 28 set manually by software, or is it
> set automatically by hardware?  If it's set by hardware, does "int $2" actually
> set the bit since it's not a real NMI?

int $2 had better not set it...  This is the piece of state that is
intended to cause everything which isn't a real NMI to nest properly
inside a real NMI.

It is supposed to be set on delivery of an NMI, and act as the trigger
for ERET{U,S} to drop the latch.

Software can set it manually in a FRED frame in order to explicitly
unblock NMIs.
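
To make the rule above concrete, here is a minimal C model of the NMI
latch.  It is only a sketch of the behaviour described in this thread, not
kernel code: struct cpu_model, deliver_nmi(), deliver_int2() and eret() are
hypothetical names, and bit 28 is taken from the changelog quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 28 of the saved CS slot, as discussed in this thread. */
#define FRED_CS_NMI_BIT (1ULL << 28)

/* Hypothetical model of the one piece of CPU state we care about. */
struct cpu_model {
	bool nmi_blocked;
};

/* A real NMI: hardware blocks further NMIs and marks the saved frame. */
static uint64_t deliver_nmi(struct cpu_model *cpu, uint64_t cs)
{
	cpu->nmi_blocked = true;
	return cs | FRED_CS_NMI_BIT;
}

/* "int $2": an ordinary software interrupt, the frame is not marked. */
static uint64_t deliver_int2(struct cpu_model *cpu, uint64_t cs)
{
	(void)cpu;		/* NMI blocking is left untouched */
	return cs;
}

/* ERET{U,S}: drop the latch only if the saved frame carries the bit. */
static void eret(struct cpu_model *cpu, uint64_t saved_cs)
{
	if (saved_cs & FRED_CS_NMI_BIT)
		cpu->nmi_blocked = false;
}
```

In this model an "int $2" taken inside a real NMI returns without
unblocking NMIs, so it nests properly; only the return through the frame
that carries the bit drops the latch.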

~Andrew

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-22 23:03     ` andrew.cooper3
@ 2023-03-22 23:42       ` Sean Christopherson
  2023-03-23  0:26         ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Sean Christopherson @ 2023-03-22 23:42 UTC (permalink / raw)
  To: andrew.cooper3
  Cc: Xin Li, linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen,
	hpa, peterz, pbonzini, ravi.v.shankar

On Wed, Mar 22, 2023, andrew.cooper3@citrix.com wrote:
> On 22/03/2023 5:49 pm, Sean Christopherson wrote:
> > On Mon, Mar 06, 2023, Xin Li wrote:
> >> Execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled.
> >>
> >> Like IRET for IDT, ERETS/ERETU are required to end the NMI handler for FRED
> >> to unblock NMI ASAP (w/ bit 28 of CS set).
> > That's "CS" on the stack correct?  Is bit 28 set manually by software, or is it
> > set automatically by hardware?  If it's set by hardware, does "int $2" actually
> > set the bit since it's not a real NMI?
> 
> int $2 had better not set it...  This is the piece of state that is
> intended to cause everything which isn't a real NMI to nest properly
> inside a real NMI.
> 
> It is supposed to be set on delivery of an NMI, and act as the trigger
> for ERET{U,S} to drop the latch.
> 
> Software can set it manually in a FRED frame in order to explicitly
> unblock NMIs.

Ah, found this in patch 19.  That hunk really belongs in this patch, because this
patch is full of magic without that information.

+       /*
+        * VM exits induced by NMIs keep NMI blocked, and we do
+        * "int $2" to reinject the NMI w/ NMI kept being blocked.
+        * However "int $2" doesn't set the nmi bit in the FRED
+        * stack frame, so we explicitly set it to make sure a
+        * later ERETS will unblock NMI immediately.
+        */
+       regs->nmi = 1;

Organization aside, this seems to defeat the purpose of _not_ unconditionally
unmasking NMIs on ERET since the kernel assumes any random "int $2" is coming from
KVM after an NMI VM-Exit.

Eww, and "int $2" doesn't even go directly to fred_exc_nmi(), it trampolines
through fred_sw_interrupt_kernel() first.  Looks like "int $2" from userspace gets
routed to a #GP, so at least that bit is handled.

I'm not dead set against the proposed approach, but IMO it's not obviously better
than a bit of assembly to have a more direct call into the NMI handler.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-22 17:49   ` Sean Christopherson
  2023-03-22 23:03     ` andrew.cooper3
@ 2023-03-22 23:43     ` Li, Xin3
  1 sibling, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-22 23:43 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, andrew.cooper3, pbonzini, Shankar, Ravi V

> > Like IRET for IDT, ERETS/ERETU are required to end the NMI handler for
> > FRED to unblock NMI ASAP (w/ bit 28 of CS set).
> 
> That's "CS" on the stack correct?  Is bit 28 set manually by software, or is it set
> automatically by hardware?  If it's set by hardware, does "int $2" actually set the
> bit since it's not a real NMI?

Right, it's the "CS" on the stack. Bit 28 is set by the FRED NMI handler:
https://lore.kernel.org/lkml/20230307023946.14516-20-xin3.li@intel.com/

Upon NMI delivery, the NMI bit is always set by H/W. However, "int $2" does
NOT set it, thus we need to set it explicitly.
 
> > And there are 2 approaches to
> > invoke the FRED NMI handler:
> > 1) execute "int $2", let the h/w do the job.
> > 2) create a FRED NMI stack frame on the current kernel stack with ASM,
> >    and then jump to fred_entrypoint_kernel in arch/x86/entry/entry_64_fred.S.
> >
> > 1) is preferred as we want less ASM.
> 
> Who is "we", and how much assembly are we talking about?  E.g. I personally don't
> mind a trampoline in KVM if it's small and/or can share code with existing assembly
> subroutines.

I once got such a comment:
https://lore.kernel.org/lkml/8735bpbhat.ffs@tglx/

However, if ASM is also okay, I can work on it.  And I don't think the ASM code
will be big.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-22 23:42       ` Sean Christopherson
@ 2023-03-23  0:26         ` Li, Xin3
  2023-03-24 17:45           ` Li, Xin3
  0 siblings, 1 reply; 80+ messages in thread
From: Li, Xin3 @ 2023-03-23  0:26 UTC (permalink / raw)
  To: Christopherson,, Sean, andrew.cooper3
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, pbonzini, Shankar, Ravi V

> Organization aside, this seems to defeat the purpose of _not_ unconditionally
> unmasking NMIs on ERET since the kernel assumes any random "int $2" is coming
> from KVM after an NMI VM-Exit.

I'm a bit confused.  KVM VMX is the only component needing to execute "int $2"
and it surely has NMI blocked after an NMI VM-exit.

> Eww, and "int $2" doesn't even go directly to fred_exc_nmi(), it trampolines
> through fred_sw_interrupt_kernel() first.  Looks like "int $2" from userspace gets
> routed to a #GP, so at least that bit is handled.

FRED does a 2-level dispatch; unless an event handler is on a hot path,
we don't promote its handling.  NMI does not seem to be a frequent event.
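
For readers following along, a rough C sketch of what a 2-level dispatch
looks like: the first level is indexed by event type, and the software
interrupt handler does a second lookup by vector, which is how "int $2"
ends up in the NMI handler.  Everything here is illustrative: the enum
values, handler names and return codes are assumptions, not the code in
this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative event types; the numeric values are assumptions. */
enum { TYPE_EXTINT = 0, TYPE_NMI = 2, TYPE_SWINT = 4, NR_TYPES = 8 };

/* What ended up handling the event, so the sketch can be checked. */
enum handled { H_BAD, H_IRQ, H_NMI };

struct event {
	uint8_t type;
	uint8_t vector;
};

typedef enum handled (*handler_fn)(const struct event *);

static enum handled handle_nmi(const struct event *e)
{
	(void)e;
	return H_NMI;
}

static enum handled handle_extint(const struct event *e)
{
	(void)e;
	return H_IRQ;
}

/* Second level: software interrupts are dispatched again by vector. */
static enum handled handle_swint(const struct event *e)
{
	switch (e->vector) {
	case 2:			/* "int $2" is forwarded to the NMI handler */
		return handle_nmi(e);
	default:
		return H_BAD;
	}
}

/* First level: one slot per event type, unused slots stay NULL. */
static const handler_fn type_table[NR_TYPES] = {
	[TYPE_EXTINT] = handle_extint,
	[TYPE_NMI]    = handle_nmi,
	[TYPE_SWINT]  = handle_swint,
};

static enum handled dispatch(const struct event *e)
{
	handler_fn fn = e->type < NR_TYPES ? type_table[e->type] : NULL;

	return fn ? fn(e) : H_BAD;
}
```

A real NMI takes one lookup, while "int $2" takes two before reaching the
same handler, which is the extra hop Sean points out.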

> I'm not dead set against the proposed approach, but IMO it's not obviously better
> than a bit of assembly to have a more direct call into the NMI handler.

I will give it a shot.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled
  2023-03-23  0:26         ` Li, Xin3
@ 2023-03-24 17:45           ` Li, Xin3
  0 siblings, 0 replies; 80+ messages in thread
From: Li, Xin3 @ 2023-03-24 17:45 UTC (permalink / raw)
  To: Li, Xin3, Christopherson,, Sean, andrew.cooper3
  Cc: linux-kernel, x86, kvm, tglx, mingo, bp, dave.hansen, hpa,
	peterz, pbonzini, Shankar, Ravi V

> > I'm not dead set against the proposed approach, but IMO it's not
> > obviously better than a bit of assembly to have a more direct call into the NMI
> handler.
> 
> I will give it a shot.

Hi Sean,

I have a working patch.  Before I resend the whole FRED patch set, can
you please check whether this is what you're expecting?

When FRED is enabled, the x86 CPU always pushes an error code on the stack
immediately after the return instruction address is pushed. To generate such
a stack frame, call a trampoline function first to push the return instruction
address on the stack; the trampoline function then pushes an error code
(0 for IRQ/NMI) and jumps to fred_entrypoint_kernel.
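
As a reading aid, the two 64-bit words that the assembly in this message
builds for the SS and CS slots can be written out in C.  This is only a
sketch mirroring the constants used in the diff (vector at bit 32, event
type at bit 48, bit 57 for 64-bit mode, bit 28 of CS for NMI); the helper
names are hypothetical, and the selector values 0x10/0x18 are assumed
values of __KERNEL_CS/__KERNEL_DS on x86-64.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_CS		0x10ULL	/* assumed __KERNEL_CS */
#define KERNEL_DS		0x18ULL	/* assumed __KERNEL_DS */

#define EVENT_TYPE_EXTINT	0ULL	/* only the vector is shifted in */
#define EVENT_TYPE_NMI		2ULL	/* "2 << 48" in the assembly */
#define NMI_VECTOR		2ULL	/* "2 << 32" in the assembly */

/* SS slot: selector | vector << 32 | type << 48 | 64-bit mode (bit 57). */
static uint64_t fred_ss_slot(uint64_t type, uint64_t vector)
{
	assert(vector <= 0xff);		/* vectors are 8 bits wide */
	return KERNEL_DS | (vector << 32) | (type << 48) | (1ULL << 57);
}

/* CS slot: selector, plus bit 28 when the frame represents an NMI. */
static uint64_t fred_cs_slot(int is_nmi)
{
	return KERNEL_CS | (is_nmi ? 1ULL << 28 : 0);
}
```

With these assumptions, fred_ss_slot(EVENT_TYPE_NMI, NMI_VECTOR) is the
value the "nmi=1" path pushes, and fred_cs_slot(1) is the CS word with the
NMI bit set so that a later ERETS unblocks NMIs.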

I could have vmx_do_interrupt_trampoline jump to fred_entrypoint_kernel
instead of calling external_interrupt(), but that would re-enter the noinstr
text (not a big problem, but it seems not to be preferred by Peter Z).

Thanks!
  Xin

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 631fd7da2bc3..6682b5bd202b 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -31,7 +31,7 @@
 #define VCPU_R15       __VCPU_REGS_R15 * WORD_SIZE
 #endif

-.macro VMX_DO_EVENT_IRQOFF call_insn call_target
+.macro VMX_DO_EVENT_IRQOFF call_insn call_target fred=1 nmi=0
        /*
         * Unconditionally create a stack frame, getting the correct RSP on the
         * stack (for x86-64) would take two instructions anyways, and RBP can
@@ -46,11 +46,34 @@
         * creating the synthetic interrupt stack frame for the IRQ/NMI.
         */
        and  $-16, %rsp
+
+       .if \fred
+       push $0         /* Reserved by FRED, must be 0 */
+       push $0         /* FRED event data, 0 for NMI and external interrupts */
+
+       .if \nmi
+       mov $(2 << 32 | 2 << 48), %_ASM_AX      /* NMI event type and vector */
+       .else
+       mov %_ASM_ARG1, %_ASM_AX
+       shl $32, %_ASM_AX                       /* external interrupt vector */
+       .endif
+       add $__KERNEL_DS, %_ASM_AX
+       bts $57, %_ASM_AX                       /* bit 57: 64-bit mode */
+       push %_ASM_AX
+       .else
        push $__KERNEL_DS
+       .endif
+
        push %rbp
 #endif
        pushf
+       .if \nmi
+       mov $__KERNEL_CS, %_ASM_AX
+       bts $28, %_ASM_AX                       /* set the NMI bit */
+       push %_ASM_AX
+       .else
        push $__KERNEL_CS
+       .endif
        \call_insn \call_target

        /*
@@ -299,8 +322,19 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)

 SYM_FUNC_END(__vmx_vcpu_run)

+SYM_FUNC_START(vmx_do_nmi_trampoline)
+#ifdef CONFIG_X86_FRED
+       ALTERNATIVE "jmp .Lno_errorcode_push", "", X86_FEATURE_FRED
+       push $0         /* FRED error code, 0 for NMI */
+       jmp fred_entrypoint_kernel
+#endif
+
+.Lno_errorcode_push:
+       jmp asm_exc_nmi_kvm_vmx
+SYM_FUNC_END(vmx_do_nmi_trampoline)
+
 SYM_FUNC_START(vmx_do_nmi_irqoff)
-       VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
+       VMX_DO_EVENT_IRQOFF call vmx_do_nmi_trampoline nmi=1
 SYM_FUNC_END(vmx_do_nmi_irqoff)


@@ -358,5 +392,51 @@ SYM_FUNC_END(vmread_error_trampoline)
 #endif

 SYM_FUNC_START(vmx_do_interrupt_irqoff)
-       VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
+       VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 fred=0
 SYM_FUNC_END(vmx_do_interrupt_irqoff)
+
+#ifdef CONFIG_X86_64
+SYM_FUNC_START(vmx_do_interrupt_trampoline)
+       push $0 /* FRED error code, 0 for NMI and external interrupts */
+       push %rdi
+       push %rsi
+       push %rdx
+       push %rcx
+       push %rax
+       push %r8
+       push %r9
+       push %r10
+       push %r11
+       push %rbx
+       push %rbp
+       push %r12
+       push %r13
+       push %r14
+       push %r15
+
+       movq    %rsp, %rdi      /* %rdi -> pt_regs */
+       call external_interrupt
+
+       pop %r15
+       pop %r14
+       pop %r13
+       pop %r12
+       pop %rbp
+       pop %rbx
+       pop %r11
+       pop %r10
+       pop %r9
+       pop %r8
+       pop %rax
+       pop %rcx
+       pop %rdx
+       pop %rsi
+       pop %rdi
+       addq $8,%rsp            /* Drop FRED error code */
+       RET
+SYM_FUNC_END(vmx_do_interrupt_trampoline)
+
+SYM_FUNC_START(vmx_do_fred_interrupt_irqoff)
+       VMX_DO_EVENT_IRQOFF call vmx_do_interrupt_trampoline
+SYM_FUNC_END(vmx_do_fred_interrupt_irqoff)
+#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d2d6e1b6c788..5addfee5cc6d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6875,6 +6875,7 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 }

 void vmx_do_interrupt_irqoff(unsigned long entry);
+void vmx_do_fred_interrupt_irqoff(unsigned int vector);
 void vmx_do_nmi_irqoff(void);

 static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
@@ -6923,7 +6924,12 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
                return;

        kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-       vmx_do_interrupt_irqoff(gate_offset(desc));
+#ifdef CONFIG_X86_64
+       if (cpu_feature_enabled(X86_FEATURE_FRED))
+               vmx_do_fred_interrupt_irqoff(vector);
+       else
+#endif
+               vmx_do_interrupt_irqoff(gate_offset(desc));
        kvm_after_interrupt(vcpu);

        vcpu->arch.at_instruction_boundary = true;

^ permalink raw reply related	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2023-03-24 17:45 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-07  2:39 [PATCH v5 00/34] x86: enable FRED for x86-64 Xin Li
2023-03-07  2:39 ` [PATCH v5 01/34] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR Xin Li
2023-03-07  2:39 ` [PATCH v5 02/34] x86/traps: add a system interrupt table for system interrupt dispatch Xin Li
2023-03-07  2:39 ` [PATCH v5 03/34] x86/traps: add install_system_interrupt_handler() Xin Li
2023-03-07  2:39 ` [PATCH v5 04/34] x86/traps: add external_interrupt() to dispatch external interrupts Xin Li
2023-03-20 15:36   ` Peter Zijlstra
2023-03-20 17:42     ` Peter Zijlstra
2023-03-20 23:47       ` Li, Xin3
2023-03-20 17:53     ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 05/34] x86/traps: export external_interrupt() for VMX IRQ reinjection Xin Li
2023-03-22 17:52   ` Sean Christopherson
2023-03-22 22:38     ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 06/34] x86/cpufeature: add the cpu feature bit for FRED Xin Li
2023-03-07  2:39 ` [PATCH v5 07/34] x86/opcode: add ERETU, ERETS instructions to x86-opcode-map Xin Li
2023-03-07  2:39 ` [PATCH v5 08/34] x86/objtool: teach objtool about ERETU and ERETS Xin Li
2023-03-07  2:39 ` [PATCH v5 09/34] x86/cpu: add X86_CR4_FRED macro Xin Li
2023-03-07  2:39 ` [PATCH v5 10/34] x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED) Xin Li
2023-03-07  2:39 ` [PATCH v5 11/34] x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support Xin Li
2023-03-07  2:39 ` [PATCH v5 12/34] x86/cpu: add MSR numbers for FRED configuration Xin Li
2023-03-07  2:39 ` [PATCH v5 13/34] x86/fred: header file for event types Xin Li
2023-03-07  2:39 ` [PATCH v5 14/34] x86/fred: header file with FRED definitions Xin Li
2023-03-07  2:39 ` [PATCH v5 15/34] x86/fred: make unions for the cs and ss fields in struct pt_regs Xin Li
2023-03-07  2:39 ` [PATCH v5 16/34] x86/fred: reserve space for the FRED stack frame Xin Li
2023-03-07  2:39 ` [PATCH v5 17/34] x86/fred: add a page fault entry stub for FRED Xin Li
2023-03-07  2:39 ` [PATCH v5 18/34] x86/fred: add a debug " Xin Li
2023-03-07  2:39 ` [PATCH v5 19/34] x86/fred: add a NMI " Xin Li
2023-03-07  2:39 ` [PATCH v5 20/34] x86/fred: add a machine check " Xin Li
2023-03-20 16:00   ` Peter Zijlstra
2023-03-21  0:04     ` Li, Xin3
2023-03-21  8:59       ` Peter Zijlstra
2023-03-21 16:38         ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 21/34] x86/fred: FRED entry/exit and dispatch code Xin Li
2023-03-07  2:39 ` [PATCH v5 22/34] x86/fred: FRED initialization code Xin Li
2023-03-17 13:35   ` Lai Jiangshan
2023-03-17 21:32     ` H. Peter Anvin
2023-03-18  6:33       ` Lai Jiangshan
2023-03-20 16:49         ` Peter Zijlstra
2023-03-21  0:12           ` Li, Xin3
2023-03-21  1:02             ` andrew.cooper3
2023-03-21  7:49               ` Li, Xin3
2023-03-22 16:29               ` Dave Hansen
2023-03-22  2:22         ` Li, Xin3
2023-03-22  4:01           ` Dave Hansen
2023-03-22  5:40             ` Li, Xin3
2023-03-22 18:25           ` andrew.cooper3
2023-03-20 16:44       ` Peter Zijlstra
2023-03-21  0:13         ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 23/34] x86/fred: update MSR_IA32_FRED_RSP0 during task switch Xin Li
2023-03-20 16:52   ` Peter Zijlstra
2023-03-20 23:54     ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 24/34] x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is enabled Xin Li
2023-03-07  2:39 ` [PATCH v5 25/34] x86/fred: disallow the swapgs instruction " Xin Li
2023-03-20 16:54   ` Peter Zijlstra
2023-03-20 17:58     ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 26/34] x86/fred: no ESPFIX needed " Xin Li
2023-03-07  2:39 ` [PATCH v5 27/34] x86/fred: allow single-step trap and NMI when starting a new thread Xin Li
2023-03-07  2:39 ` [PATCH v5 28/34] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user Xin Li
2023-03-17  9:39   ` Lai Jiangshan
2023-03-17  9:55     ` andrew.cooper3
2023-03-17 13:02       ` Lai Jiangshan
2023-03-17 21:23         ` H. Peter Anvin
2023-03-17 21:00       ` H. Peter Anvin
2023-03-18  7:55     ` Li, Xin3
2023-03-07  2:39 ` [PATCH v5 29/34] x86/ia32: do not modify the DPL bits for a null selector Xin Li
2023-03-07  2:39 ` [PATCH v5 30/34] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f Xin Li
2023-03-07  2:39 ` [PATCH v5 31/34] x86/fred: allow dynamic stack frame size Xin Li
2023-03-07  2:39 ` [PATCH v5 32/34] x86/fred: disable FRED by default in its early stage Xin Li
2023-03-07  2:39 ` [PATCH v5 33/34] KVM: x86/vmx: call external_interrupt() to handle IRQ in IRQ caused VM exits Xin Li
2023-03-22 17:57   ` Sean Christopherson
2023-03-07  2:39 ` [PATCH v5 34/34] KVM: x86/vmx: execute "int $2" to handle NMI in NMI caused VM exits when FRED is enabled Xin Li
2023-03-07 22:00   ` Li, Xin3
2023-03-22 17:49   ` Sean Christopherson
2023-03-22 23:03     ` andrew.cooper3
2023-03-22 23:42       ` Sean Christopherson
2023-03-23  0:26         ` Li, Xin3
2023-03-24 17:45           ` Li, Xin3
2023-03-22 23:43     ` Li, Xin3
2023-03-11  9:58 ` [PATCH v5 00/34] x86: enable FRED for x86-64 Kang, Shan
2023-03-11 21:29   ` Li, Xin3
2023-03-20  7:40   ` Kang, Shan
