* [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi
@ 2020-05-29 21:27 Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 01/14] x86/hw_breakpoint: Add within_area() to check data breakpoints Peter Zijlstra
                   ` (13 more replies)
  0 siblings, 14 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

These patches disallow #DB during NMI/#MC and allow removing a lot of fugly code.

Also included are 4 patches from the lockdep/nmi series that clean up the
x86/entry bits.

I'll send the rest of the lockdep patches separately.



* [PATCH 01/14] x86/hw_breakpoint: Add within_area() to check data breakpoints
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT Peter Zijlstra
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

From: Lai Jiangshan <laijs@linux.alibaba.com>

within_area() is added to check whether a data breakpoint overlaps
with cpu_entry_area; subsequent patches will use it to check whether
data breakpoints overlap with the GDT, IDT, or TSS in places other
than cpu_entry_area.
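
For illustration (hypothetical values, not part of the patch): the check
treats [addr, end] as a closed range and [base, base + size) as half-open,
so a one-byte access to the last byte of the area still overlaps:

  /* assuming a hypothetical 4-byte area at 0x1000 */
  within_area(0x0ffc, 0x0fff, 0x1000, 4);  /* false: end < base */
  within_area(0x1003, 0x1003, 0x1000, 4);  /* true: last byte of the area */
  within_area(0x1004, 0x1007, 0x1000, 4);  /* false: addr >= base + size */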

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200526014221.2119-2-laijs@linux.alibaba.com
---
 arch/x86/kernel/hw_breakpoint.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -228,13 +228,22 @@ int arch_check_bp_in_kernelspace(struct
 }
 
 /*
+ * Checks whether the range [addr, end] overlaps the area [base, base + size).
+ */
+static inline bool within_area(unsigned long addr, unsigned long end,
+			       unsigned long base, unsigned long size)
+{
+	return end >= base && addr < (base + size);
+}
+
+/*
  * Checks whether the range from addr to end, inclusive, overlaps the CPU
  * entry area range.
  */
 static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
 {
-	return end >= CPU_ENTRY_AREA_BASE &&
-	       addr < (CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_TOTAL_SIZE);
+	return within_area(addr, end, CPU_ENTRY_AREA_BASE,
+			   CPU_ENTRY_AREA_TOTAL_SIZE);
 }
 
 static int arch_build_bp_info(struct perf_event *bp,




* [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 01/14] x86/hw_breakpoint: Add within_area() to check data breakpoints Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30 12:45   ` Andrew Cooper
  2020-05-29 21:27 ` [PATCH 03/14] x86/hw_breakpoint: Prevent data breakpoints on per_cpu cpu_tss_rw Peter Zijlstra
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

From: Lai Jiangshan <laijs@linux.alibaba.com>

A data breakpoint on the GDT is terrifying and should be avoided.
The GDT in the CPU entry area is already protected; the direct GDT
should be protected as well, even though it is seldom used and only
for a short time.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200526014221.2119-3-laijs@linux.alibaba.com
---
 arch/x86/kernel/hw_breakpoint.c |   30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -32,6 +32,7 @@
 #include <asm/processor.h>
 #include <asm/debugreg.h>
 #include <asm/user.h>
+#include <asm/desc.h>
 
 /* Per cpu debug control register value */
 DEFINE_PER_CPU(unsigned long, cpu_dr7);
@@ -237,13 +238,26 @@ static inline bool within_area(unsigned
 }
 
 /*
- * Checks whether the range from addr to end, inclusive, overlaps the CPU
- * entry area range.
+ * Checks whether the range from addr to end, inclusive, overlaps the fixed
+ * mapped CPU entry area range or other ranges used for CPU entry.
  */
-static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
+static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
 {
-	return within_area(addr, end, CPU_ENTRY_AREA_BASE,
-			   CPU_ENTRY_AREA_TOTAL_SIZE);
+	int cpu;
+
+	/* CPU entry area is always used for CPU entry */
+	if (within_area(addr, end, CPU_ENTRY_AREA_BASE,
+			CPU_ENTRY_AREA_TOTAL_SIZE))
+		return true;
+
+	for_each_possible_cpu(cpu) {
+		/* The original rw GDT is being used after load_direct_gdt() */
+		if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
+				GDT_SIZE))
+			return true;
+	}
+
+	return false;
 }
 
 static int arch_build_bp_info(struct perf_event *bp,
@@ -257,12 +271,12 @@ static int arch_build_bp_info(struct per
 		return -EINVAL;
 
 	/*
-	 * Prevent any breakpoint of any type that overlaps the
-	 * cpu_entry_area.  This protects the IST stacks and also
+	 * Prevent any breakpoint of any type that overlaps the CPU
+	 * entry area and data.  This protects the IST stacks and also
 	 * reduces the chance that we ever find out what happens if
 	 * there's a data breakpoint on the GDT, IDT, or TSS.
 	 */
-	if (within_cpu_entry_area(attr->bp_addr, bp_end))
+	if (within_cpu_entry(attr->bp_addr, bp_end))
 		return -EINVAL;
 
 	hw->address = attr->bp_addr;




* [PATCH 03/14] x86/hw_breakpoint: Prevent data breakpoints on per_cpu cpu_tss_rw
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 01/14] x86/hw_breakpoint: Add within_area() to check data breakpoints Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 04/14] x86/hw_breakpoint: Prevent data breakpoints on user_pcid_flush_mask Peter Zijlstra
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

From: Lai Jiangshan <laijs@linux.alibaba.com>

cpu_tss_rw is not directly referenced by hardware, but it is used
in the CPU entry code, especially when #DB shifts its stacks. If a
data breakpoint covers cpu_tss_rw.x86_tss.ist[IST_INDEX_DB], it
causes a recursive #DB (and then soon a #DF, because the #DB is
raised after the IST-shifting access has completed).
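
As an illustration (a hypothetical snippet, not part of the patch),
the address the changelog warns about is:

  /* the IST slot consulted on #DB entry; a data breakpoint here recurses */
  unsigned long ist_db =
      (unsigned long)&per_cpu(cpu_tss_rw, cpu).x86_tss.ist[IST_INDEX_DB];

The hunk below covers the whole tss_struct, which contains this slot.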

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200526014221.2119-4-laijs@linux.alibaba.com
---
 arch/x86/kernel/hw_breakpoint.c |    9 +++++++++
 1 file changed, 9 insertions(+)

--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -255,6 +255,15 @@ static inline bool within_cpu_entry(unsi
 		if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
 				GDT_SIZE))
 			return true;
+
+		/*
+		 * cpu_tss_rw is not directly referenced by hardware, but
+		 * cpu_tss_rw is also used in CPU entry code.
+		 */
+		if (within_area(addr, end,
+				(unsigned long)&per_cpu(cpu_tss_rw, cpu),
+				sizeof(struct tss_struct)))
+			return true;
 	}
 
 	return false;




* [PATCH 04/14] x86/hw_breakpoint: Prevent data breakpoints on user_pcid_flush_mask
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (2 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 03/14] x86/hw_breakpoint: Prevent data breakpoints on per_cpu cpu_tss_rw Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 05/14] x86/entry: Introduce local_db_{save,restore}() Peter Zijlstra
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

From: Lai Jiangshan <laijs@linux.alibaba.com>

The percpu variable user_pcid_flush_mask is used in the CPU entry
code. A data breakpoint on it would cause an unwanted #DB.
Protect the full cpu_tlbstate structure to be sure.

There are some other percpu data used in CPU entry, but they are
either in already-protected cpu_tss_rw or are safe to trigger #DB
(espfix_waddr, espfix_stack).
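
A minimal sketch (illustrative, not part of the patch) of why covering
the whole structure suffices: the field used by the entry code is a
member of cpu_tlbstate, so any breakpoint overlapping it also overlaps
the protected range:

  /* user_pcid_flush_mask lives inside the protected tlb_state */
  void *field = &per_cpu(cpu_tlbstate, cpu).user_pcid_flush_mask;
  void *base  = &per_cpu(cpu_tlbstate, cpu);  /* sizeof(struct tlb_state) bytes */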

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200526014221.2119-5-laijs@linux.alibaba.com
---
 arch/x86/kernel/hw_breakpoint.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -33,6 +33,7 @@
 #include <asm/debugreg.h>
 #include <asm/user.h>
 #include <asm/desc.h>
+#include <asm/tlbflush.h>
 
 /* Per cpu debug control register value */
 DEFINE_PER_CPU(unsigned long, cpu_dr7);
@@ -264,6 +265,16 @@ static inline bool within_cpu_entry(unsi
 				(unsigned long)&per_cpu(cpu_tss_rw, cpu),
 				sizeof(struct tss_struct)))
 			return true;
+
+		/*
+		 * cpu_tlbstate.user_pcid_flush_mask is used for CPU entry.
+		 * A data breakpoint on it would cause an unwanted #DB.
+		 * Protect the full cpu_tlbstate structure to be sure.
+		 */
+		if (within_area(addr, end,
+				(unsigned long)&per_cpu(cpu_tlbstate, cpu),
+				sizeof(struct tlb_state)))
+			return true;
 	}
 
 	return false;




* [PATCH 05/14] x86/entry: Introduce local_db_{save,restore}()
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (3 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 04/14] x86/hw_breakpoint: Prevent data breakpoints on user_pcid_flush_mask Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 06/14] x86/entry, nmi: Disable #DB Peter Zijlstra
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

In order to allow exceptions other than #DB to disable breakpoints,
provide common helpers.
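
A usage sketch (hypothetical caller, not part of this patch) showing
the intended bracketing:

  unsigned long dr7;

  dr7 = local_db_save();      /* disable breakpoints, remember old DR7 */
  /* ... fragile work that must not raise #DB ... */
  local_db_restore(dr7);      /* re-arm breakpoints, if any were set */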

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/debugreg.h |   30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c         |   18 ++----------------
 2 files changed, 32 insertions(+), 16 deletions(-)

--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -113,6 +113,36 @@ static inline void debug_stack_usage_inc
 static inline void debug_stack_usage_dec(void) { }
 #endif /* X86_64 */
 
+static __always_inline unsigned long local_db_save(void)
+{
+	unsigned long dr7;
+
+	get_debugreg(dr7, 7);
+	dr7 &= ~0x400; /* architecturally set bit */
+	if (dr7)
+		set_debugreg(0, 7);
+	/*
+	 * Ensure the compiler doesn't lower the above statements into
+	 * the critical section; disabling breakpoints late would not
+	 * be good.
+	 */
+	barrier();
+
+	return dr7;
+}
+
+static __always_inline void local_db_restore(unsigned long dr7)
+{
+	/*
+	 * Ensure the compiler doesn't raise this statement into
+	 * the critical section; enabling breakpoints early would
+	 * not be good.
+	 */
+	barrier();
+	if (dr7)
+		set_debugreg(dr7, 7);
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 extern void set_dr_addr_mask(unsigned long mask, int dr);
 #else
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -727,15 +727,7 @@ static __always_inline void debug_enter(
 	 * Entry text is excluded for HW_BP_X and cpu_entry_area, which
 	 * includes the entry stack is excluded for everything.
 	 */
-	get_debugreg(*dr7, 7);
-	set_debugreg(0, 7);
-
-	/*
-	 * Ensure the compiler doesn't lower the above statements into
-	 * the critical section; disabling breakpoints late would not
-	 * be good.
-	 */
-	barrier();
+	*dr7 = local_db_save();
 
 	/*
 	 * The Intel SDM says:
@@ -756,13 +748,7 @@ static __always_inline void debug_enter(
 
 static __always_inline void debug_exit(unsigned long dr7)
 {
-	/*
-	 * Ensure the compiler doesn't raise this statement into
-	 * the critical section; enabling breakpoints early would
-	 * not be good.
-	 */
-	barrier();
-	set_debugreg(dr7, 7);
+	local_db_restore(dr7);
 }
 
 /*




* [PATCH 06/14] x86/entry, nmi: Disable #DB
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (4 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 05/14] x86/entry: Introduce local_db_{save,restore}() Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 07/14] x86/entry, mce: Disallow #DB during #MC Peter Zijlstra
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

Instead of playing stupid games with IST stacks, fully disallow #DB
during NMIs. There is absolutely no reason to allow them, and killing
this saves a heap of trouble.

We already disallow #DB on noinstr and CEA, so we can't get #DB before
this, and this ensures we can't get it after this either.
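
Condensed shape of exc_nmi() after this patch (a sketch of the hunks
below, not new code):

  this_cpu_write(nmi_dr7, local_db_save());
  nmi_enter();
  /* ... default_do_nmi() etc ... */
  nmi_exit();
  local_db_restore(this_cpu_read(nmi_dr7));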

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/nmi.c |   55 ++------------------------------------------------
 1 file changed, 3 insertions(+), 52 deletions(-)

--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -474,40 +474,7 @@ enum nmi_states {
 };
 static DEFINE_PER_CPU(enum nmi_states, nmi_state);
 static DEFINE_PER_CPU(unsigned long, nmi_cr2);
-
-#ifdef CONFIG_X86_64
-/*
- * In x86_64, we need to handle breakpoint -> NMI -> breakpoint.  Without
- * some care, the inner breakpoint will clobber the outer breakpoint's
- * stack.
- *
- * If a breakpoint is being processed, and the debug stack is being
- * used, if an NMI comes in and also hits a breakpoint, the stack
- * pointer will be set to the same fixed address as the breakpoint that
- * was interrupted, causing that stack to be corrupted. To handle this
- * case, check if the stack that was interrupted is the debug stack, and
- * if so, change the IDT so that new breakpoints will use the current
- * stack and not switch to the fixed address. On return of the NMI,
- * switch back to the original IDT.
- */
-static DEFINE_PER_CPU(int, update_debug_stack);
-
-static noinstr bool is_debug_stack(unsigned long addr)
-{
-	struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
-	unsigned long top = CEA_ESTACK_TOP(cs, DB);
-	unsigned long bot = CEA_ESTACK_BOT(cs, DB1);
-
-	if (__this_cpu_read(debug_stack_usage))
-		return true;
-	/*
-	 * Note, this covers the guard page between DB and DB1 as well to
-	 * avoid two checks. But by all means @addr can never point into
-	 * the guard page.
-	 */
-	return addr >= bot && addr < top;
-}
-#endif
+static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_NMI(exc_nmi)
 {
@@ -522,18 +489,7 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 	this_cpu_write(nmi_cr2, read_cr2());
 nmi_restart:
 
-#ifdef CONFIG_X86_64
-	/*
-	 * If we interrupted a breakpoint, it is possible that
-	 * the nmi handler will have breakpoints too. We need to
-	 * change the IDT such that breakpoints that happen here
-	 * continue to use the NMI stack.
-	 */
-	if (unlikely(is_debug_stack(regs->sp))) {
-		debug_stack_set_zero();
-		this_cpu_write(update_debug_stack, 1);
-	}
-#endif
+	this_cpu_write(nmi_dr7, local_db_save());
 
 	nmi_enter();
 
@@ -544,12 +500,7 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 
 	nmi_exit();
 
-#ifdef CONFIG_X86_64
-	if (unlikely(this_cpu_read(update_debug_stack))) {
-		debug_stack_reset();
-		this_cpu_write(update_debug_stack, 0);
-	}
-#endif
+	local_db_restore(this_cpu_read(nmi_dr7));
 
 	if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
 		write_cr2(this_cpu_read(nmi_cr2));




* [PATCH 07/14] x86/entry, mce: Disallow #DB during #MC
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (5 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 06/14] x86/entry, nmi: Disable #DB Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 08/14] x86/entry: Optimize local_db_save() for virt Peter Zijlstra
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

#MC is fragile as heck, don't tempt fate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/mce/core.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1936,22 +1936,34 @@ static __always_inline void exc_machine_
 /* MCE hit kernel mode */
 DEFINE_IDTENTRY_MCE(exc_machine_check)
 {
+	unsigned long dr7;
+
+	dr7 = local_db_save();
 	exc_machine_check_kernel(regs);
+	local_db_restore(dr7);
 }
 
 /* The user mode variant. */
 DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 {
+	unsigned long dr7;
+
+	dr7 = local_db_save();
 	exc_machine_check_user(regs);
+	local_db_restore(dr7);
 }
 #else
 /* 32bit unified entry point */
 DEFINE_IDTENTRY_MCE(exc_machine_check)
 {
+	unsigned long dr7;
+
+	dr7 = local_db_save();
 	if (user_mode(regs))
 		exc_machine_check_user(regs);
 	else
 		exc_machine_check_kernel(regs);
+	local_db_restore(dr7);
 }
 #endif
 




* [PATCH 08/14] x86/entry: Optimize local_db_save() for virt
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (6 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 07/14] x86/entry, mce: Disallow #DB during #MC Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-06-03  1:17   ` [PATCH 08/14] " Sean Christopherson
  2020-05-29 21:27 ` [PATCH 09/14] x86/entry: Remove debug IDT frobbing Peter Zijlstra
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy,
	Andy Lutomirski

Because DRn access is 'difficult' under virt, while on native a DR7
read is cheaper than a potential cacheline miss, add a virt-specific
fast path to local_db_save() such that when breakpoints are not in
use we avoid touching DRn entirely.
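
The restore side needs no matching virt check (a sketch restating the
helpers involved): local_db_save() returns 0 when no breakpoints are
armed, and local_db_restore() only writes DR7 for a non-zero value:

  dr7 = local_db_save();      /* 0 on a guest with no breakpoints; no DRn access */
  /* ... */
  local_db_restore(dr7);      /* internally: if (dr7) set_debugreg(dr7, 7); */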

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/debugreg.h |    7 ++++++-
 arch/x86/kernel/hw_breakpoint.c |   26 ++++++++++++++++++++++----
 arch/x86/kvm/vmx/nested.c       |    2 +-
 3 files changed, 29 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -85,8 +85,8 @@ static inline void hw_breakpoint_disable
 	set_debugreg(0UL, 3);
 }
 
-static inline int hw_breakpoint_active(void)
+static inline bool hw_breakpoint_active(void)
 {
 	return __this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
 }
 
@@ -117,6 +119,9 @@ static __always_inline unsigned long loc
 {
 	unsigned long dr7;
 
+	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
+		return 0;
+
 	get_debugreg(dr7, 7);
 	dr7 &= ~0x400; /* architecturally set bit */
 	if (dr7)
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -99,6 +99,8 @@ int arch_install_hw_breakpoint(struct pe
 	unsigned long *dr7;
 	int i;
 
+	lockdep_assert_irqs_disabled();
+
 	for (i = 0; i < HBP_NUM; i++) {
 		struct perf_event **slot = this_cpu_ptr(&bp_per_reg[i]);
 
@@ -117,6 +119,12 @@ int arch_install_hw_breakpoint(struct pe
 	dr7 = this_cpu_ptr(&cpu_dr7);
 	*dr7 |= encode_dr7(i, info->len, info->type);
 
+	/*
+	 * Ensure we first write cpu_dr7 before we set the DR7 register.
+	 * This ensures an NMI never sees cpu_dr7 == 0 while DR7 is set.
+	 */
+	barrier();
+
 	set_debugreg(*dr7, 7);
 	if (info->mask)
 		set_dr_addr_mask(info->mask, i);
@@ -136,9 +144,11 @@ int arch_install_hw_breakpoint(struct pe
 void arch_uninstall_hw_breakpoint(struct perf_event *bp)
 {
 	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
-	unsigned long *dr7;
+	unsigned long dr7;
 	int i;
 
+	lockdep_assert_irqs_disabled();
+
 	for (i = 0; i < HBP_NUM; i++) {
 		struct perf_event **slot = this_cpu_ptr(&bp_per_reg[i]);
 
@@ -151,12 +161,20 @@ void arch_uninstall_hw_breakpoint(struct
 	if (WARN_ONCE(i == HBP_NUM, "Can't find any breakpoint slot"))
 		return;
 
-	dr7 = this_cpu_ptr(&cpu_dr7);
-	*dr7 &= ~__encode_dr7(i, info->len, info->type);
+	dr7 = this_cpu_read(cpu_dr7);
+	dr7 &= ~__encode_dr7(i, info->len, info->type);
 
-	set_debugreg(*dr7, 7);
+	set_debugreg(dr7, 7);
 	if (info->mask)
 		set_dr_addr_mask(0, i);
+
+	/*
+	 * Ensure the write to cpu_dr7 is after we've set the DR7 register.
+	 * This ensures an NMI never sees cpu_dr7 == 0 while DR7 is set.
+	 */
+	barrier();
+
+	this_cpu_write(cpu_dr7, dr7);
 }
 
 static int arch_bp_generic_len(int x86_len)
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3028,9 +3028,9 @@ static int nested_vmx_check_vmentry_hw(s
 	/*
 	 * VMExit clears RFLAGS.IF and DR7, even on a consistency check.
 	 */
-	local_irq_enable();
 	if (hw_breakpoint_active())
 		set_debugreg(__this_cpu_read(cpu_dr7), 7);
+	local_irq_enable();
 	preempt_enable();
 
 	/*




* [PATCH 09/14] x86/entry: Remove debug IDT frobbing
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (7 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 08/14] x86/entry: Optimize local_db_save() for virt Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 10/14] x86/entry: Remove DBn stacks Peter Zijlstra
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

This is all unused now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/debugreg.h |   19 -------------------
 arch/x86/include/asm/desc.h     |   34 +---------------------------------
 arch/x86/kernel/cpu/common.c    |   17 -----------------
 arch/x86/kernel/idt.c           |   30 ------------------------------
 arch/x86/kernel/traps.c         |    9 ---------
 5 files changed, 1 insertion(+), 108 deletions(-)

--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -96,25 +96,6 @@ extern void aout_dump_debugregs(struct u
 
 extern void hw_breakpoint_restore(void);
 
-#ifdef CONFIG_X86_64
-DECLARE_PER_CPU(int, debug_stack_usage);
-static inline void debug_stack_usage_inc(void)
-{
-	__this_cpu_inc(debug_stack_usage);
-}
-static inline void debug_stack_usage_dec(void)
-{
-	__this_cpu_dec(debug_stack_usage);
-}
-void debug_stack_set_zero(void);
-void debug_stack_reset(void);
-#else /* !X86_64 */
-static inline void debug_stack_set_zero(void) { }
-static inline void debug_stack_reset(void) { }
-static inline void debug_stack_usage_inc(void) { }
-static inline void debug_stack_usage_dec(void) { }
-#endif /* X86_64 */
-
 static __always_inline unsigned long local_db_save(void)
 {
 	unsigned long dr7;
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -42,8 +42,6 @@ static inline void fill_ldt(struct desc_
 
 extern struct desc_ptr idt_descr;
 extern gate_desc idt_table[];
-extern const struct desc_ptr debug_idt_descr;
-extern gate_desc debug_idt_table[];
 
 struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
@@ -390,31 +388,6 @@ void alloc_intr_gate(unsigned int n, con
 
 extern unsigned long system_vectors[];
 
-#ifdef CONFIG_X86_64
-DECLARE_PER_CPU(u32, debug_idt_ctr);
-static __always_inline bool is_debug_idt_enabled(void)
-{
-	if (this_cpu_read(debug_idt_ctr))
-		return true;
-
-	return false;
-}
-
-static __always_inline void load_debug_idt(void)
-{
-	load_idt((const struct desc_ptr *)&debug_idt_descr);
-}
-#else
-static inline bool is_debug_idt_enabled(void)
-{
-	return false;
-}
-
-static inline void load_debug_idt(void)
-{
-}
-#endif
-
 /*
  * The load_current_idt() must be called with interrupts disabled
  * to avoid races. That way the IDT will always be set back to the expected
@@ -424,10 +397,7 @@ static inline void load_debug_idt(void)
  */
 static __always_inline void load_current_idt(void)
 {
-	if (is_debug_idt_enabled())
-		load_debug_idt();
-	else
-		load_idt((const struct desc_ptr *)&idt_descr);
+	load_idt((const struct desc_ptr *)&idt_descr);
 }
 
 extern void idt_setup_early_handler(void);
@@ -438,11 +408,9 @@ extern void idt_setup_apic_and_irq_gates
 #ifdef CONFIG_X86_64
 extern void idt_setup_early_pf(void);
 extern void idt_setup_ist_traps(void);
-extern void idt_setup_debugidt_traps(void);
 #else
 static inline void idt_setup_early_pf(void) { }
 static inline void idt_setup_ist_traps(void) { }
-static inline void idt_setup_debugidt_traps(void) { }
 #endif
 
 extern void idt_invalidate(void *addr);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1672,23 +1672,6 @@ void syscall_init(void)
 	       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
 }
 
-DEFINE_PER_CPU(int, debug_stack_usage);
-DEFINE_PER_CPU(u32, debug_idt_ctr);
-
-noinstr void debug_stack_set_zero(void)
-{
-	this_cpu_inc(debug_idt_ctr);
-	load_current_idt();
-}
-
-noinstr void debug_stack_reset(void)
-{
-	if (WARN_ON(!this_cpu_read(debug_idt_ctr)))
-		return;
-	if (this_cpu_dec_return(debug_idt_ctr) == 0)
-		load_current_idt();
-}
-
 #else	/* CONFIG_X86_64 */
 
 DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -158,14 +158,6 @@ static const __initconst struct idt_data
 static const __initconst struct idt_data early_pf_idts[] = {
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 };
-
-/*
- * Override for the debug_idt. Same as the default, but with interrupt
- * stack set to DEFAULT_STACK (0). Required for NMI trap handling.
- */
-static const __initconst struct idt_data dbg_idts[] = {
-	INTG(X86_TRAP_DB,		asm_exc_debug),
-};
 #endif
 
 /* Must be page-aligned because the real IDT is used in a fixmap. */
@@ -177,9 +169,6 @@ struct desc_ptr idt_descr __ro_after_ini
 };
 
 #ifdef CONFIG_X86_64
-/* No need to be aligned, but done to keep all IDTs defined the same way. */
-gate_desc debug_idt_table[IDT_ENTRIES] __page_aligned_bss;
-
 /*
  * The exceptions which use Interrupt stacks. They are setup after
  * cpu_init() when the TSS has been initialized.
@@ -192,15 +181,6 @@ static const __initconst struct idt_data
 	ISTG(X86_TRAP_MC,	asm_exc_machine_check,	IST_INDEX_MCE),
 #endif
 };
-
-/*
- * Override for the debug_idt. Same as the default, but with interrupt
- * stack set to DEFAULT_STACK (0). Required for NMI trap handling.
- */
-const struct desc_ptr debug_idt_descr = {
-	.size		= IDT_ENTRIES * 16 - 1,
-	.address	= (unsigned long) debug_idt_table,
-};
 #endif
 
 static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
@@ -292,16 +272,6 @@ void __init idt_setup_ist_traps(void)
 {
 	idt_setup_from_table(idt_table, ist_idts, ARRAY_SIZE(ist_idts), true);
 }
-
-/**
- * idt_setup_debugidt_traps - Initialize the debug idt table with debug traps
- */
-void __init idt_setup_debugidt_traps(void)
-{
-	memcpy(&debug_idt_table, &idt_table, IDT_ENTRIES * 16);
-
-	idt_setup_from_table(debug_idt_table, dbg_idts, ARRAY_SIZE(dbg_idts), false);
-}
 #endif
 
 /**
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -798,12 +798,6 @@ static void noinstr handle_debug(struct
 		return;
 	}
 
-	/*
-	 * Let others (NMI) know that the debug stack is in use
-	 * as we may switch to the interrupt stack.
-	 */
-	debug_stack_usage_inc();
-
 	/* It's safe to allow irq's after DR6 has been saved */
 	cond_local_irq_enable(regs);
 
@@ -831,7 +825,6 @@ static void noinstr handle_debug(struct
 
 out:
 	cond_local_irq_disable(regs);
-	debug_stack_usage_dec();
 	instrumentation_end();
 }
 
@@ -1077,6 +1070,4 @@ void __init trap_init(void)
 	cpu_init();
 
 	idt_setup_ist_traps();
-
-	idt_setup_debugidt_traps();
 }




* [PATCH 10/14] x86/entry: Remove DBn stacks
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (8 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 09/14] x86/entry: Remove debug IDT frobbing Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu() Peter Zijlstra
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

Both #DB itself, as well as all other IST users (NMI, #MC), now clear
DR7 on entry. Combined with disallowing breakpoints on
entry/noinstr/NOKPROBE text and no single-stepping (EFLAGS.TF) inside
the #DB handler, this should guarantee no nested #DB.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_64.S             |   17 -----------------
 arch/x86/include/asm/cpu_entry_area.h |   12 +++---------
 arch/x86/kernel/asm-offsets_64.c      |    3 ---
 arch/x86/kernel/dumpstack_64.c        |    7 ++-----
 arch/x86/mm/cpu_entry_area.c          |    1 -
 5 files changed, 5 insertions(+), 35 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -396,11 +396,6 @@ SYM_CODE_END(\asmsym)
 	idtentry \vector asm_\cfunc \cfunc has_error_code=0
 .endm
 
-/*
- * MCE and DB exceptions
- */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + (x) * 8)
-
 /**
  * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
  * @vector:		Vector number
@@ -416,10 +411,6 @@ SYM_CODE_END(\asmsym)
  * If hits in kernel mode then it needs to go through the paranoid
  * entry as the exception can hit any random state. No preemption
  * check on exit to keep the paranoid path simple.
- *
- * If the trap is #DB then the interrupt stack entry in the IST is
- * moved to the second stack, so a potential recursion will have a
- * fresh IST.
  */
 .macro idtentry_mce_db vector asmsym cfunc
 SYM_CODE_START(\asmsym)
@@ -445,16 +436,8 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
-	.if \vector == X86_TRAP_DB
-		subq	$DB_STACK_OFFSET, CPU_TSS_IST(IST_INDEX_DB)
-	.endif
-
 	call	\cfunc
 
-	.if \vector == X86_TRAP_DB
-		addq	$DB_STACK_OFFSET, CPU_TSS_IST(IST_INDEX_DB)
-	.endif
-
 	jmp	paranoid_exit
 
 	/* Switch to the regular task stack and use the noist entry point */
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -11,15 +11,11 @@
 #ifdef CONFIG_X86_64
 
 /* Macro to enforce the same ordering and stack sizes */
-#define ESTACKS_MEMBERS(guardsize, db2_holesize)\
+#define ESTACKS_MEMBERS(guardsize)		\
 	char	DF_stack_guard[guardsize];	\
 	char	DF_stack[EXCEPTION_STKSZ];	\
 	char	NMI_stack_guard[guardsize];	\
 	char	NMI_stack[EXCEPTION_STKSZ];	\
-	char	DB2_stack_guard[guardsize];	\
-	char	DB2_stack[db2_holesize];	\
-	char	DB1_stack_guard[guardsize];	\
-	char	DB1_stack[EXCEPTION_STKSZ];	\
 	char	DB_stack_guard[guardsize];	\
 	char	DB_stack[EXCEPTION_STKSZ];	\
 	char	MCE_stack_guard[guardsize];	\
@@ -28,12 +24,12 @@
 
 /* The exception stacks' physical storage. No guard pages required */
 struct exception_stacks {
-	ESTACKS_MEMBERS(0, 0)
+	ESTACKS_MEMBERS(0)
 };
 
 /* The effective cpu entry area mapping with guard pages. */
 struct cea_exception_stacks {
-	ESTACKS_MEMBERS(PAGE_SIZE, EXCEPTION_STKSZ)
+	ESTACKS_MEMBERS(PAGE_SIZE)
 };
 
 /*
@@ -42,8 +38,6 @@ struct cea_exception_stacks {
 enum exception_stack_ordering {
 	ESTACK_DF,
 	ESTACK_NMI,
-	ESTACK_DB2,
-	ESTACK_DB1,
 	ESTACK_DB,
 	ESTACK_MCE,
 	N_EXCEPTION_STACKS
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -57,9 +57,6 @@ int main(void)
 	BLANK();
 #undef ENTRY
 
-	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
-	DEFINE(DB_STACK_OFFSET, offsetof(struct cea_exception_stacks, DB_stack) -
-	       offsetof(struct cea_exception_stacks, DB1_stack));
 	BLANK();
 
 #ifdef CONFIG_STACKPROTECTOR
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -22,15 +22,13 @@
 static const char * const exception_stack_names[] = {
 		[ ESTACK_DF	]	= "#DF",
 		[ ESTACK_NMI	]	= "NMI",
-		[ ESTACK_DB2	]	= "#DB2",
-		[ ESTACK_DB1	]	= "#DB1",
 		[ ESTACK_DB	]	= "#DB",
 		[ ESTACK_MCE	]	= "#MC",
 };
 
 const char *stack_type_name(enum stack_type type)
 {
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
 
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
@@ -79,7 +77,6 @@ static const
 struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
 	EPAGERANGE(DF),
 	EPAGERANGE(NMI),
-	EPAGERANGE(DB1),
 	EPAGERANGE(DB),
 	EPAGERANGE(MCE),
 };
@@ -91,7 +88,7 @@ static bool in_exception_stack(unsigned
 	struct pt_regs *regs;
 	unsigned int k;
 
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
 
 	begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
 	/*
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -107,7 +107,6 @@ static void __init percpu_setup_exceptio
 	 */
 	cea_map_stack(DF);
 	cea_map_stack(NMI);
-	cea_map_stack(DB1);
 	cea_map_stack(DB);
 	cea_map_stack(MCE);
 }




* [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (9 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 10/14] x86/entry: Remove DBn stacks Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-06-02 14:42   ` [PATCH 11/14] " Qian Cai
  2020-05-29 21:27 ` [PATCH 12/14] x86/entry: Rename trace_hardirqs_off_prepare() Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

Because:

  irq_enter_rcu() includes lockdep_hardirq_enter()
  irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()

This asymmetry resulted in two 'stray' lockdep_hardirq_exit() calls in
idtentry.h, and me spending a long time trying to find the matching
enter calls.
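
After this change the pairing is symmetric (a sketch of the resulting
contract, not new code):

  irq_enter_rcu();            /* includes lockdep_hardirq_enter() */
  /* ... handle the interrupt ... */
  irq_exit_rcu();             /* now ends with lockdep_hardirq_exit() */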

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/idtentry.h |    2 --
 kernel/softirq.c                |   19 +++++++++++++------
 2 files changed, 13 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -206,7 +206,6 @@ __visible noinstr void func(struct pt_re
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	__##func (regs, (u8)error_code);				\
 	irq_exit_rcu();							\
-	lockdep_hardirq_exit();						\
 	instrumentation_end();						\
 	idtentry_exit_cond_rcu(regs, rcu_exit);				\
 }									\
@@ -249,7 +248,6 @@ __visible noinstr void func(struct pt_re
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	run_on_irqstack_cond(__##func, regs, regs);			\
 	irq_exit_rcu();							\
-	lockdep_hardirq_exit();						\
 	instrumentation_end();						\
 	idtentry_exit_cond_rcu(regs, rcu_exit);				\
 }									\
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -404,12 +404,7 @@ static inline void tick_irq_exit(void)
 #endif
 }
 
-/**
- * irq_exit_rcu() - Exit an interrupt context without updating RCU
- *
- * Also processes softirqs if needed and possible.
- */
-void irq_exit_rcu(void)
+static inline void __irq_exit_rcu(void)
 {
 #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
 	local_irq_disable();
@@ -425,6 +420,18 @@ void irq_exit_rcu(void)
 }
 
 /**
+ * irq_exit_rcu() - Exit an interrupt context without updating RCU
+ *
+ * Also processes softirqs if needed and possible.
+ */
+void irq_exit_rcu(void)
+{
+	__irq_exit_rcu();
+	 /* must be last! */
+	lockdep_hardirq_exit();
+}
+
+/**
  * irq_exit - Exit an interrupt context, update RCU and lockdep
  *
  * Also processes softirqs if needed and possible.




* [PATCH 12/14] x86/entry: Rename trace_hardirqs_off_prepare()
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (10 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu() Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking Peter Zijlstra
  2020-05-29 21:27 ` [PATCH 14/14] x86/entry: Fix NMI vs " Peter Zijlstra
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

The typical pattern for trace_hardirqs_off_prepare() is:

  ENTRY
    lockdep_hardirqs_off(); // because hardware
    ... do entry magic
    instrumentation_begin();
    trace_hardirqs_off_prepare();
    ... do actual work
    trace_hardirqs_on_prepare();
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    ... do exit magic
    lockdep_hardirqs_on();

which shows that it is misnamed; rename it to
trace_hardirqs_off_finish(), as it concludes the hardirq_off
transition.

Also, given that the above is the only correct order, make the
traditional all-in-one trace_hardirqs_off() follow suit.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/common.c         |    6 +++---
 arch/x86/kernel/cpu/mce/core.c  |    2 +-
 arch/x86/kernel/nmi.c           |    2 +-
 arch/x86/kernel/traps.c         |    4 ++--
 include/linux/irqflags.h        |    4 ++--
 kernel/trace/trace_preemptirq.c |   10 +++++-----
 6 files changed, 14 insertions(+), 14 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -65,7 +65,7 @@ static noinstr void enter_from_user_mode
 
 	instrumentation_begin();
 	CT_WARN_ON(state != CONTEXT_USER);
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
 #else
@@ -73,7 +73,7 @@ static __always_inline void enter_from_u
 {
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
 #endif
@@ -569,7 +569,7 @@ bool noinstr idtentry_enter_cond_rcu(str
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
 		instrumentation_begin();
-		trace_hardirqs_off_prepare();
+		trace_hardirqs_off_finish();
 		instrumentation_end();
 
 		return true;
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1915,7 +1915,7 @@ static __always_inline void exc_machine_
 	 * that out because it's an indirect call. Annotate it.
 	 */
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	machine_check_vector(regs);
 	if (regs->flags & X86_EFLAGS_IF)
 		trace_hardirqs_on_prepare();
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -330,7 +330,7 @@ static noinstr void default_do_nmi(struc
 	__this_cpu_write(last_nmi_rip, regs->ip);
 
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 
 	handled = nmi_handle(NMI_LOCAL, regs);
 	__this_cpu_add(nmi_stats.normal, handled);
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -634,7 +634,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 	} else {
 		nmi_enter();
 		instrumentation_begin();
-		trace_hardirqs_off_prepare();
+		trace_hardirqs_off_finish();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
 		if (regs->flags & X86_EFLAGS_IF)
@@ -854,7 +854,7 @@ static __always_inline void exc_debug_ke
 {
 	nmi_enter();
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	instrumentation_end();
 
 	/*
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -32,7 +32,7 @@
 
 #ifdef CONFIG_TRACE_IRQFLAGS
   extern void trace_hardirqs_on_prepare(void);
-  extern void trace_hardirqs_off_prepare(void);
+  extern void trace_hardirqs_off_finish(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
 # define lockdep_hardirq_context(p)	((p)->hardirq_context)
@@ -101,7 +101,7 @@ do {						\
 
 #else
 # define trace_hardirqs_on_prepare()		do { } while (0)
-# define trace_hardirqs_off_prepare()		do { } while (0)
+# define trace_hardirqs_off_finish()		do { } while (0)
 # define trace_hardirqs_on()		do { } while (0)
 # define trace_hardirqs_off()		do { } while (0)
 # define lockdep_hardirq_context(p)	0
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -58,7 +58,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on);
  * and lockdep uses a staged approach which splits the lockdep hardirq
  * tracking into a RCU on and a RCU off section.
  */
-void trace_hardirqs_off_prepare(void)
+void trace_hardirqs_off_finish(void)
 {
 	if (!this_cpu_read(tracing_irq_cpu)) {
 		this_cpu_write(tracing_irq_cpu, 1);
@@ -68,19 +68,19 @@ void trace_hardirqs_off_prepare(void)
 	}
 
 }
-EXPORT_SYMBOL(trace_hardirqs_off_prepare);
-NOKPROBE_SYMBOL(trace_hardirqs_off_prepare);
+EXPORT_SYMBOL(trace_hardirqs_off_finish);
+NOKPROBE_SYMBOL(trace_hardirqs_off_finish);
 
 void trace_hardirqs_off(void)
 {
+	lockdep_hardirqs_off(CALLER_ADDR0);
+
 	if (!this_cpu_read(tracing_irq_cpu)) {
 		this_cpu_write(tracing_irq_cpu, 1);
 		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
 		if (!in_nmi())
 			trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
 	}
-
-	lockdep_hardirqs_off(CALLER_ADDR0);
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 NOKPROBE_SYMBOL(trace_hardirqs_off);




* [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (11 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 12/14] x86/entry: Rename trace_hardirqs_off_prepare() Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  2020-05-29 22:14   ` Steven Rostedt
  2020-05-29 21:27 ` [PATCH 14/14] x86/entry: Fix NMI vs " Peter Zijlstra
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

There is no reason not to always, accurately, track IRQ state.

This change also makes IRQ state tracking ignore lockdep_off().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/lockdep.c |   33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3646,7 +3646,13 @@ static void __trace_hardirqs_on_caller(v
  */
 void lockdep_hardirqs_on_prepare(unsigned long ip)
 {
-	if (unlikely(!debug_locks || current->lockdep_recursion))
+	/*
+	 * NMIs do not (and cannot) track lock dependencies, nothing to do.
+	 */
+	if (in_nmi())
+		return;
+
+	if (DEBUG_LOCKS_WARN_ON(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
 		return;
 
 	if (unlikely(current->hardirqs_enabled)) {
@@ -3692,7 +3698,24 @@ void noinstr lockdep_hardirqs_on(unsigne
 {
 	struct task_struct *curr = current;
 
-	if (unlikely(!debug_locks || curr->lockdep_recursion))
+	/*
+	 * NMIs can happen in the middle of local_irq_{en,dis}able() where the
+	 * tracking state and hardware state are out of sync.
+	 *
+	 * NMIs must save lockdep_hardirqs_enabled() to restore IRQ state from,
+	 * and not rely on hardware state like normal interrupts.
+	 */
+	if (in_nmi()) {
+		/*
+		 * Skip:
+		 *  - recursion check, because NMI can hit lockdep;
+		 *  - hardware state check, because above;
+		 *  - chain_key check, see lockdep_hardirqs_on_prepare().
+		 */
+		goto skip_checks;
+	}
+
+	if (DEBUG_LOCKS_WARN_ON(curr->lockdep_recursion & LOCKDEP_RECURSION_MASK))
 		return;
 
 	if (curr->hardirqs_enabled) {
@@ -3720,6 +3743,7 @@ void noinstr lockdep_hardirqs_on(unsigne
 	DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key !=
 			    current->curr_chain_key);
 
+skip_checks:
 	/* we'll do an OFF -> ON transition: */
 	curr->hardirqs_enabled = 1;
 	curr->hardirq_enable_ip = ip;
@@ -3735,7 +3759,10 @@ void noinstr lockdep_hardirqs_off(unsign
 {
 	struct task_struct *curr = current;
 
-	if (unlikely(!debug_locks || curr->lockdep_recursion))
+	/*
+	 * NMIs can happen in lockdep.
+	 */
+	if (!in_nmi() && DEBUG_LOCKS_WARN_ON(curr->lockdep_recursion & LOCKDEP_RECURSION_MASK))
 		return;
 
 	/*




* [PATCH 14/14] x86/entry: Fix NMI vs IRQ state tracking
  2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
                   ` (12 preceding siblings ...)
  2020-05-29 21:27 ` [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking Peter Zijlstra
@ 2020-05-29 21:27 ` Peter Zijlstra
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 21:27 UTC (permalink / raw)
  To: tglx, luto, peterz
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	andrew.cooper3, daniel.thompson, a.darwish, rostedt, bigeasy

While the nmi_enter() users did
trace_hardirqs_{off_prepare,on_finish}(), there were no matching
lockdep_hardirqs_*() calls to complete the picture.

Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state
tracking across the NMIs.
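
Callers bracket the NMI body with the pair (a sketch mirroring the
exc_nmi() hunk below):

  bool irq_state = idtentry_enter_nmi(regs);
  /* ... NMI handling ... */
  idtentry_exit_nmi(regs, irq_state);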

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/common.c         |   42 ++++++++++++++++++++++++++++++++++++----
 arch/x86/include/asm/idtentry.h |    3 ++
 arch/x86/kernel/nmi.c           |    9 +++-----
 arch/x86/kernel/traps.c         |   20 ++++---------------
 include/linux/hardirq.h         |   28 ++++++++++++++++++--------
 5 files changed, 69 insertions(+), 33 deletions(-)

--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -550,7 +550,7 @@ SYSCALL_DEFINE0(ni_syscall)
  * The return value must be fed into the rcu_exit argument of
  * idtentry_exit_cond_rcu().
  */
-bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
+noinstr bool idtentry_enter_cond_rcu(struct pt_regs *regs)
 {
 	if (user_mode(regs)) {
 		enter_from_user_mode();
@@ -619,7 +619,7 @@ static void idtentry_exit_cond_resched(s
  * Counterpart to idtentry_enter_cond_rcu(). The return value of the entry
  * function must be fed into the @rcu_exit argument.
  */
-void noinstr idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit)
+noinstr void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit)
 {
 	lockdep_assert_irqs_disabled();
 
@@ -663,7 +663,7 @@ void noinstr idtentry_exit_cond_rcu(stru
  * Invokes enter_from_user_mode() to establish the proper context for
  * NOHZ_FULL. Otherwise scheduling on exit would not be possible.
  */
-void noinstr idtentry_enter_user(struct pt_regs *regs)
+noinstr void idtentry_enter_user(struct pt_regs *regs)
 {
 	enter_from_user_mode();
 }
@@ -680,13 +680,47 @@ void noinstr idtentry_enter_user(struct
  *
  * Counterpart to idtentry_enter_user().
  */
-void noinstr idtentry_exit_user(struct pt_regs *regs)
+noinstr void idtentry_exit_user(struct pt_regs *regs)
 {
 	lockdep_assert_irqs_disabled();
 
 	prepare_exit_to_usermode(regs);
 }
 
+noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
+{
+	bool irq_state = lockdep_hardirqs_enabled(current);
+
+	__nmi_enter();
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	lockdep_hardirq_enter();
+	rcu_nmi_enter();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	ftrace_nmi_enter();
+	instrumentation_end();
+
+	return irq_state;
+}
+
+noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
+{
+	instrumentation_begin();
+	ftrace_nmi_exit();
+	if (restore) {
+		trace_hardirqs_on_prepare();
+		lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	}
+	instrumentation_end();
+
+	rcu_nmi_exit();
+	lockdep_hardirq_exit();
+	if (restore)
+		lockdep_hardirqs_on(CALLER_ADDR0);
+	__nmi_exit();
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -16,6 +16,9 @@ void idtentry_exit_user(struct pt_regs *
 bool idtentry_enter_cond_rcu(struct pt_regs *regs);
 void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit);
 
+bool idtentry_enter_nmi(struct pt_regs *regs);
+void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -330,7 +330,6 @@ static noinstr void default_do_nmi(struc
 	__this_cpu_write(last_nmi_rip, regs->ip);
 
 	instrumentation_begin();
-	trace_hardirqs_off_finish();
 
 	handled = nmi_handle(NMI_LOCAL, regs);
 	__this_cpu_add(nmi_stats.normal, handled);
@@ -417,8 +416,6 @@ static noinstr void default_do_nmi(struc
 		unknown_nmi_error(reason, regs);
 
 out:
-	if (regs->flags & X86_EFLAGS_IF)
-		trace_hardirqs_on_prepare();
 	instrumentation_end();
 }
 
@@ -478,6 +475,8 @@ static DEFINE_PER_CPU(unsigned long, nmi
 
 DEFINE_IDTENTRY_NMI(exc_nmi)
 {
+	bool irq_state;
+
 	if (IS_ENABLED(CONFIG_SMP) && cpu_is_offline(smp_processor_id()))
 		return;
 
@@ -491,14 +490,14 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 
 	this_cpu_write(nmi_dr7, local_db_save());
 
-	nmi_enter();
+	irq_state = idtentry_enter_nmi(regs);
 
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
-	nmi_exit();
+	idtentry_exit_nmi(regs, irq_state);
 
 	local_db_restore(this_cpu_read(nmi_dr7));
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -387,7 +387,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 	}
 #endif
 
-	nmi_enter();
+	idtentry_enter_nmi(regs);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -632,15 +632,12 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		instrumentation_end();
 		idtentry_exit_user(regs);
 	} else {
-		nmi_enter();
+		bool irq_state = idtentry_enter_nmi(regs);
 		instrumentation_begin();
-		trace_hardirqs_off_finish();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
-		if (regs->flags & X86_EFLAGS_IF)
-			trace_hardirqs_on_prepare();
 		instrumentation_end();
-		nmi_exit();
+		idtentry_exit_nmi(regs, irq_state);
 	}
 }
 
@@ -831,10 +828,7 @@ static void noinstr handle_debug(struct
 static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 					     unsigned long dr6)
 {
-	nmi_enter();
-	instrumentation_begin();
-	trace_hardirqs_off_finish();
-	instrumentation_end();
+	bool irq_state = idtentry_enter_nmi(regs);
 
 	/*
 	 * The SDM says "The processor clears the BTF flag when it
@@ -857,11 +851,7 @@ static __always_inline void exc_debug_ke
 	if (dr6)
 		handle_debug(regs, dr6, false);
 
-	instrumentation_begin();
-	if (regs->flags & X86_EFLAGS_IF)
-		trace_hardirqs_on_prepare();
-	instrumentation_end();
-	nmi_exit();
+	idtentry_exit_nmi(regs, irq_state);
 }
 
 static __always_inline void exc_debug_user(struct pt_regs *regs,
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -111,32 +111,42 @@ extern void rcu_nmi_exit(void);
 /*
  * nmi_enter() can nest up to 15 times; see NMI_BITS.
  */
-#define nmi_enter()						\
+#define __nmi_enter()						\
 	do {							\
+		lockdep_off();					\
 		arch_nmi_enter();				\
 		printk_nmi_enter();				\
-		lockdep_off();					\
 		BUG_ON(in_nmi() == NMI_MASK);			\
 		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
-		rcu_nmi_enter();				\
+	} while (0)
+
+#define nmi_enter()						\
+	do {							\
+		__nmi_enter();					\
 		lockdep_hardirq_enter();			\
+		rcu_nmi_enter();				\
 		instrumentation_begin();			\
 		ftrace_nmi_enter();				\
 		instrumentation_end();				\
 	} while (0)
 
+#define __nmi_exit()						\
+	do {							\
+		BUG_ON(!in_nmi());				\
+		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
+		printk_nmi_exit();				\
+		arch_nmi_exit();				\
+		lockdep_on();					\
+	} while (0)
+
 #define nmi_exit()						\
 	do {							\
 		instrumentation_begin();			\
 		ftrace_nmi_exit();				\
 		instrumentation_end();				\
-		lockdep_hardirq_exit();				\
 		rcu_nmi_exit();					\
-		BUG_ON(!in_nmi());				\
-		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
-		lockdep_on();					\
-		printk_nmi_exit();				\
-		arch_nmi_exit();				\
+		lockdep_hardirq_exit();				\
+		__nmi_exit();					\
 	} while (0)
 
 #endif /* LINUX_HARDIRQ_H */




* Re: [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking
  2020-05-29 21:27 ` [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking Peter Zijlstra
@ 2020-05-29 22:14   ` Steven Rostedt
  2020-05-29 22:25     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2020-05-29 22:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, bigeasy

On Fri, 29 May 2020 23:27:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> There is no reason not to always, accurately, track IRQ state.
> 
> This change also makes IRQ state tracking ignore lockdep_off().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/locking/lockdep.c |   33 ++++++++++++++++++++++++++++++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
> 
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3646,7 +3646,13 @@ static void __trace_hardirqs_on_caller(v
>   */
>  void lockdep_hardirqs_on_prepare(unsigned long ip)
>  {
> -	if (unlikely(!debug_locks || current->lockdep_recursion))

Why remove the check for debug_locks? Isn't that there to disable
everything at once to prevent more warnings to be printed?

Also, aren't there other ways that we could have recursion besides NMIs?
Say we do a printk inside here, or call something that may also enable
interrupts? I thought the recursion check was also there to prevent
problems when the lockdep infrastructure calls something that lockdep
itself monitors?

Or am I missing something?

-- Steve


> +	/*
> +	 * NMIs do not (and cannot) track lock dependencies, nothing to do.
> +	 */
> +	if (in_nmi())
> +		return;
> +
> +	if (DEBUG_LOCKS_WARN_ON(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
>  		return;
>  
>  	if (unlikely(current->hardirqs_enabled)) {
> @@ -3692,7 +3698,24 @@ void noinstr lockdep_hardirqs_on(unsigne
>  {
>  	struct task_struct *curr = current;
>  
> -	if (unlikely(!debug_locks || curr->lockdep_recursion))
> +	/*
> +	 * NMIs can happen in the middle of local_irq_{en,dis}able() where the
> +	 * tracking state and hardware state are out of sync.
> +	 *
> +	 * NMIs must save lockdep_hardirqs_enabled() to restore IRQ state from,
> +	 * and not rely on hardware state like normal interrupts.
> +	 */
> +	if (in_nmi()) {
> +		/*
> +		 * Skip:
> +		 *  - recursion check, because NMI can hit lockdep;
> +		 *  - hardware state check, because above;
> +		 *  - chain_key check, see lockdep_hardirqs_on_prepare().
> +		 */
> +		goto skip_checks;
> +	}
> +
> +	if (DEBUG_LOCKS_WARN_ON(curr->lockdep_recursion & LOCKDEP_RECURSION_MASK))
>  		return;
>  
>  	if (curr->hardirqs_enabled) {
> @@ -3720,6 +3743,7 @@ void noinstr lockdep_hardirqs_on(unsigne
>  	DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key !=
>  			    current->curr_chain_key);
>  
> +skip_checks:
>  	/* we'll do an OFF -> ON transition: */
>  	curr->hardirqs_enabled = 1;
>  	curr->hardirq_enable_ip = ip;
> @@ -3735,7 +3759,10 @@ void noinstr lockdep_hardirqs_off(unsign
>  {
>  	struct task_struct *curr = current;
>  
> -	if (unlikely(!debug_locks || curr->lockdep_recursion))
> +	/*
> +	 * NMIs can happen in lockdep.
> +	 */
> +	if (!in_nmi() && DEBUG_LOCKS_WARN_ON(curr->lockdep_recursion & LOCKDEP_RECURSION_MASK))
>  		return;
>  
>  	/*
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking
  2020-05-29 22:14   ` Steven Rostedt
@ 2020-05-29 22:25     ` Peter Zijlstra
  2020-05-29 22:28       ` Steven Rostedt
                         ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 22:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, bigeasy

On Fri, May 29, 2020 at 06:14:01PM -0400, Steven Rostedt wrote:
> On Fri, 29 May 2020 23:27:41 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > There is no reason not to always, accurately, track IRQ state.
> > 
> > This change also makes IRQ state tracking ignore lockdep_off().
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/locking/lockdep.c |   33 ++++++++++++++++++++++++++++++---
> >  1 file changed, 30 insertions(+), 3 deletions(-)
> > 
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -3646,7 +3646,13 @@ static void __trace_hardirqs_on_caller(v
> >   */
> >  void lockdep_hardirqs_on_prepare(unsigned long ip)
> >  {
> > -	if (unlikely(!debug_locks || current->lockdep_recursion))
> 
> Why remove the check for debug_locks? Isn't that there to disable
> everything at once to prevent more warnings from being printed?

Yeah, maybe. I was thinking we could keep IRQ state running. But you're
right, if we mess up the IRQ state itself this might generate a wee
mess.

> Also, aren't there other ways that we could have recursion besides NMIs?
> Say we do a printk inside here, or call something that may also enable
> interrupts? I thought the recursion check was also to prevent problems
> when the lockdep infrastructure calls something that lockdep monitors?
> 
> Or am I missing something?

> > +	/*
> > +	 * NMIs do not (and cannot) track lock dependencies, nothing to do.
> > +	 */
> > +	if (in_nmi())
> > +		return;
> > +
> > +	if (DEBUG_LOCKS_WARN_ON(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
> >  		return;

^^ there's your regular recursion check.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking
  2020-05-29 22:25     ` Peter Zijlstra
@ 2020-05-29 22:28       ` Steven Rostedt
  2020-05-29 22:33       ` Peter Zijlstra
  2020-06-02 20:00       ` Peter Zijlstra
  2 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2020-05-29 22:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, bigeasy

On Sat, 30 May 2020 00:25:05 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > > +	if (DEBUG_LOCKS_WARN_ON(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
> > >  		return;  
> 
> ^^ there's your regular recursion check.

Yes, but this is more of a "bug if it happens" than just "ignore it".

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking
  2020-05-29 22:25     ` Peter Zijlstra
  2020-05-29 22:28       ` Steven Rostedt
@ 2020-05-29 22:33       ` Peter Zijlstra
  2020-06-02 20:00       ` Peter Zijlstra
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-05-29 22:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, bigeasy

On Sat, May 30, 2020 at 12:25:05AM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2020 at 06:14:01PM -0400, Steven Rostedt wrote:
> > On Fri, 29 May 2020 23:27:41 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > There is no reason not to always, accurately, track IRQ state.
> > > 
> > > This change also makes IRQ state tracking ignore lockdep_off().
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/locking/lockdep.c |   33 ++++++++++++++++++++++++++++++---
> > >  1 file changed, 30 insertions(+), 3 deletions(-)
> > > 
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -3646,7 +3646,13 @@ static void __trace_hardirqs_on_caller(v
> > >   */
> > >  void lockdep_hardirqs_on_prepare(unsigned long ip)
> > >  {
> > > -	if (unlikely(!debug_locks || current->lockdep_recursion))
> > 
> > Why remove the check for debug_locks? Isn't that there to disable
> > everything at once to prevent more warnings from being printed?
> 
> Yeah, maybe. I was thinking we could keep IRQ state running. But you're
> right, if we mess up the IRQ state itself this might generate a wee
> mess.

That is, mostly the IRQ state recovers when we mess up. It's only when
we then trigger more fail that we crash and burn, and that will likely
already give more warnings.

But I can put the debug_locks check back.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Rename trace_hardirqs_off_prepare()
  2020-05-29 21:27 ` [PATCH 12/14] x86/entry: Rename trace_hardirqs_off_prepare() Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     029149180d1d6e05e81e7db0d46c00960ab2e84f
Gitweb:        https://git.kernel.org/tip/029149180d1d6e05e81e7db0d46c00960ab2e84f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:40 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:11 +02:00

x86/entry: Rename trace_hardirqs_off_prepare()

The typical pattern for trace_hardirqs_off_prepare() is:

  ENTRY
    lockdep_hardirqs_off(); // because hardware
    ... do entry magic
    instrumentation_begin();
    trace_hardirqs_off_prepare();
    ... do actual work
    trace_hardirqs_on_prepare();
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    ... do exit magic
    lockdep_hardirqs_on();

which shows that it is misnamed; rename it to
trace_hardirqs_off_finish(), as it concludes the hardirq_off transition.

Also, given that the above is the only correct order, make the traditional
all-in-one trace_hardirqs_off() follow suit.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.415774872@infradead.org

---
 arch/x86/entry/common.c         |  6 +++---
 arch/x86/kernel/cpu/mce/core.c  |  2 +-
 arch/x86/kernel/nmi.c           |  2 +-
 arch/x86/kernel/traps.c         |  4 ++--
 include/linux/irqflags.h        |  4 ++--
 kernel/trace/trace_preemptirq.c | 10 +++++-----
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 17a9a5a..aea6b4f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -65,7 +65,7 @@ static noinstr void enter_from_user_mode(void)
 
 	instrumentation_begin();
 	CT_WARN_ON(state != CONTEXT_USER);
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
 #else
@@ -73,7 +73,7 @@ static __always_inline void enter_from_user_mode(void)
 {
 	lockdep_hardirqs_off(CALLER_ADDR0);
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	instrumentation_end();
 }
 #endif
@@ -569,7 +569,7 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs)
 		lockdep_hardirqs_off(CALLER_ADDR0);
 		rcu_irq_enter();
 		instrumentation_begin();
-		trace_hardirqs_off_prepare();
+		trace_hardirqs_off_finish();
 		instrumentation_end();
 
 		return true;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index be49926..b9cb381 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1922,7 +1922,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 	 * that out because it's an indirect call. Annotate it.
 	 */
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	machine_check_vector(regs);
 	if (regs->flags & X86_EFLAGS_IF)
 		trace_hardirqs_on_prepare();
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 52a708e..4a43934 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -334,7 +334,7 @@ static noinstr void default_do_nmi(struct pt_regs *regs)
 	__this_cpu_write(last_nmi_rip, regs->ip);
 
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 
 	handled = nmi_handle(NMI_LOCAL, regs);
 	__this_cpu_add(nmi_stats.normal, handled);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6f887be..79af913 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -634,7 +634,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 	} else {
 		nmi_enter();
 		instrumentation_begin();
-		trace_hardirqs_off_prepare();
+		trace_hardirqs_off_finish();
 		if (!do_int3(regs))
 			die("int3", regs, 0);
 		if (regs->flags & X86_EFLAGS_IF)
@@ -833,7 +833,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
 {
 	nmi_enter();
 	instrumentation_begin();
-	trace_hardirqs_off_prepare();
+	trace_hardirqs_off_finish();
 	instrumentation_end();
 
 	/*
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index d7f7e43..6384d28 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -32,7 +32,7 @@
 
 #ifdef CONFIG_TRACE_IRQFLAGS
   extern void trace_hardirqs_on_prepare(void);
-  extern void trace_hardirqs_off_prepare(void);
+  extern void trace_hardirqs_off_finish(void);
   extern void trace_hardirqs_on(void);
   extern void trace_hardirqs_off(void);
 # define lockdep_hardirq_context(p)	((p)->hardirq_context)
@@ -101,7 +101,7 @@ do {						\
 
 #else
 # define trace_hardirqs_on_prepare()		do { } while (0)
-# define trace_hardirqs_off_prepare()		do { } while (0)
+# define trace_hardirqs_off_finish()		do { } while (0)
 # define trace_hardirqs_on()		do { } while (0)
 # define trace_hardirqs_off()		do { } while (0)
 # define lockdep_hardirq_context(p)	0
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index fb0691b..f10073e 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -58,7 +58,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on);
  * and lockdep uses a staged approach which splits the lockdep hardirq
  * tracking into an RCU on and an RCU off section.
  */
-void trace_hardirqs_off_prepare(void)
+void trace_hardirqs_off_finish(void)
 {
 	if (!this_cpu_read(tracing_irq_cpu)) {
 		this_cpu_write(tracing_irq_cpu, 1);
@@ -68,19 +68,19 @@ void trace_hardirqs_off_prepare(void)
 	}
 
 }
-EXPORT_SYMBOL(trace_hardirqs_off_prepare);
-NOKPROBE_SYMBOL(trace_hardirqs_off_prepare);
+EXPORT_SYMBOL(trace_hardirqs_off_finish);
+NOKPROBE_SYMBOL(trace_hardirqs_off_finish);
 
 void trace_hardirqs_off(void)
 {
+	lockdep_hardirqs_off(CALLER_ADDR0);
+
 	if (!this_cpu_read(tracing_irq_cpu)) {
 		this_cpu_write(tracing_irq_cpu, 1);
 		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
 		if (!in_nmi())
 			trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
 	}
-
-	lockdep_hardirqs_off(CALLER_ADDR0);
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 NOKPROBE_SYMBOL(trace_hardirqs_off);
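
The enable side mirrors this ordering: the tracer hooks run first and the
lockdep state flips last on the OFF -> ON transition. A simplified sketch
of the counterpart (condensed from the same file; details may differ):

  void trace_hardirqs_on(void)
  {
  	if (this_cpu_read(tracing_irq_cpu)) {
  		if (!in_nmi())
  			trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
  		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
  		this_cpu_write(tracing_irq_cpu, 0);
  	}

  	/* lockdep last on enable, first on disable */
  	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
  	lockdep_hardirqs_on(CALLER_ADDR0);
  }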

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Clarify irq_{enter,exit}_rcu()
  2020-05-29 21:27 ` [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu() Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  2020-06-02 14:42   ` [PATCH 11/14] " Qian Cai
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     b614345f52bcde8299a53132f5e48a9eb5a1f320
Gitweb:        https://git.kernel.org/tip/b614345f52bcde8299a53132f5e48a9eb5a1f320
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:39 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:10 +02:00

x86/entry: Clarify irq_{enter,exit}_rcu()

Because:

  irq_enter_rcu() includes lockdep_hardirq_enter()
  irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()

Which resulted in two 'stray' lockdep_hardirq_exit() calls in
idtentry.h, and me spending a long time trying to find the matching
enter calls.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.359433429@infradead.org

---
 arch/x86/include/asm/idtentry.h |  2 --
 kernel/softirq.c                | 19 +++++++++++++------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index d214a30..f8e2737 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -206,7 +206,6 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	__##func (regs, (u8)error_code);				\
 	irq_exit_rcu();							\
-	lockdep_hardirq_exit();						\
 	instrumentation_end();						\
 	idtentry_exit_cond_rcu(regs, rcu_exit);				\
 }									\
@@ -249,7 +248,6 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	kvm_set_cpu_l1tf_flush_l1d();					\
 	run_on_irqstack_cond(__##func, regs, regs);			\
 	irq_exit_rcu();							\
-	lockdep_hardirq_exit();						\
 	instrumentation_end();						\
 	idtentry_exit_cond_rcu(regs, rcu_exit);				\
 }									\
diff --git a/kernel/softirq.c b/kernel/softirq.c
index beb8e3a..a3eb6eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -404,12 +404,7 @@ static inline void tick_irq_exit(void)
 #endif
 }
 
-/**
- * irq_exit_rcu() - Exit an interrupt context without updating RCU
- *
- * Also processes softirqs if needed and possible.
- */
-void irq_exit_rcu(void)
+static inline void __irq_exit_rcu(void)
 {
 #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
 	local_irq_disable();
@@ -425,6 +420,18 @@ void irq_exit_rcu(void)
 }
 
 /**
+ * irq_exit_rcu() - Exit an interrupt context without updating RCU
+ *
+ * Also processes softirqs if needed and possible.
+ */
+void irq_exit_rcu(void)
+{
+	__irq_exit_rcu();
+	 /* must be last! */
+	lockdep_hardirq_exit();
+}
+
+/**
  * irq_exit - Exit an interrupt context, update RCU and lockdep
  *
  * Also processes softirqs if needed and possible.
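
With this change the pairing at a call site becomes symmetric; roughly
(a generic sketch, not a literal call site):

  irq_enter_rcu();	/* includes lockdep_hardirq_enter() */
  /* ... run the interrupt handler ... */
  irq_exit_rcu();	/* lockdep_hardirq_exit() is now its last step */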

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Remove DBn stacks
  2020-05-29 21:27 ` [PATCH 10/14] x86/entry: Remove DBn stacks Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     0f81407e6e4cf7e878f1e5d6423324dbd966acba
Gitweb:        https://git.kernel.org/tip/0f81407e6e4cf7e878f1e5d6423324dbd966acba
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:38 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:09 +02:00

x86/entry: Remove DBn stacks

Both #DB itself and all other IST users (NMI, #MC) now clear DR7 on
entry. Combined with not allowing breakpoints on entry/noinstr/NOKPROBE
text and no single-stepping (EFLAGS.TF) inside the #DB handler, this
should guarantee no nested #DB.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.303027161@infradead.org

---
 arch/x86/entry/entry_64.S             | 17 -----------------
 arch/x86/include/asm/cpu_entry_area.h | 12 +++---------
 arch/x86/kernel/asm-offsets_64.c      |  3 ---
 arch/x86/kernel/dumpstack_64.c        |  7 ++-----
 arch/x86/mm/cpu_entry_area.c          |  1 -
 5 files changed, 5 insertions(+), 35 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 265ff97..8ecaeee 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -396,11 +396,6 @@ SYM_CODE_END(\asmsym)
 	idtentry \vector asm_\cfunc \cfunc has_error_code=0
 .endm
 
-/*
- * MCE and DB exceptions
- */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + (x) * 8)
-
 /**
  * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
  * @vector:		Vector number
@@ -416,10 +411,6 @@ SYM_CODE_END(\asmsym)
  * If it hits in kernel mode then it needs to go through the paranoid
  * entry as the exception can hit any random state. No preemption
  * check on exit to keep the paranoid path simple.
- *
- * If the trap is #DB then the interrupt stack entry in the IST is
- * moved to the second stack, so a potential recursion will have a
- * fresh IST.
  */
 .macro idtentry_mce_db vector asmsym cfunc
 SYM_CODE_START(\asmsym)
@@ -445,16 +436,8 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
-	.if \vector == X86_TRAP_DB
-		subq	$DB_STACK_OFFSET, CPU_TSS_IST(IST_INDEX_DB)
-	.endif
-
 	call	\cfunc
 
-	.if \vector == X86_TRAP_DB
-		addq	$DB_STACK_OFFSET, CPU_TSS_IST(IST_INDEX_DB)
-	.endif
-
 	jmp	paranoid_exit
 
 	/* Switch to the regular task stack and use the noist entry point */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 02c0078..8902fdb 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -11,15 +11,11 @@
 #ifdef CONFIG_X86_64
 
 /* Macro to enforce the same ordering and stack sizes */
-#define ESTACKS_MEMBERS(guardsize, db2_holesize)\
+#define ESTACKS_MEMBERS(guardsize)		\
 	char	DF_stack_guard[guardsize];	\
 	char	DF_stack[EXCEPTION_STKSZ];	\
 	char	NMI_stack_guard[guardsize];	\
 	char	NMI_stack[EXCEPTION_STKSZ];	\
-	char	DB2_stack_guard[guardsize];	\
-	char	DB2_stack[db2_holesize];	\
-	char	DB1_stack_guard[guardsize];	\
-	char	DB1_stack[EXCEPTION_STKSZ];	\
 	char	DB_stack_guard[guardsize];	\
 	char	DB_stack[EXCEPTION_STKSZ];	\
 	char	MCE_stack_guard[guardsize];	\
@@ -28,12 +24,12 @@
 
 /* The exception stacks' physical storage. No guard pages required */
 struct exception_stacks {
-	ESTACKS_MEMBERS(0, 0)
+	ESTACKS_MEMBERS(0)
 };
 
 /* The effective cpu entry area mapping with guard pages. */
 struct cea_exception_stacks {
-	ESTACKS_MEMBERS(PAGE_SIZE, EXCEPTION_STKSZ)
+	ESTACKS_MEMBERS(PAGE_SIZE)
 };
 
 /*
@@ -42,8 +38,6 @@ struct cea_exception_stacks {
 enum exception_stack_ordering {
 	ESTACK_DF,
 	ESTACK_NMI,
-	ESTACK_DB2,
-	ESTACK_DB1,
 	ESTACK_DB,
 	ESTACK_MCE,
 	N_EXCEPTION_STACKS
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index c2a4701..828be79 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -57,9 +57,6 @@ int main(void)
 	BLANK();
 #undef ENTRY
 
-	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
-	DEFINE(DB_STACK_OFFSET, offsetof(struct cea_exception_stacks, DB_stack) -
-	       offsetof(struct cea_exception_stacks, DB1_stack));
 	BLANK();
 
 #ifdef CONFIG_STACKPROTECTOR
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 460ae7f..4a94d38 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -22,15 +22,13 @@
 static const char * const exception_stack_names[] = {
 		[ ESTACK_DF	]	= "#DF",
 		[ ESTACK_NMI	]	= "NMI",
-		[ ESTACK_DB2	]	= "#DB2",
-		[ ESTACK_DB1	]	= "#DB1",
 		[ ESTACK_DB	]	= "#DB",
 		[ ESTACK_MCE	]	= "#MC",
 };
 
 const char *stack_type_name(enum stack_type type)
 {
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
 
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
@@ -79,7 +77,6 @@ static const
 struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
 	EPAGERANGE(DF),
 	EPAGERANGE(NMI),
-	EPAGERANGE(DB1),
 	EPAGERANGE(DB),
 	EPAGERANGE(MCE),
 };
@@ -91,7 +88,7 @@ static bool in_exception_stack(unsigned long *stack, struct stack_info *info)
 	struct pt_regs *regs;
 	unsigned int k;
 
-	BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
+	BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
 
 	begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
 	/*
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 5199d8a..060f083 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -107,7 +107,6 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu)
 	 */
 	cea_map_stack(DF);
 	cea_map_stack(NMI);
-	cea_map_stack(DB1);
 	cea_map_stack(DB);
 	cea_map_stack(MCE);
 }

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Optimize local_db_save() for virt
  2020-05-29 21:27 ` [PATCH 08/14] x86/entry: Optimize local_db_save() for virt Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  2020-06-03  1:17   ` [PATCH 08/14] " Sean Christopherson
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andy Lutomirski, Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     299a9a21bf913717c0f28ef4ae8b2f0668c7f00a
Gitweb:        https://git.kernel.org/tip/299a9a21bf913717c0f28ef4ae8b2f0668c7f00a
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:36 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:08 +02:00

x86/entry: Optimize local_db_save() for virt

Because DRn access is 'difficult' under virt, while a DR7 read is cheaper
than a cacheline miss on native, add a virt-specific fast path to
local_db_save() so that DRn is not touched at all when breakpoints are
not in use.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.187833200@infradead.org

---
 arch/x86/include/asm/debugreg.h |  5 ++++-
 arch/x86/kernel/hw_breakpoint.c | 26 ++++++++++++++++++++++----
 arch/x86/kvm/vmx/nested.c       |  2 +-
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index 4ef8690..3e1c502 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -85,7 +85,7 @@ static inline void hw_breakpoint_disable(void)
 	set_debugreg(0UL, 3);
 }
 
-static inline int hw_breakpoint_active(void)
+static inline bool hw_breakpoint_active(void)
 {
 	return __this_cpu_read(cpu_dr7) & DR_GLOBAL_ENABLE_MASK;
 }
@@ -117,6 +117,9 @@ static __always_inline unsigned long local_db_save(void)
 {
 	unsigned long dr7;
 
+	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
+		return 0;
+
 	get_debugreg(dr7, 7);
 	dr7 &= ~0x400; /* architecturally set bit */
 	if (dr7)
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index fc1743a..8cdf29f 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -99,6 +99,8 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
 	unsigned long *dr7;
 	int i;
 
+	lockdep_assert_irqs_disabled();
+
 	for (i = 0; i < HBP_NUM; i++) {
 		struct perf_event **slot = this_cpu_ptr(&bp_per_reg[i]);
 
@@ -117,6 +119,12 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
 	dr7 = this_cpu_ptr(&cpu_dr7);
 	*dr7 |= encode_dr7(i, info->len, info->type);
 
+	/*
+	 * Ensure we first write cpu_dr7 before we set the DR7 register.
+	 * This ensures an NMI never sees cpu_dr7 as 0 while DR7 is not.
+	 */
+	barrier();
+
 	set_debugreg(*dr7, 7);
 	if (info->mask)
 		set_dr_addr_mask(info->mask, i);
@@ -136,9 +144,11 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
 void arch_uninstall_hw_breakpoint(struct perf_event *bp)
 {
 	struct arch_hw_breakpoint *info = counter_arch_bp(bp);
-	unsigned long *dr7;
+	unsigned long dr7;
 	int i;
 
+	lockdep_assert_irqs_disabled();
+
 	for (i = 0; i < HBP_NUM; i++) {
 		struct perf_event **slot = this_cpu_ptr(&bp_per_reg[i]);
 
@@ -151,12 +161,20 @@ void arch_uninstall_hw_breakpoint(struct perf_event *bp)
 	if (WARN_ONCE(i == HBP_NUM, "Can't find any breakpoint slot"))
 		return;
 
-	dr7 = this_cpu_ptr(&cpu_dr7);
-	*dr7 &= ~__encode_dr7(i, info->len, info->type);
+	dr7 = this_cpu_read(cpu_dr7);
+	dr7 &= ~__encode_dr7(i, info->len, info->type);
 
-	set_debugreg(*dr7, 7);
+	set_debugreg(dr7, 7);
 	if (info->mask)
 		set_dr_addr_mask(0, i);
+
+	/*
+	 * Ensure the write to cpu_dr7 is after we've set the DR7 register.
+	 * This ensures an NMI never sees cpu_dr7 as 0 while DR7 is not.
+	 */
+	barrier();
+
+	this_cpu_write(cpu_dr7, dr7);
 }
 
 static int arch_bp_generic_len(int x86_len)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index e44f33c..9b40c6a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3028,9 +3028,9 @@ static int nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
 	/*
 	 * VMExit clears RFLAGS.IF and DR7, even on a consistency check.
 	 */
-	local_irq_enable();
 	if (hw_breakpoint_active())
 		set_debugreg(__this_cpu_read(cpu_dr7), 7);
+	local_irq_enable();
 	preempt_enable();
 
 	/*
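
The barrier() placement in the hw_breakpoint.c hunks above is what lets
the NMI-side fast path trust cpu_dr7 as a conservative summary of the
hardware state; roughly, on the reader side (a sketch restating
local_db_save(), not new code):

  /* NMI path, via local_db_save(): */
  if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
  	return 0;	/* safe only because cpu_dr7 is set before DR7
  			   and cleared after it, hence the barrier()s */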

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Remove debug IDT frobbing
  2020-05-29 21:27 ` [PATCH 09/14] x86/entry: Remove debug IDT frobbing Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     8449e768dcb85b4d8db51482d8c9260bb05ccabc
Gitweb:        https://git.kernel.org/tip/8449e768dcb85b4d8db51482d8c9260bb05ccabc
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:37 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:09 +02:00

x86/entry: Remove debug IDT frobbing

This is all unused now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.245019500@infradead.org

---
 arch/x86/include/asm/debugreg.h | 19 +------------------
 arch/x86/include/asm/desc.h     | 34 +--------------------------------
 arch/x86/kernel/cpu/common.c    | 17 +----------------
 arch/x86/kernel/idt.c           | 30 +----------------------------
 arch/x86/kernel/traps.c         |  9 +--------
 5 files changed, 1 insertion(+), 108 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index 3e1c502..42fc35d 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -94,25 +94,6 @@ extern void aout_dump_debugregs(struct user *dump);
 
 extern void hw_breakpoint_restore(void);
 
-#ifdef CONFIG_X86_64
-DECLARE_PER_CPU(int, debug_stack_usage);
-static inline void debug_stack_usage_inc(void)
-{
-	__this_cpu_inc(debug_stack_usage);
-}
-static inline void debug_stack_usage_dec(void)
-{
-	__this_cpu_dec(debug_stack_usage);
-}
-void debug_stack_set_zero(void);
-void debug_stack_reset(void);
-#else /* !X86_64 */
-static inline void debug_stack_set_zero(void) { }
-static inline void debug_stack_reset(void) { }
-static inline void debug_stack_usage_inc(void) { }
-static inline void debug_stack_usage_dec(void) { }
-#endif /* X86_64 */
-
 static __always_inline unsigned long local_db_save(void)
 {
 	unsigned long dr7;
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index d6c3d34..07632f3 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -42,8 +42,6 @@ static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *in
 
 extern struct desc_ptr idt_descr;
 extern gate_desc idt_table[];
-extern const struct desc_ptr debug_idt_descr;
-extern gate_desc debug_idt_table[];
 
 struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
@@ -390,31 +388,6 @@ void alloc_intr_gate(unsigned int n, const void *addr);
 
 extern unsigned long system_vectors[];
 
-#ifdef CONFIG_X86_64
-DECLARE_PER_CPU(u32, debug_idt_ctr);
-static __always_inline bool is_debug_idt_enabled(void)
-{
-	if (this_cpu_read(debug_idt_ctr))
-		return true;
-
-	return false;
-}
-
-static __always_inline void load_debug_idt(void)
-{
-	load_idt((const struct desc_ptr *)&debug_idt_descr);
-}
-#else
-static inline bool is_debug_idt_enabled(void)
-{
-	return false;
-}
-
-static inline void load_debug_idt(void)
-{
-}
-#endif
-
 /*
  * The load_current_idt() must be called with interrupts disabled
  * to avoid races. That way the IDT will always be set back to the expected
@@ -424,10 +397,7 @@ static inline void load_debug_idt(void)
  */
 static __always_inline void load_current_idt(void)
 {
-	if (is_debug_idt_enabled())
-		load_debug_idt();
-	else
-		load_idt((const struct desc_ptr *)&idt_descr);
+	load_idt((const struct desc_ptr *)&idt_descr);
 }
 
 extern void idt_setup_early_handler(void);
@@ -438,11 +408,9 @@ extern void idt_setup_apic_and_irq_gates(void);
 #ifdef CONFIG_X86_64
 extern void idt_setup_early_pf(void);
 extern void idt_setup_ist_traps(void);
-extern void idt_setup_debugidt_traps(void);
 #else
 static inline void idt_setup_early_pf(void) { }
 static inline void idt_setup_ist_traps(void) { }
-static inline void idt_setup_debugidt_traps(void) { }
 #endif
 
 extern void idt_invalidate(void *addr);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6751b81..c55be3b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1689,23 +1689,6 @@ void syscall_init(void)
 	       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
 }
 
-DEFINE_PER_CPU(int, debug_stack_usage);
-DEFINE_PER_CPU(u32, debug_idt_ctr);
-
-noinstr void debug_stack_set_zero(void)
-{
-	this_cpu_inc(debug_idt_ctr);
-	load_current_idt();
-}
-
-noinstr void debug_stack_reset(void)
-{
-	if (WARN_ON(!this_cpu_read(debug_idt_ctr)))
-		return;
-	if (this_cpu_dec_return(debug_idt_ctr) == 0)
-		load_current_idt();
-}
-
 #else	/* CONFIG_X86_64 */
 
 DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index bc9b0d1..226c992 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -158,14 +158,6 @@ static const __initconst struct idt_data apic_idts[] = {
 static const __initconst struct idt_data early_pf_idts[] = {
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 };
-
-/*
- * Override for the debug_idt. Same as the default, but with interrupt
- * stack set to DEFAULT_STACK (0). Required for NMI trap handling.
- */
-static const __initconst struct idt_data dbg_idts[] = {
-	INTG(X86_TRAP_DB,		asm_exc_debug),
-};
 #endif
 
 /* Must be page-aligned because the real IDT is used in a fixmap. */
@@ -177,9 +169,6 @@ struct desc_ptr idt_descr __ro_after_init = {
 };
 
 #ifdef CONFIG_X86_64
-/* No need to be aligned, but done to keep all IDTs defined the same way. */
-gate_desc debug_idt_table[IDT_ENTRIES] __page_aligned_bss;
-
 /*
  * The exceptions which use Interrupt stacks. They are setup after
  * cpu_init() when the TSS has been initialized.
@@ -192,15 +181,6 @@ static const __initconst struct idt_data ist_idts[] = {
 	ISTG(X86_TRAP_MC,	asm_exc_machine_check,	IST_INDEX_MCE),
 #endif
 };
-
-/*
- * Override for the debug_idt. Same as the default, but with interrupt
- * stack set to DEFAULT_STACK (0). Required for NMI trap handling.
- */
-const struct desc_ptr debug_idt_descr = {
-	.size		= IDT_ENTRIES * 16 - 1,
-	.address	= (unsigned long) debug_idt_table,
-};
 #endif
 
 static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
@@ -292,16 +272,6 @@ void __init idt_setup_ist_traps(void)
 {
 	idt_setup_from_table(idt_table, ist_idts, ARRAY_SIZE(ist_idts), true);
 }
-
-/**
- * idt_setup_debugidt_traps - Initialize the debug idt table with debug traps
- */
-void __init idt_setup_debugidt_traps(void)
-{
-	memcpy(&debug_idt_table, &idt_table, IDT_ENTRIES * 16);
-
-	idt_setup_from_table(debug_idt_table, dbg_idts, ARRAY_SIZE(dbg_idts), false);
-}
 #endif
 
 /**
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index bcb9dd9..6f887be 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -798,12 +798,6 @@ static void noinstr handle_debug(struct pt_regs *regs, unsigned long dr6,
 		return;
 	}
 
-	/*
-	 * Let others (NMI) know that the debug stack is in use
-	 * as we may switch to the interrupt stack.
-	 */
-	debug_stack_usage_inc();
-
 	/* It's safe to allow irq's after DR6 has been saved */
 	cond_local_irq_enable(regs);
 
@@ -831,7 +825,6 @@ static void noinstr handle_debug(struct pt_regs *regs, unsigned long dr6,
 
 out:
 	cond_local_irq_disable(regs);
-	debug_stack_usage_dec();
 	instrumentation_end();
 }
 
@@ -1077,6 +1070,4 @@ void __init trap_init(void)
 	cpu_init();
 
 	idt_setup_ist_traps();
-
-	idt_setup_debugidt_traps();
 }

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry, mce: Disallow #DB during #MC
  2020-05-29 21:27 ` [PATCH 07/14] x86/entry, mce: Disallow #DB during #MC Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     ff98610a03285516b578821549973f969118d6a3
Gitweb:        https://git.kernel.org/tip/ff98610a03285516b578821549973f969118d6a3
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:35 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:08 +02:00

x86/entry, mce: Disallow #DB during #MC

#MC is fragile as heck, don't tempt fate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.131187767@infradead.org

---
 arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 068e6ca..be49926 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1943,22 +1943,34 @@ static __always_inline void exc_machine_check_user(struct pt_regs *regs)
 /* MCE hit kernel mode */
 DEFINE_IDTENTRY_MCE(exc_machine_check)
 {
+	unsigned long dr7;
+
+	dr7 = local_db_save();
 	exc_machine_check_kernel(regs);
+	local_db_restore(dr7);
 }
 
 /* The user mode variant. */
 DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 {
+	unsigned long dr7;
+
+	dr7 = local_db_save();
 	exc_machine_check_user(regs);
+	local_db_restore(dr7);
 }
 #else
 /* 32bit unified entry point */
 DEFINE_IDTENTRY_MCE(exc_machine_check)
 {
+	unsigned long dr7;
+
+	dr7 = local_db_save();
 	if (user_mode(regs))
 		exc_machine_check_user(regs);
 	else
 		exc_machine_check_kernel(regs);
+	local_db_restore(dr7);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry, nmi: Disable #DB
  2020-05-29 21:27 ` [PATCH 06/14] x86/entry, nmi: Disable #DB Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     af87e4c4d65b2008709efcfb7657551f1c62a98b
Gitweb:        https://git.kernel.org/tip/af87e4c4d65b2008709efcfb7657551f1c62a98b
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:34 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:07 +02:00

x86/entry, nmi: Disable #DB

Instead of playing stupid games with IST stacks, fully disallow #DB
during NMIs. There is absolutely no reason to allow them, and killing
this saves a heap of trouble.

#DB is already forbidden on noinstr and CEA, so there can't be a #DB before
this. Disabling it right after nmi_enter() ensures that the full NMI code
is protected.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.069223695@infradead.org

---
 arch/x86/kernel/nmi.c | 55 ++----------------------------------------
 1 file changed, 3 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 1c58454..52a708e 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -478,40 +478,7 @@ enum nmi_states {
 };
 static DEFINE_PER_CPU(enum nmi_states, nmi_state);
 static DEFINE_PER_CPU(unsigned long, nmi_cr2);
-
-#ifdef CONFIG_X86_64
-/*
- * In x86_64, we need to handle breakpoint -> NMI -> breakpoint.  Without
- * some care, the inner breakpoint will clobber the outer breakpoint's
- * stack.
- *
- * If a breakpoint is being processed, and the debug stack is being
- * used, if an NMI comes in and also hits a breakpoint, the stack
- * pointer will be set to the same fixed address as the breakpoint that
- * was interrupted, causing that stack to be corrupted. To handle this
- * case, check if the stack that was interrupted is the debug stack, and
- * if so, change the IDT so that new breakpoints will use the current
- * stack and not switch to the fixed address. On return of the NMI,
- * switch back to the original IDT.
- */
-static DEFINE_PER_CPU(int, update_debug_stack);
-
-static noinstr bool is_debug_stack(unsigned long addr)
-{
-	struct cea_exception_stacks *cs = __this_cpu_read(cea_exception_stacks);
-	unsigned long top = CEA_ESTACK_TOP(cs, DB);
-	unsigned long bot = CEA_ESTACK_BOT(cs, DB1);
-
-	if (__this_cpu_read(debug_stack_usage))
-		return true;
-	/*
-	 * Note, this covers the guard page between DB and DB1 as well to
-	 * avoid two checks. But by all means @addr can never point into
-	 * the guard page.
-	 */
-	return addr >= bot && addr < top;
-}
-#endif
+static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_NMI(exc_nmi)
 {
@@ -526,18 +493,7 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 	this_cpu_write(nmi_cr2, read_cr2());
 nmi_restart:
 
-#ifdef CONFIG_X86_64
-	/*
-	 * If we interrupted a breakpoint, it is possible that
-	 * the nmi handler will have breakpoints too. We need to
-	 * change the IDT such that breakpoints that happen here
-	 * continue to use the NMI stack.
-	 */
-	if (unlikely(is_debug_stack(regs->sp))) {
-		debug_stack_set_zero();
-		this_cpu_write(update_debug_stack, 1);
-	}
-#endif
+	this_cpu_write(nmi_dr7, local_db_save());
 
 	nmi_enter();
 
@@ -548,12 +504,7 @@ nmi_restart:
 
 	nmi_exit();
 
-#ifdef CONFIG_X86_64
-	if (unlikely(this_cpu_read(update_debug_stack))) {
-		debug_stack_reset();
-		this_cpu_write(update_debug_stack, 0);
-	}
-#endif
+	local_db_restore(this_cpu_read(nmi_dr7));
 
 	if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
 		write_cr2(this_cpu_read(nmi_cr2));

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Introduce local_db_{save,restore}()
  2020-05-29 21:27 ` [PATCH 05/14] x86/entry: Introduce local_db_{save,restore}() Peter Zijlstra
@ 2020-05-30  9:57   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-30  9:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     57234891b3287b986e003876f906d95c9871e62e
Gitweb:        https://git.kernel.org/tip/57234891b3287b986e003876f906d95c9871e62e
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 29 May 2020 23:27:33 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 30 May 2020 10:00:07 +02:00

x86/entry: Introduce local_db_{save,restore}()

In order to allow exceptions other than #DB to disable breakpoints,
provide common helpers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.012060983@infradead.org

---
 arch/x86/include/asm/debugreg.h | 30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c         | 18 ++----------------
 2 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index 1a8609a..4ef8690 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -113,6 +113,36 @@ static inline void debug_stack_usage_inc(void) { }
 static inline void debug_stack_usage_dec(void) { }
 #endif /* X86_64 */
 
+static __always_inline unsigned long local_db_save(void)
+{
+	unsigned long dr7;
+
+	get_debugreg(dr7, 7);
+	dr7 &= ~0x400; /* architecturally set bit */
+	if (dr7)
+		set_debugreg(0, 7);
+	/*
+	 * Ensure the compiler doesn't lower the above statements into
+	 * the critical section; disabling breakpoints late would not
+	 * be good.
+	 */
+	barrier();
+
+	return dr7;
+}
+
+static __always_inline void local_db_restore(unsigned long dr7)
+{
+	/*
+	 * Ensure the compiler doesn't raise this statement into
+	 * the critical section; enabling breakpoints early would
+	 * not be good.
+	 */
+	barrier();
+	if (dr7)
+		set_debugreg(dr7, 7);
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 extern void set_dr_addr_mask(unsigned long mask, int dr);
 #else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 50fb9cd..bcb9dd9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -727,15 +727,7 @@ static __always_inline void debug_enter(unsigned long *dr6, unsigned long *dr7)
 	 * Entry text is excluded for HW_BP_X, and cpu_entry_area, which
 	 * includes the entry stack, is excluded for everything.
 	 */
-	get_debugreg(*dr7, 7);
-	set_debugreg(0, 7);
-
-	/*
-	 * Ensure the compiler doesn't lower the above statements into
-	 * the critical section; disabling breakpoints late would not
-	 * be good.
-	 */
-	barrier();
+	*dr7 = local_db_save();
 
 	/*
 	 * The Intel SDM says:
@@ -756,13 +748,7 @@ static __always_inline void debug_enter(unsigned long *dr6, unsigned long *dr7)
 
 static __always_inline void debug_exit(unsigned long dr7)
 {
-	/*
-	 * Ensure the compiler doesn't raise this statement into
-	 * the critical section; enabling breakpoints early would
-	 * not be good.
-	 */
-	barrier();
-	set_debugreg(dr7, 7);
+	local_db_restore(dr7);
 }
 
 /*
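
Other fragile contexts can then use the same pattern; the #MC and NMI
patches later in this series do, in essence (sketch):

  unsigned long dr7 = local_db_save();
  /* ... work that must not be interrupted by #DB ... */
  local_db_restore(dr7);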

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT
  2020-05-29 21:27 ` [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT Peter Zijlstra
@ 2020-05-30 12:45   ` Andrew Cooper
  2020-05-30 15:15     ` Lai Jiangshan
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2020-05-30 12:45 UTC (permalink / raw)
  To: Peter Zijlstra, tglx, luto
  Cc: linux-kernel, x86, Lai Jiangshan, sean.j.christopherson,
	daniel.thompson, a.darwish, rostedt, bigeasy

On 29/05/2020 22:27, Peter Zijlstra wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> A data breakpoint on the GDT is terrifying and should be avoided.
> The GDT in the CPU entry area is already protected. The direct GDT
> should also be protected, although it is seldom used and only
> used for a short time.

While I agree with the sentiment...

>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20200526014221.2119-3-laijs@linux.alibaba.com
> ---
>  arch/x86/kernel/hw_breakpoint.c |   30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/kernel/hw_breakpoint.c
> +++ b/arch/x86/kernel/hw_breakpoint.c
> @@ -32,6 +32,7 @@
>  #include <asm/processor.h>
>  #include <asm/debugreg.h>
>  #include <asm/user.h>
> +#include <asm/desc.h>
>  
>  /* Per cpu debug control register value */
>  DEFINE_PER_CPU(unsigned long, cpu_dr7);
> @@ -237,13 +238,26 @@ static inline bool within_area(unsigned
>  }
>  
>  /*
> - * Checks whether the range from addr to end, inclusive, overlaps the CPU
> - * entry area range.
> + * Checks whether the range from addr to end, inclusive, overlaps the fixed
> + * mapped CPU entry area range or other ranges used for CPU entry.
>   */
> -static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
> +static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
>  {
> -	return within_area(addr, end, CPU_ENTRY_AREA_BASE,
> -			   CPU_ENTRY_AREA_TOTAL_SIZE);
> +	int cpu;
> +
> +	/* CPU entry erea is always used for CPU entry */
> +	if (within_area(addr, end, CPU_ENTRY_AREA_BASE,
> +			CPU_ENTRY_AREA_TOTAL_SIZE))
> +		return true;
> +
> +	for_each_possible_cpu(cpu) {
> +		/* The original rw GDT is being used after load_direct_gdt() */
> +		if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
> +				GDT_SIZE))

... why the O(n) loop over the system?

It is only GDTs which might ever be active on this local CPU(/thread)
which are a problem, because the breakpoint registers are similarly local.

Nothing is going to go wrong if I put a breakpoint on someone else's
live GDT, because they won't interact in the "fun" ways we're trying to
avoid.
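
A local-only check along those lines might look like this (hypothetical
sketch; within_local_gdt() is not an existing helper):

  static inline bool within_local_gdt(unsigned long addr, unsigned long end)
  {
  	/* Only the GDT that can be live on this CPU matters here */
  	return within_area(addr, end,
  			   (unsigned long)get_cpu_gdt_rw(smp_processor_id()),
  			   GDT_SIZE);
  }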

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT
  2020-05-30 12:45   ` Andrew Cooper
@ 2020-05-30 15:15     ` Lai Jiangshan
  0 siblings, 0 replies; 35+ messages in thread
From: Lai Jiangshan @ 2020-05-30 15:15 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Peter Zijlstra, Thomas Gleixner, Andy Lutomirski, LKML, X86 ML,
	Lai Jiangshan, Sean Christopherson, Daniel Thompson, a.darwish,
	Steven Rostedt, Sebastian Andrzej Siewior

On Sat, May 30, 2020 at 8:48 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 29/05/2020 22:27, Peter Zijlstra wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> >
> > A data breakpoint on the GDT is terrifying and should be avoided.
> > The GDT in the CPU entry area is already protected. The direct GDT
> > should also be protected, although it is seldom used and only
> > used for a short time.
>
> While I agree with the sentiment...
>
> >
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20200526014221.2119-3-laijs@linux.alibaba.com
> > ---
> >  arch/x86/kernel/hw_breakpoint.c |   30 ++++++++++++++++++++++--------
> >  1 file changed, 22 insertions(+), 8 deletions(-)
> >
> > --- a/arch/x86/kernel/hw_breakpoint.c
> > +++ b/arch/x86/kernel/hw_breakpoint.c
> > @@ -32,6 +32,7 @@
> >  #include <asm/processor.h>
> >  #include <asm/debugreg.h>
> >  #include <asm/user.h>
> > +#include <asm/desc.h>
> >
> >  /* Per cpu debug control register value */
> >  DEFINE_PER_CPU(unsigned long, cpu_dr7);
> > @@ -237,13 +238,26 @@ static inline bool within_area(unsigned
> >  }
> >
> >  /*
> > - * Checks whether the range from addr to end, inclusive, overlaps the CPU
> > - * entry area range.
> > + * Checks whether the range from addr to end, inclusive, overlaps the fixed
> > + * mapped CPU entry area range or other ranges used for CPU entry.
> >   */
> > -static inline bool within_cpu_entry_area(unsigned long addr, unsigned long end)
> > +static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
> >  {
> > -     return within_area(addr, end, CPU_ENTRY_AREA_BASE,
> > -                        CPU_ENTRY_AREA_TOTAL_SIZE);
> > +     int cpu;
> > +
> > +     /* CPU entry erea is always used for CPU entry */
> > +     if (within_area(addr, end, CPU_ENTRY_AREA_BASE,
> > +                     CPU_ENTRY_AREA_TOTAL_SIZE))
> > +             return true;
> > +
> > +     for_each_possible_cpu(cpu) {
> > +             /* The original rw GDT is being used after load_direct_gdt() */
> > +             if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
> > +                             GDT_SIZE))
>
> ... why the O(n) loop over the system?
>
> It is only GDTs which might ever be active on this local CPU(/thread)
> which are a problem, because the breakpoint registers are similarly local.
>
> Nothing is going to go wrong if I put a breakpoint on someone else's
> live GDT, because they won't interact in the "fun" ways we're trying to
> avoid.

Hello

Checking only the local CPU's GDT, as you suggest, could help
find bugs where some CPUs access the wrong GDT, and it avoids
the O(nr_cpus) loop.

However, it would require refactoring hw_breakpoint.c to some
extent: some breakpoints would be allowed at install time but
then filtered out on certain CPUs, without causing any confusion
in the generic hw_breakpoint.c and perf event code, and the
handlers would need to know that they may lose some events in
some cases.

But the current code doesn't have such a framework yet, so we
have to block them directly IMHO.

Thanks
Lai

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()
  2020-05-29 21:27 ` [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu() Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
@ 2020-06-02 14:42   ` Qian Cai
  2020-06-02 15:05     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Qian Cai @ 2020-06-02 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, rostedt, bigeasy, Michael Ellerman, linuxppc-dev

On Fri, May 29, 2020 at 11:27:39PM +0200, Peter Zijlstra wrote:
> Because:
> 
>   irq_enter_rcu() includes lockdep_hardirq_enter()
>   irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()
> 
> Which resulted in two 'stray' lockdep_hardirq_exit() calls in
> idtentry.h, and me spending a long time trying to find the matching
> enter calls.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/idtentry.h |    2 --
>  kernel/softirq.c                |   19 +++++++++++++------
>  2 files changed, 13 insertions(+), 8 deletions(-)
> 
[]
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -404,12 +404,7 @@ static inline void tick_irq_exit(void)
>  #endif
>  }
>  
> -/**
> - * irq_exit_rcu() - Exit an interrupt context without updating RCU
> - *
> - * Also processes softirqs if needed and possible.
> - */
> -void irq_exit_rcu(void)
> +static inline void __irq_exit_rcu(void)
>  {
>  #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
>  	local_irq_disable();
> @@ -425,6 +420,18 @@ void irq_exit_rcu(void)
>  }
>  
>  /**
> + * irq_exit_rcu() - Exit an interrupt context without updating RCU
> + *
> + * Also processes softirqs if needed and possible.
> + */
> +void irq_exit_rcu(void)
> +{
> +	__irq_exit_rcu();
> +	 /* must be last! */
> +	lockdep_hardirq_exit();
> +}
> +
> +/**
>   * irq_exit - Exit an interrupt context, update RCU and lockdep
>   *
>   * Also processes softirqs if needed and possible.
> 
>

Reverting this commit fixed the POWER9 boot warning:

[    0.005196][    T0] clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 0x761537d007, max_idle_ns: 440795202126 ns
[    0.012502][    T0] clocksource: timebase mult[1f40000] shift[24] registered
[    0.030273][    T0] ------------[ cut here ]------------
[    0.034421][    T0] DEBUG_LOCKS_WARN_ON(current->hardirq_context)
[    0.034433][    T0] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3680 lockdep_hardirqs_on_prepare+0x29c/0x2d0
[    0.045874][    T0] Modules linked in:
[    0.047977][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-next-20200602 #1
[    0.053187][    T0] NIP:  c0000000001d2fec LR: c0000000001d2fe8 CTR: c00000000074b0a0
[    0.057395][    T0] REGS: c00000000130f810 TRAP: 0700   Not tainted  (5.7.0-next-20200602)
[    0.062614][    T0] MSR:  9000000000021033 <SF,HV,ME,IR,DR,RI,LE>  CR: 48000422  XER: 20040000
[    0.069856][    T0] CFAR: c00000000010e448 IRQMASK: 1
[    0.069856][    T0] GPR00: c0000000001d2fe8 c00000000130faa0 c00000000130aa00 000000000000002d
[    0.069856][    T0] GPR04: c00000000133c3b0 000000000000000d 000000006e6f635f 72727563284e4f5f
[    0.069856][    T0] GPR08: 0000000000000002 c000000000dcf230 0000000000000001 c0000000012b0280
[    0.069856][    T0] GPR12: 0000000000000000 c0000000057b0000 0000000000000000 0000000000000000
[    0.069856][    T0] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.069856][    T0] GPR20: 0000000000000000 0000000000000001 0000000010004d9c 00000000100053ed
[    0.069856][    T0] GPR24: 0000000010005411 0000000000000001 0000000000000002 0000000000000003
[    0.069856][    T0] GPR28: 0000000000000000 0000000000000000 0000000000000000 c000000003e3b008
[    0.117846][    T0] NIP [c0000000001d2fec] lockdep_hardirqs_on_prepare+0x29c/0x2d0
[    0.123052][    T0] LR [c0000000001d2fe8] lockdep_hardirqs_on_prepare+0x298/0x2d0
[    0.127248][    T0] Call Trace:
[    0.129337][    T0] [c00000000130faa0] [c0000000001d2fe8] lockdep_hardirqs_on_prepare+0x298/0x2d0 (unreliable)
[    0.137613][    T0] [c00000000130fb10] [c0000000002d3834] trace_hardirqs_on+0x94/0x230
trace_hardirqs_on at kernel/trace/trace_preemptirq.c:49
[    0.141824][    T0] [c00000000130fb60] [c000000000039100] interrupt_exit_kernel_prepare+0x110/0x1f0
interrupt_exit_kernel_prepare at arch/powerpc/kernel/syscall_64.c:337
[    0.148069][    T0] [c00000000130fbc0] [c00000000000f328] interrupt_return+0x118/0x1c0
[    0.152281][    T0] --- interrupt: 900 at arch_local_irq_restore+0xc0/0xd0
arch_local_irq_restore at arch/powerpc/kernel/irq.c:367
(inlined by) arch_local_irq_restore at arch/powerpc/kernel/irq.c:318
[    0.152281][    T0]     LR = start_kernel+0x7f0/0x9dc
[    0.153579][    T0] [c00000000130fec0] [c000000001208fa8] init_on_free+0x0/0x2b0 (unreliable)
[    0.159810][    T0] [c00000000130fee0] [c000000000c845c8] start_kernel+0x7e4/0x9dc
start_kernel at init/main.c:961 (discriminator 3)
[    0.165017][    T0] [c00000000130ff90] [c00000000000c890] start_here_common+0x1c/0x8c
[    0.169224][    T0] Instruction dump:
[    0.171324][    T0] 0fe00000 e8010080 ebc10060 ebe10068 7c0803a6 4bfffe7c 3c82ff8b 3c62ff8a
[    0.177558][    T0] 38848808 3863e460 4bf3b3fd 60000000 <0fe00000> e8010080 ebc10060 ebe10068
[    0.183796][    T0] irq event stamp: 16
[    0.186904][    T0] hardirqs last  enabled at (14): [<c00000000020cf14>] rcu_core+0x9a4/0xbe0
[    0.191130][    T0] hardirqs last disabled at (15): [<c000000000a39944>] __do_softirq+0x5d4/0x8d8
[    0.195365][    T0] softirqs last  enabled at (16): [<c000000000a399c8>] __do_softirq+0x658/0x8d8
[    0.201606][    T0] softirqs last disabled at (5): [<c00000000011cbbc>] irq_exit+0x17c/0x1c0
[    0.206832][    T0] ---[ end trace 339d75c2056bfda1 ]---
[    0.208990][    T0] printk: console [hvc0] enabled
[    0.208990][    T0] printk: console [hvc0] enabled

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()
  2020-06-02 14:42   ` [PATCH 11/14] " Qian Cai
@ 2020-06-02 15:05     ` Peter Zijlstra
  2020-06-02 18:47       ` Qian Cai
  2020-06-03 17:50       ` [tip: x86/entry] x86/entry: Use __irq_exit_rcu() in irq_exit() tip-bot2 for Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-06-02 15:05 UTC (permalink / raw)
  To: Qian Cai
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, rostedt, bigeasy, Michael Ellerman, linuxppc-dev

On Tue, Jun 02, 2020 at 10:42:35AM -0400, Qian Cai wrote:

> Reverting this commit fixed the POWER9 boot warning:

ARGH, I'm an idiot. Please try this instead:


diff --git a/kernel/softirq.c b/kernel/softirq.c
index a3eb6eba8c41..c4201b7f42b1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -438,7 +438,7 @@ void irq_exit_rcu(void)
  */
 void irq_exit(void)
 {
-	irq_exit_rcu();
+	__irq_exit_rcu();
 	rcu_irq_exit();
 	 /* must be last! */
 	lockdep_hardirq_exit();
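
For reference, the structure this leaves in kernel/softirq.c looks
roughly like the sketch below (condensed from the hunks in this thread,
not the verbatim file). It also shows why the previous version warned:
irq_exit() was calling irq_exit_rcu(), so lockdep_hardirq_exit() ran
twice per interrupt exit, leaving current->hardirq_context unbalanced,
which matches the DEBUG_LOCKS_WARN_ON(current->hardirq_context) in the
POWER9 splat.

	static inline void __irq_exit_rcu(void)
	{
		/* irq accounting and softirq processing; no lockdep exit */
	}

	void irq_exit_rcu(void)
	{
		__irq_exit_rcu();
		/* must be last! */
		lockdep_hardirq_exit();
	}

	void irq_exit(void)
	{
		__irq_exit_rcu();	/* not irq_exit_rcu(): that exits lockdep twice */
		rcu_irq_exit();
		/* must be last! */
		lockdep_hardirq_exit();
	}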



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()
  2020-06-02 15:05     ` Peter Zijlstra
@ 2020-06-02 18:47       ` Qian Cai
  2020-06-03 17:50       ` [tip: x86/entry] x86/entry: Use __irq_exit_rcu() in irq_exit() tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Qian Cai @ 2020-06-02 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, rostedt, bigeasy, Michael Ellerman, linuxppc-dev

On Tue, Jun 02, 2020 at 05:05:11PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 02, 2020 at 10:42:35AM -0400, Qian Cai wrote:
> 
> > Reverting this commit fixed the POWER9 boot warning:
> 
> ARGH, I'm an idiot. Please try this instead:
>
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index a3eb6eba8c41..c4201b7f42b1 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -438,7 +438,7 @@ void irq_exit_rcu(void)
>   */
>  void irq_exit(void)
>  {
> -	irq_exit_rcu();
> +	__irq_exit_rcu();
>  	rcu_irq_exit();
>  	 /* must be last! */
>  	lockdep_hardirq_exit();

This works fine.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking
  2020-05-29 22:25     ` Peter Zijlstra
  2020-05-29 22:28       ` Steven Rostedt
  2020-05-29 22:33       ` Peter Zijlstra
@ 2020-06-02 20:00       ` Peter Zijlstra
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2020-06-02 20:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan,
	sean.j.christopherson, andrew.cooper3, daniel.thompson,
	a.darwish, bigeasy

On Sat, May 30, 2020 at 12:25:05AM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2020 at 06:14:01PM -0400, Steven Rostedt wrote:

> > Why remove the check for debug_locks? Isn't that there to disable
> > everything at once to prevent more warnings to be printed?
> 
> Yeah, maybe. I was thinking we could keep IRQ state running. But you're
> right, if we mess up the IRQ state itself this might generate a wee
> mess.

How's this then?

---
Subject: lockdep: Prepare for NMI IRQ state tracking
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 27 15:00:57 CEST 2020

There is no reason not to always, accurately, track IRQ state.

This change also makes IRQ state tracking ignore lockdep_off().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/lockdep.c |   44 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3646,7 +3646,16 @@ static void __trace_hardirqs_on_caller(v
  */
 void lockdep_hardirqs_on_prepare(unsigned long ip)
 {
-	if (unlikely(!debug_locks || current->lockdep_recursion))
+	if (unlikely(!debug_locks))
+		return;
+
+	/*
+	 * NMIs do not (and cannot) track lock dependencies, nothing to do.
+	 */
+	if (unlikely(in_nmi()))
+		return;
+
+	if (unlikely(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
 		return;
 
 	if (unlikely(current->hardirqs_enabled)) {
@@ -3692,7 +3701,27 @@ void noinstr lockdep_hardirqs_on(unsigne
 {
 	struct task_struct *curr = current;
 
-	if (unlikely(!debug_locks || curr->lockdep_recursion))
+	if (unlikely(!debug_locks))
+		return;
+
+	/*
+	 * NMIs can happen in the middle of local_irq_{en,dis}able() where the
+	 * tracking state and hardware state are out of sync.
+	 *
+	 * NMIs must save lockdep_hardirqs_enabled() to restore IRQ state from,
+	 * and not rely on hardware state like normal interrupts.
+	 */
+	if (unlikely(in_nmi())) {
+		/*
+		 * Skip:
+		 *  - recursion check, because NMI can hit lockdep;
+		 *  - hardware state check, because above;
+		 *  - chain_key check, see lockdep_hardirqs_on_prepare().
+		 */
+		goto skip_checks;
+	}
+
+	if (unlikely(current->lockdep_recursion & LOCKDEP_RECURSION_MASK))
 		return;
 
 	if (curr->hardirqs_enabled) {
@@ -3720,6 +3749,7 @@ void noinstr lockdep_hardirqs_on(unsigne
 	DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key !=
 			    current->curr_chain_key);
 
+skip_checks:
 	/* we'll do an OFF -> ON transition: */
 	curr->hardirqs_enabled = 1;
 	curr->hardirq_enable_ip = ip;
@@ -3735,7 +3765,15 @@ void noinstr lockdep_hardirqs_off(unsign
 {
 	struct task_struct *curr = current;
 
-	if (unlikely(!debug_locks || curr->lockdep_recursion))
+	if (unlikely(!debug_locks))
+		return;
+
+	/*
+	 * Matching lockdep_hardirqs_on(), allow NMIs in the middle of lockdep;
+	 * they will restore the software state. This ensures the software
+	 * state is consistent inside NMIs as well.
+	 */
+	if (unlikely(!in_nmi() && (current->lockdep_recursion & LOCKDEP_RECURSION_MASK)))
 		return;
 
 	/*
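
The save/restore requirement from the comment above, seen from the NMI
entry side, is roughly the following. This is an illustrative sketch
only, not the actual x86 entry code; lockdep_hardirqs_enabled() is
assumed here as the accessor for lockdep's software state, and
CALLER_ADDR0 as the ip argument:

	/* NMI entry: save lockdep's software IRQ state, then mark IRQs off */
	bool irq_state = lockdep_hardirqs_enabled(current);
	lockdep_hardirqs_off(CALLER_ADDR0);

	/* ... NMI body ... */

	/* NMI exit: restore from the saved software state, never from the
	 * hardware state, which may be mid-transition under the NMI. */
	if (irq_state)
		lockdep_hardirqs_on(CALLER_ADDR0);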

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 08/14] x86/entry: Optimize local_db_save() for virt
  2020-05-29 21:27 ` [PATCH 08/14] x86/entry: Optimize local_db_save() for virt Peter Zijlstra
  2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
@ 2020-06-03  1:17   ` Sean Christopherson
  1 sibling, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2020-06-03  1:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, luto, linux-kernel, x86, Lai Jiangshan, andrew.cooper3,
	daniel.thompson, a.darwish, rostedt, bigeasy, Andy Lutomirski

On Fri, May 29, 2020 at 11:27:36PM +0200, Peter Zijlstra wrote:
> Because DRn access is 'difficult' with virt, but the DR7 read is
> cheaper than a cacheline miss on native, add a virt-specific
> fast path to local_db_save(), such that when breakpoints are not in
> use we avoid touching DRn entirely.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/debugreg.h |    7 ++++++-
>  arch/x86/kernel/hw_breakpoint.c |   26 ++++++++++++++++++++++----
>  arch/x86/kvm/vmx/nested.c       |    2 +-
>  3 files changed, 29 insertions(+), 6 deletions(-)

...

> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3028,9 +3028,9 @@ static int nested_vmx_check_vmentry_hw(s
>  	/*
>  	 * VMExit clears RFLAGS.IF and DR7, even on a consistency check.
>  	 */
> -	local_irq_enable();
>  	if (hw_breakpoint_active())
>  		set_debugreg(__this_cpu_read(cpu_dr7), 7);
> +	local_irq_enable();
>  	preempt_enable();

This should be a separate patch, probably with:

  Cc: stable@vger.kernel.org
  Fixes: 52017608da33 ("KVM: nVMX: add option to perform early consistency checks via H/W")
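
The reordering matters because of the comment in the hunk itself:
VMExit clears DR7 even on a consistency check, so the breakpoints have
to be re-armed while IRQs are still disabled. A sketch of the intended
sequence, using the same calls as the hunk above:

	/* DR7 was cleared by the VMExit; restore it before anything
	 * interruptible can run with data breakpoints disarmed. */
	if (hw_breakpoint_active())
		set_debugreg(__this_cpu_read(cpu_dr7), 7);
	local_irq_enable();	/* only now allow interrupts */
	preempt_enable();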


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tip: x86/entry] x86/entry: Use __irq_exit_rcu() in irq_exit()
  2020-06-02 15:05     ` Peter Zijlstra
  2020-06-02 18:47       ` Qian Cai
@ 2020-06-03 17:50       ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-03 17:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Qian Cai, Peter Zijlstra (Intel), Thomas Gleixner, x86, LKML

The following commit has been merged into the x86/entry branch of tip:

Commit-ID:     10396895ab36357e676b894d89f64667ce226150
Gitweb:        https://git.kernel.org/tip/10396895ab36357e676b894d89f64667ce226150
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 03 Jun 2020 13:40:15 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 03 Jun 2020 16:35:36 +02:00

x86/entry: Use __irq_exit_rcu() in irq_exit()

Because if you rename a function, you should also rename the users.

Fixes: b614345f52bc ("x86/entry: Clarify irq_{enter,exit}_rcu()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/20200602150511.GH706478@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200603114051.838509047@infradead.org

---
 kernel/softirq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index a3eb6eb..c4201b7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -438,7 +438,7 @@ void irq_exit_rcu(void)
  */
 void irq_exit(void)
 {
-	irq_exit_rcu();
+	__irq_exit_rcu();
 	rcu_irq_exit();
 	 /* must be last! */
 	lockdep_hardirq_exit();

^ permalink raw reply related	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-06-03 17:51 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-29 21:27 [PATCH 00/14] x86/entry: disallow #DB more and x86/entry lockdep/nmi Peter Zijlstra
2020-05-29 21:27 ` [PATCH 01/14] x86/hw_breakpoint: Add within_area() to check data breakpoints Peter Zijlstra
2020-05-29 21:27 ` [PATCH 02/14] x86/hw_breakpoint: Prevent data breakpoints on direct GDT Peter Zijlstra
2020-05-30 12:45   ` Andrew Cooper
2020-05-30 15:15     ` Lai Jiangshan
2020-05-29 21:27 ` [PATCH 03/14] x86/hw_breakpoint: Prevent data breakpoints on per_cpu cpu_tss_rw Peter Zijlstra
2020-05-29 21:27 ` [PATCH 04/14] x86/hw_breakpoint: Prevent data breakpoints on user_pcid_flush_mask Peter Zijlstra
2020-05-29 21:27 ` [PATCH 05/14] x86/entry: Introduce local_db_{save,restore}() Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 06/14] x86/entry, nmi: Disable #DB Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 07/14] x86/entry, mce: Disallow #DB during #MC Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 08/14] x86/entry: Optimize local_db_save() for virt Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-06-03  1:17   ` [PATCH 08/14] " Sean Christopherson
2020-05-29 21:27 ` [PATCH 09/14] x86/entry: Remove debug IDT frobbing Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 10/14] x86/entry: Remove DBn stacks Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu() Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-06-02 14:42   ` [PATCH 11/14] " Qian Cai
2020-06-02 15:05     ` Peter Zijlstra
2020-06-02 18:47       ` Qian Cai
2020-06-03 17:50       ` [tip: x86/entry] x86/entry: Use __irq_exit_rcu() in irq_exit() tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 12/14] x86/entry: Rename trace_hardirqs_off_prepare() Peter Zijlstra
2020-05-30  9:57   ` [tip: x86/entry] " tip-bot2 for Peter Zijlstra
2020-05-29 21:27 ` [PATCH 13/14] lockdep: Prepare for NMI IRQ state tracking Peter Zijlstra
2020-05-29 22:14   ` Steven Rostedt
2020-05-29 22:25     ` Peter Zijlstra
2020-05-29 22:28       ` Steven Rostedt
2020-05-29 22:33       ` Peter Zijlstra
2020-06-02 20:00       ` Peter Zijlstra
2020-05-29 21:27 ` [PATCH 14/14] x86/entry: Fix NMI vs " Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).