* [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum
@ 2016-07-01  0:12 Dave Hansen
  2016-07-01  0:12 ` [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Dave Hansen
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen, minchan

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.
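
To make the overlap concrete: at the time of this series, the
x86-64 swap type lives in PTE bits 1-5 while the Accessed bit is
bit 5, so a stray A lands in the top bit of the type field.  A
little userspace arithmetic illustrates it (just a sketch; the bit
positions are my reading of the current layout, not lifted from
the headers):

	#include <stdio.h>
	#include <stdint.h>

	#define SWP_TYPE_SHIFT	1		/* swap type in PTE bits 1-5 */
	#define SWP_TYPE_BITS	5
	#define _PAGE_ACCESSED	(1UL << 5)	/* overlaps the top type bit */

	static unsigned swp_type(uint64_t pte)
	{
		return (pte >> SWP_TYPE_SHIFT) & ((1U << SWP_TYPE_BITS) - 1);
	}

	int main(void)
	{
		uint64_t swp_pte = 1UL << SWP_TYPE_SHIFT;	/* swap type 1 */

		printf("type: %u\n", swp_type(swp_pte));	/* prints 1 */
		printf("type with stray A: %u\n",
		       swp_type(swp_pte | _PAGE_ACCESSED));	/* prints 17 */
		return 0;
	}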

This patch causes the page unmap path in vmscan/direct reclaim to
flush remote TLBs after clearing each page and to clear the PTE
again after the flush.  For reclaim, this brings the behavior (and
associated reclaim performance) back to what it was before Mel's
changes that increased TLB flush batching.

For the unmap path, this patch may force some additional flushes, but
they are limited to a maximum of one per PTE page.  The patch clears
these stray A/D bits before releasing the pagetable lock, which
prevents other parts of the kernel from observing the stray bits.

Andi Kleen wrote the original version of this patch, and Dave Hansen
added the batching.  The original version was much simpler, but it
did too many extra TLB flushes, which killed performance.

v3: huge rework to keep batching working in unmap case
v2: out of line. avoid single thread flush. cover more clear
    cases

Cc: Minchan Kim <minchan@kernel.org>


* [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro
  2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
@ 2016-07-01  0:12 ` Dave Hansen
  2016-07-01  9:23   ` Borislav Petkov
  2016-07-01  0:12 ` [PATCH 2/6] mm, tlb: add mmu_gather->saw_unset_a_or_d Dave Hansen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen,
	dave.hansen, luto, stable


From: Dave Hansen <dave.hansen@linux.intel.com>

cpufeatures.h currently defines X86_BUG(9) twice on 32-bit:

	#define X86_BUG_NULL_SEG        X86_BUG(9) /* Nulling a selector preserves the base */
	...
	#ifdef CONFIG_X86_32
	#define X86_BUG_ESPFIX          X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */
	#endif

I think what happened was that this commit added X86_BUG_ESPFIX,
but placed it in an #ifdef below most of the other bugs:

	[58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled

Then this commit came along and added X86_BUG_NULL_SEG, which
collided with the earlier one because X86_BUG_ESPFIX sat below
the main block defining all the X86_BUG()s.

	[7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
---

 b/arch/x86/include/asm/cpufeatures.h |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h
--- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros	2016-06-30 17:10:41.215185869 -0700
+++ b/arch/x86/include/asm/cpufeatures.h	2016-06-30 17:10:41.218186005 -0700
@@ -301,10 +301,6 @@
 #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
 #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
-#define X86_BUG_NULL_SEG	X86_BUG(9) /* Nulling a selector preserves the base */
-#define X86_BUG_SWAPGS_FENCE	X86_BUG(10) /* SWAPGS without input dep on GS */
-
-
 #ifdef CONFIG_X86_32
 /*
  * 64-bit kernels don't use X86_BUG_ESPFIX.  Make the define conditional
@@ -312,5 +308,7 @@
  */
 #define X86_BUG_ESPFIX		X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */
 #endif
+#define X86_BUG_NULL_SEG	X86_BUG(10) /* Nulling a selector preserves the base */
+#define X86_BUG_SWAPGS_FENCE	X86_BUG(11) /* SWAPGS without input dep on GS */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
_


* [PATCH 2/6] mm, tlb: add mmu_gather->saw_unset_a_or_d
  2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
  2016-07-01  0:12 ` [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Dave Hansen
@ 2016-07-01  0:12 ` Dave Hansen
  2016-07-01  0:12 ` [PATCH 3/6] mm: add force_batch_flush to mmu_gather Dave Hansen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Add a field (->saw_unset_a_or_d) to the asm-generic version of
mmu_gather.  We will use this on x86 to indicate when a PTE got
cleared that might potentially have a stray Accessed or Dirty bit
set.

Note that since ->saw_unset_a_or_d shares space in a bitfield
with ->fullmm and ->need_flush_all, there's no incremental
storage cost.  In addition, since it is initialized to 0 like
->need_flush_all, they can likely be initialized together,
leading to no real cost for having ->saw_unset_a_or_d around.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/include/asm-generic/tlb.h |    7 ++++++-
 b/mm/memory.c               |    6 ++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff -puN include/asm-generic/tlb.h~knl-leak-20-saw_unset_a_or_d include/asm-generic/tlb.h
--- a/include/asm-generic/tlb.h~knl-leak-20-saw_unset_a_or_d	2016-06-30 17:10:41.606203608 -0700
+++ b/include/asm-generic/tlb.h	2016-06-30 17:10:41.611203835 -0700
@@ -101,7 +101,12 @@ struct mmu_gather {
 	unsigned int		fullmm : 1,
 	/* we have performed an operation which
 	 * requires a complete flush of the tlb */
-				need_flush_all : 1;
+				need_flush_all : 1,
+	/* we cleared a PTE bit which may potentially
+	 * get set by hardware */
+				saw_unset_a_or_d: 1;
+
+
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
diff -puN mm/memory.c~knl-leak-20-saw_unset_a_or_d mm/memory.c
--- a/mm/memory.c~knl-leak-20-saw_unset_a_or_d	2016-06-30 17:10:41.607203654 -0700
+++ b/mm/memory.c	2016-06-30 17:10:41.614203971 -0700
@@ -222,8 +222,10 @@ void tlb_gather_mmu(struct mmu_gather *t
 	tlb->mm = mm;
 
 	/* Is it from 0 to ~0? */
-	tlb->fullmm     = !(start | (end+1));
-	tlb->need_flush_all = 0;
+	tlb->fullmm		= !(start | (end+1));
+	tlb->need_flush_all	= 0;
+	tlb->saw_unset_a_or_d	= 0;
+
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
_


* [PATCH 3/6] mm: add force_batch_flush to mmu_gather
  2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
  2016-07-01  0:12 ` [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Dave Hansen
  2016-07-01  0:12 ` [PATCH 2/6] mm, tlb: add mmu_gather->saw_unset_a_or_d Dave Hansen
@ 2016-07-01  0:12 ` Dave Hansen
  2016-07-01  0:12 ` [PATCH 4/6] mm: move flush in madvise_free_pte_range() Dave Hansen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, zap_pte_range() has a local variable called
'force_flush'.  It is set when the zapping code runs into an
entry that requires flushing before the ptl is released.

There are currently two reasons we might do that:
1. The TLB batching in __tlb_remove_page() has run out of
   space and can no longer record new pages being flushed.
2. An entry for a dirty page was cleared, and we need to
   ensure that software walking the page tables cannot
   observe the cleared PTE before we flush the TLB.

We need the x86 code to be able to force a flush for an
additional reason: if the PTE being cleared might have a stray
Accessed or Dirty bit set on it because of a hardware erratum.
For these purposes, we also need to flush before the ptl is
released.  So, we move the 'force_flush' variable into the
mmu_gather structure where we can set it from arch code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/arm/include/asm/tlb.h  |    1 +
 b/arch/ia64/include/asm/tlb.h |    1 +
 b/arch/s390/include/asm/tlb.h |    1 +
 b/arch/sh/include/asm/tlb.h   |    1 +
 b/arch/um/include/asm/tlb.h   |    1 +
 b/include/asm-generic/tlb.h   |    3 +++
 b/mm/memory.c                 |   22 ++++++++++++----------
 7 files changed, 20 insertions(+), 10 deletions(-)

diff -puN arch/arm/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/arm/include/asm/tlb.h
--- a/arch/arm/include/asm/tlb.h~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.021222437 -0700
+++ b/arch/arm/include/asm/tlb.h	2016-06-30 17:10:42.037223163 -0700
@@ -69,6 +69,7 @@ struct mmu_gather {
 	unsigned int		need_flush;
 #endif
 	unsigned int		fullmm;
+	unsigned int		force_batch_flush;
 	struct vm_area_struct	*vma;
 	unsigned long		start, end;
 	unsigned long		range_start;
diff -puN arch/ia64/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/ia64/include/asm/tlb.h
--- a/arch/ia64/include/asm/tlb.h~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.022222482 -0700
+++ b/arch/ia64/include/asm/tlb.h	2016-06-30 17:10:42.037223163 -0700
@@ -57,6 +57,7 @@ struct mmu_gather {
 	unsigned int		nr;
 	unsigned int		max;
 	unsigned char		fullmm;		/* non-zero means full mm flush */
+	unsigned int		force_batch_flush; /* stop batching and flush */
 	unsigned char		need_flush;	/* really unmapped some PTEs? */
 	unsigned long		start, end;
 	unsigned long		start_addr;
diff -puN arch/s390/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/s390/include/asm/tlb.h
--- a/arch/s390/include/asm/tlb.h~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.024222573 -0700
+++ b/arch/s390/include/asm/tlb.h	2016-06-30 17:10:42.038223208 -0700
@@ -32,6 +32,7 @@ struct mmu_gather {
 	struct mm_struct *mm;
 	struct mmu_table_batch *batch;
 	unsigned int fullmm;
+	unsigned int force_batch_flush; /* stop batching and flush */
 	unsigned long start, end;
 };
 
diff -puN arch/sh/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/sh/include/asm/tlb.h
--- a/arch/sh/include/asm/tlb.h~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.025222618 -0700
+++ b/arch/sh/include/asm/tlb.h	2016-06-30 17:10:42.038223208 -0700
@@ -21,6 +21,7 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		fullmm;
+	unsigned int		force_batch_flush; /* stop batching and flush */
 	unsigned long		start, end;
 };
 
diff -puN arch/um/include/asm/tlb.h~knl-leak-30-tlb_force_flush arch/um/include/asm/tlb.h
--- a/arch/um/include/asm/tlb.h~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.027222709 -0700
+++ b/arch/um/include/asm/tlb.h	2016-06-30 17:10:42.039223253 -0700
@@ -20,6 +20,7 @@ struct mmu_gather {
 	unsigned long		start;
 	unsigned long		end;
 	unsigned int		fullmm; /* non-zero means full mm flush */
+	unsigned int		force_batch_flush; /* stop batching and flush */
 };
 
 static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
diff -puN include/asm-generic/tlb.h~knl-leak-30-tlb_force_flush include/asm-generic/tlb.h
--- a/include/asm-generic/tlb.h~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.028222754 -0700
+++ b/include/asm-generic/tlb.h	2016-06-30 17:10:42.039223253 -0700
@@ -102,6 +102,9 @@ struct mmu_gather {
 	/* we have performed an operation which
 	 * requires a complete flush of the tlb */
 				need_flush_all : 1,
+	/* need to flush the tlb and stop batching
+	 * before we release ptl */
+				force_batch_flush : 1,
 	/* we cleared a PTE bit which may potentially
 	 * get set by hardware */
 				saw_unset_a_or_d: 1;
diff -puN mm/memory.c~knl-leak-30-tlb_force_flush mm/memory.c
--- a/mm/memory.c~knl-leak-30-tlb_force_flush	2016-06-30 17:10:42.031222890 -0700
+++ b/mm/memory.c	2016-06-30 17:10:42.040223299 -0700
@@ -225,6 +225,7 @@ void tlb_gather_mmu(struct mmu_gather *t
 	tlb->fullmm		= !(start | (end+1));
 	tlb->need_flush_all	= 0;
 	tlb->saw_unset_a_or_d	= 0;
+	tlb->force_batch_flush	= 0;
 
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -1105,7 +1106,6 @@ static unsigned long zap_pte_range(struc
 				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
-	int force_flush = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
@@ -1151,7 +1151,7 @@ again:
 					 */
 					if (unlikely(details && details->ignore_dirty))
 						continue;
-					force_flush = 1;
+					tlb->force_batch_flush = 1;
 					set_page_dirty(page);
 				}
 				if (pte_young(ptent) &&
@@ -1163,7 +1163,7 @@ again:
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
-				force_flush = 1;
+				tlb->force_batch_flush = 1;
 				addr += PAGE_SIZE;
 				break;
 			}
@@ -1191,18 +1191,20 @@ again:
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (tlb->force_batch_flush)
 		tlb_flush_mmu_tlbonly(tlb);
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*
-	 * If we forced a TLB flush (either due to running out of
-	 * batch buffers or because we needed to flush dirty TLB
-	 * entries before releasing the ptl), free the batched
-	 * memory too. Restart if we didn't do everything.
+	 * If we forced a TLB flush, free the batched
+	 * memory too.  Restart if we didn't do everything.
+	 *
+	 * We force this due to running out of batch buffers,
+	 * needing to flush dirty TLB entries before releasing
+	 * the ptl, or for arch-specific reasons.
 	 */
-	if (force_flush) {
-		force_flush = 0;
+	if (tlb->force_batch_flush) {
+		tlb->force_batch_flush = 0;
 		tlb_flush_mmu_free(tlb);
 
 		if (addr != end)
_


* [PATCH 4/6] mm: move flush in madvise_free_pte_range()
  2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (2 preceding siblings ...)
  2016-07-01  0:12 ` [PATCH 3/6] mm: add force_batch_flush to mmu_gather Dave Hansen
@ 2016-07-01  0:12 ` Dave Hansen
  2016-07-01  0:12 ` [PATCH 5/6] mm: make tlb_flush_mmu_tlbonly() return whether it flushed Dave Hansen
  2016-07-01  0:12 ` [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Dave Hansen
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen,
	dave.hansen, minchan


From: Dave Hansen <dave.hansen@linux.intel.com>

I think this code is OK and does not *need* to be patched.  We
are just rewriting the PTE without the Accessed and Dirty bits.
The hardware could come along and set them at any time with or
without the erratum that this series addresses.

But this does make the ptep_get_and_clear_full() and
tlb_remove_tlb_entry() calls here more consistent with the other
places where they are used together, making the pattern look
*obviously* the same across call-sites.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---

 b/mm/madvise.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/madvise.c~knl-leak-40-madvise_free_pte_range-move-flush mm/madvise.c
--- a/mm/madvise.c~knl-leak-40-madvise_free_pte_range-move-flush	2016-06-30 17:10:42.557246755 -0700
+++ b/mm/madvise.c	2016-06-30 17:10:42.561246936 -0700
@@ -369,13 +369,13 @@ static int madvise_free_pte_range(pmd_t
 			 */
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			tlb_remove_tlb_entry(tlb, pte, addr);
 
 			ptent = pte_mkold(ptent);
 			ptent = pte_mkclean(ptent);
 			set_pte_at(mm, addr, pte, ptent);
 			if (PageActive(page))
 				deactivate_page(page);
-			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 	}
 out:
_


* [PATCH 5/6] mm: make tlb_flush_mmu_tlbonly() return whether it flushed
  2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (3 preceding siblings ...)
  2016-07-01  0:12 ` [PATCH 4/6] mm: move flush in madvise_free_pte_range() Dave Hansen
@ 2016-07-01  0:12 ` Dave Hansen
  2016-07-01  0:12 ` [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Dave Hansen
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

When batching TLB operations, we track the scope of the needed
TLB invalidation inside the 'struct mmu_gather'.  With
mmu_gather->end, we indicate whether any operations required
a flush.

If the flush was not performed, then we know that no
pte_present() PTEs were found and no workaround is needed.
But, we do not know this unless tlb_flush_mmu_tlbonly()
tells us whether it flushed or not.

Have it return a bool to tell us.
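
Patch 6 consumes the return value in zap_pte_range() so that the
extra pass to re-clear stray bits only runs when a flush actually
happened:

	/* Do the actual TLB flush before dropping ptl */
	if (tlb->force_batch_flush) {
		bool did_flush = tlb_flush_mmu_tlbonly(tlb);
		if (did_flush)
			__tlb_cleanup_pte_range(tlb, start_pte, pte);
	}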

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/arm/include/asm/tlb.h  |    3 ++-
 b/arch/ia64/include/asm/tlb.h |    3 ++-
 b/arch/s390/include/asm/tlb.h |    3 ++-
 b/arch/sh/include/asm/tlb.h   |    1 +
 b/arch/um/include/asm/tlb.h   |    3 ++-
 b/mm/memory.c                 |    5 +++--
 6 files changed, 12 insertions(+), 6 deletions(-)

diff -puN arch/arm/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/arm/include/asm/tlb.h
--- a/arch/arm/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly	2016-06-30 17:10:42.947264449 -0700
+++ b/arch/arm/include/asm/tlb.h	2016-06-30 17:10:42.958264947 -0700
@@ -126,12 +126,13 @@ static inline void __tlb_alloc_page(stru
 	}
 }
 
-static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	tlb_flush(tlb);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+	return true;
 }
 
 static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
diff -puN arch/ia64/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/ia64/include/asm/tlb.h
--- a/arch/ia64/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly	2016-06-30 17:10:42.948264494 -0700
+++ b/arch/ia64/include/asm/tlb.h	2016-06-30 17:10:42.959264993 -0700
@@ -219,9 +219,10 @@ static inline int __tlb_remove_page(stru
 	return tlb->max - tlb->nr;
 }
 
-static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	ia64_tlb_flush_mmu_tlbonly(tlb, tlb->start_addr, tlb->end_addr);
+	return true;
 }
 
 static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
diff -puN arch/s390/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/s390/include/asm/tlb.h
--- a/arch/s390/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly	2016-06-30 17:10:42.950264585 -0700
+++ b/arch/s390/include/asm/tlb.h	2016-06-30 17:10:42.960265038 -0700
@@ -60,9 +60,10 @@ static inline void tlb_gather_mmu(struct
 	tlb->batch = NULL;
 }
 
-static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static inline bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	__tlb_flush_mm_lazy(tlb->mm);
+	return true;
 }
 
 static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
diff -puN arch/sh/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/sh/include/asm/tlb.h
--- a/arch/sh/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly	2016-06-30 17:10:42.952264675 -0700
+++ b/arch/sh/include/asm/tlb.h	2016-06-30 17:10:42.960265038 -0700
@@ -89,6 +89,7 @@ tlb_end_vma(struct mmu_gather *tlb, stru
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
+	return true;
 }
 
 static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
diff -puN arch/um/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly arch/um/include/asm/tlb.h
--- a/arch/um/include/asm/tlb.h~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly	2016-06-30 17:10:42.953264721 -0700
+++ b/arch/um/include/asm/tlb.h	2016-06-30 17:10:42.960265038 -0700
@@ -59,10 +59,11 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 			       unsigned long end);
 
-static inline void
+static inline bool
 tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end);
+	return true;
 }
 
 static inline void
diff -puN mm/memory.c~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly mm/memory.c
--- a/mm/memory.c~knl-leak-50-bool-return-tlb_flush_mmu_tlbonly	2016-06-30 17:10:42.955264811 -0700
+++ b/mm/memory.c	2016-06-30 17:10:42.962265129 -0700
@@ -240,10 +240,10 @@ void tlb_gather_mmu(struct mmu_gather *t
 	__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	if (!tlb->end)
-		return;
+		return false;
 
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
@@ -251,6 +251,7 @@ static void tlb_flush_mmu_tlbonly(struct
 	tlb_table_flush(tlb);
 #endif
 	__tlb_reset_range(tlb);
+	return true;
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
_


* [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (4 preceding siblings ...)
  2016-07-01  0:12 ` [PATCH 5/6] mm: make tlb_flush_mmu_tlbonly() return whether it flushed Dave Hansen
@ 2016-07-01  0:12 ` Dave Hansen
  2016-07-01  1:50   ` Nadav Amit
  2016-07-01  2:55   ` Linus Torvalds
  5 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  0:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

There are basically three things we have to do to work around this
erratum (a rough sketch follows the list):

1. Extra TLB flushes when we clear PTEs that might be affected by
   this erratum.
2. An extra pass back over the suspect PTEs, after the TLBs have
   been flushed, to clear the stray bits.
3. Make sure to hold ptl in pte_unmap_same() (underneath
   do_swap_page()) to keep the swap code from observing the bad
   entries between #1 and #2.

Notes:
 * The little pte++ in zap_pte_range() is to ensure that 'pte'
   points _past_ the last PTE that was cleared so that the
   whole range can be cleaned up.
 * We could fold more of the new arch_*() helpers into the
   existing TLB flush callers if we passed the old 'ptent'
   in as an argument.  That would require a more invasive
   rework, though.
 * change_pte_range() does not need to be modified.  It fully
   writes back over the PTE after clearing it.  It also does
   this inside the ptl, so the cleared PTE potentially
   containing stray bits is never visible to anyone.
 * move_ptes() does remote TLB flushes after each PTE clear.
   This is slow, but mremap() is not as important as munmap().
   Leave it simple for now.
 * We could apply the A/D optimization to huge pages too.
 * As far as I can tell, sites that just change PTE permissions
   are OK.  They generally do some variant of:
	ptent = ptep_get_and_clear(...);
	ptent = pte_mksomething(ptent);
	set_pte_at(mm, addr, pte, ptent);
	tlb_flush...();
   This is OK because the cleared PTE (which might contain
   the stray bits) is written over by set_pte_at() and this
   all happens under the pagetable lock.
   Examples of this:
    * madvise_free_pte_range()
    * hugetlb_change_protection()
    * clear_soft_dirty()
    * move_ptes() - the new PTE will not fault, so it will not
      hit the erratum
    * change_pte_range()

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/arm/include/asm/tlb.h         |    1 
 b/arch/ia64/include/asm/tlb.h        |    1 
 b/arch/s390/include/asm/tlb.h        |    1 
 b/arch/sh/include/asm/tlb.h          |    1 
 b/arch/um/include/asm/tlb.h          |    2 
 b/arch/x86/include/asm/cpufeatures.h |    1 
 b/arch/x86/include/asm/pgtable.h     |   32 +++++++++
 b/arch/x86/include/asm/tlbflush.h    |   37 +++++++++++
 b/arch/x86/kernel/cpu/intel.c        |  113 +++++++++++++++++++++++++++++++++++
 b/include/asm-generic/tlb.h          |    4 +
 b/include/linux/mm.h                 |   17 +++++
 b/mm/memory.c                        |   12 ++-
 b/mm/mremap.c                        |    9 ++
 b/mm/rmap.c                          |    4 +
 b/mm/vmalloc.c                       |    1 
 15 files changed, 231 insertions(+), 5 deletions(-)

diff -puN arch/arm/include/asm/tlb.h~knl-leak-60-actual-fix arch/arm/include/asm/tlb.h
--- a/arch/arm/include/asm/tlb.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.455287496 -0700
+++ b/arch/arm/include/asm/tlb.h	2016-06-30 17:10:43.483288766 -0700
@@ -264,6 +264,7 @@ tlb_remove_pmd_tlb_entry(struct mmu_gath
 #define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
 
 #define tlb_migrate_finish(mm)		do { } while (0)
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0)
 
 #endif /* CONFIG_MMU */
 #endif
diff -puN arch/ia64/include/asm/tlb.h~knl-leak-60-actual-fix arch/ia64/include/asm/tlb.h
--- a/arch/ia64/include/asm/tlb.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.457287587 -0700
+++ b/arch/ia64/include/asm/tlb.h	2016-06-30 17:10:43.484288812 -0700
@@ -254,6 +254,7 @@ __tlb_remove_tlb_entry (struct mmu_gathe
 }
 
 #define tlb_migrate_finish(mm)	platform_tlb_migrate_finish(mm)
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0)
 
 #define tlb_start_vma(tlb, vma)			do { } while (0)
 #define tlb_end_vma(tlb, vma)			do { } while (0)
diff -puN arch/s390/include/asm/tlb.h~knl-leak-60-actual-fix arch/s390/include/asm/tlb.h
--- a/arch/s390/include/asm/tlb.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.460287723 -0700
+++ b/arch/s390/include/asm/tlb.h	2016-06-30 17:10:43.484288812 -0700
@@ -146,5 +146,6 @@ static inline void pud_free_tlb(struct m
 #define tlb_remove_tlb_entry(tlb, ptep, addr)	do { } while (0)
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, addr)	do { } while (0)
 #define tlb_migrate_finish(mm)			do { } while (0)
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0)
 
 #endif /* _S390_TLB_H */
diff -puN arch/sh/include/asm/tlb.h~knl-leak-60-actual-fix arch/sh/include/asm/tlb.h
--- a/arch/sh/include/asm/tlb.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.461287768 -0700
+++ b/arch/sh/include/asm/tlb.h	2016-06-30 17:10:43.484288812 -0700
@@ -116,6 +116,7 @@ static inline void tlb_remove_page(struc
 #define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
 
 #define tlb_migrate_finish(mm)		do { } while (0)
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0)
 
 #if defined(CONFIG_CPU_SH4) || defined(CONFIG_SUPERH64)
 extern void tlb_wire_entry(struct vm_area_struct *, unsigned long, pte_t);
diff -puN arch/um/include/asm/tlb.h~knl-leak-60-actual-fix arch/um/include/asm/tlb.h
--- a/arch/um/include/asm/tlb.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.463287859 -0700
+++ b/arch/um/include/asm/tlb.h	2016-06-30 17:10:43.485288857 -0700
@@ -133,4 +133,6 @@ static inline void tlb_remove_page(struc
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0)
+
 #endif
diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-60-actual-fix arch/x86/include/asm/cpufeatures.h
--- a/arch/x86/include/asm/cpufeatures.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.464287904 -0700
+++ b/arch/x86/include/asm/cpufeatures.h	2016-06-30 17:10:43.485288857 -0700
@@ -310,5 +310,6 @@
 #endif
 #define X86_BUG_NULL_SEG	X86_BUG(10) /* Nulling a selector preserves the base */
 #define X86_BUG_SWAPGS_FENCE	X86_BUG(11) /* SWAPGS without input dep on GS */
+#define X86_BUG_PTE_LEAK	X86_BUG(12) /* PTE may leak A/D bits after clear */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff -puN arch/x86/include/asm/pgtable.h~knl-leak-60-actual-fix arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.466287995 -0700
+++ b/arch/x86/include/asm/pgtable.h	2016-06-30 17:10:43.486288902 -0700
@@ -794,6 +794,12 @@ extern int ptep_test_and_clear_young(str
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
 
+#ifdef CONFIG_CPU_SUP_INTEL
+#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
+extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
+			      unsigned long address, pte_t *ptep);
+#endif
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pte_t *ptep)
@@ -956,6 +962,32 @@ static inline u16 pte_flags_pkey(unsigne
 #endif
 }
 
+#ifdef CONFIG_CPU_SUP_INTEL
+/*
+ * These are all specific to working around an Intel-specific
+ * bug and the out-of-line code is all defined in intel.c.
+ */
+extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep);
+#define ARCH_HAS_FIX_PTE_LEAK 1
+static inline void arch_fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep)
+{
+	if (static_cpu_has_bug(X86_BUG_PTE_LEAK))
+		fix_pte_leak(mm, addr, ptep);
+}
+#define ARCH_HAS_NEEDS_SWAP_PTL 1
+static inline bool arch_needs_swap_ptl(void)
+{
+	return static_cpu_has_bug(X86_BUG_PTE_LEAK);
+}
+#define ARCH_DISABLE_DEFERRED_FLUSH 1
+static inline bool arch_disable_deferred_flush(void)
+{
+	return static_cpu_has_bug(X86_BUG_PTE_LEAK);
+}
+#endif /* CONFIG_CPU_SUP_INTEL */
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/include/asm/tlbflush.h~knl-leak-60-actual-fix arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.468288086 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2016-06-30 17:10:43.486288902 -0700
@@ -324,4 +324,41 @@ static inline void reset_lazy_tlbstate(v
 	native_flush_tlb_others(mask, mm, start, end)
 #endif
 
+static inline bool intel_detect_leaked_pte(pte_t ptent)
+{
+	/*
+	 * Hardware sets the dirty bit only on writable ptes and
+	 * only on ptes where dirty is unset.
+	 */
+	if (pte_write(ptent) && !pte_dirty(ptent))
+		return true;
+	/*
+	 * The leak occurs when the hardware sees an unset A/D bit
+	 * and tries to set it.  If the PTE has both A/D bits
+	 * set, the hardware will not try to set them and there
+	 * is no chance for a leak.
+	 */
+	if (!pte_young(ptent))
+		return true;
+
+	return false;
+}
+
+#ifdef CONFIG_CPU_SUP_INTEL
+void intel_cleanup_pte_range(struct mmu_gather *tlb, pte_t *start_ptep,
+		pte_t *end_ptep);
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {		\
+	if (static_cpu_has_bug(X86_BUG_PTE_LEAK))			\
+		intel_cleanup_pte_range(tlb, start_ptep, end_ptep);	\
+} while (0)
+#define ARCH_FLUSH_CLEARED_PTE 1
+#define arch_flush_cleared_pte(tlb, ptent) do {				\
+	if (static_cpu_has_bug(X86_BUG_PTE_LEAK) &&			\
+	    intel_detect_leaked_pte(ptent)) {				\
+		tlb->saw_unset_a_or_d = 1;				\
+		tlb->force_batch_flush = 1;				\
+	}								\
+} while (0)
+#endif /* CONFIG_CPU_SUP_INTEL */
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff -puN arch/x86/kernel/cpu/intel.c~knl-leak-60-actual-fix arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~knl-leak-60-actual-fix	2016-06-30 17:10:43.469288131 -0700
+++ b/arch/x86/kernel/cpu/intel.c	2016-06-30 17:10:43.487288948 -0700
@@ -9,10 +9,14 @@
 #include <linux/uaccess.h>
 
 #include <asm/cpufeature.h>
+#include <asm/intel-family.h>
 #include <asm/pgtable.h>
 #include <asm/msr.h>
 #include <asm/bugs.h>
 #include <asm/cpu.h>
+#include <asm/tlb.h>
+
+#include <trace/events/tlb.h>
 
 #ifdef CONFIG_X86_64
 #include <linux/topology.h>
@@ -181,6 +185,11 @@ static void early_init_intel(struct cpui
 		}
 	}
 
+	if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) {
+		pr_info_once("x86/intel: Enabling PTE leaking workaround\n");
+		set_cpu_bug(c, X86_BUG_PTE_LEAK);
+	}
+
 	/*
 	 * Intel Quark Core DevMan_001.pdf section 6.4.11
 	 * "The operating system also is required to invalidate (i.e., flush)
@@ -820,3 +829,107 @@ static const struct cpu_dev intel_cpu_de
 
 cpu_dev_register(intel_cpu_dev);
 
+/*
+ * Workaround for KNL issue:
+ *
+ * A thread that is about to page fault due to P=0 may still
+ * non-atomically set the A or D bits, which could corrupt swap
+ * entries.  Always flush the other CPUs and clear the PTE again
+ * to avoid this leakage.  We are serialized by the pagetable lock.
+ *
+ * This only needs to be called on processors that might "leak"
+ * A or D bits and have X86_BUG_PTE_LEAK set.
+ */
+void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
+		flush_tlb_others(mm_cpumask(mm), mm, addr,
+				 addr + PAGE_SIZE);
+		set_pte(ptep, __pte(0));
+	}
+}
+
+/*
+ * We batched a bunch of PTE clears up.  After the TLB has been
+ * flushed for the whole batch, we might have had some leaked
+ * A or D bits and need to clear them here.
+ *
+ * This should be called with the page table lock still held.
+ *
+ * This only needs to be called on processors that might "leak"
+ * A or D bits and have X86_BUG_PTE_LEAK set.
+ */
+void intel_cleanup_pte_range(struct mmu_gather *tlb, pte_t *start_ptep,
+		pte_t *end_ptep)
+{
+	pte_t *pte;
+
+	/*
+	 * fullmm means nobody will care that we have leaked bits
+	 * lying around.  We also skip TLB flushes when doing
+	 * fullmm teardown, so the additional pte clearing would
+	 * not help the issue.
+	 */
+	if (tlb->fullmm)
+		return;
+
+	/*
+	 * If none of the PTEs hit inside intel_detect_leaked_pte(),
+	 * then we have nothing that might have been leaked and
+	 * nothing to clear.
+	 */
+	if (!tlb->saw_unset_a_or_d)
+		return;
+
+	/*
+	 * Contexts calling us with NULL ptep's do not have any
+	 * PTEs for us to go clear because they did not do any
+	 * actual TLB invalidation.
+	 */
+	if (!start_ptep || !end_ptep)
+		return;
+
+	/*
+	 * Mark that the workaround is no longer needed for
+	 * this batch.
+	 */
+	tlb->saw_unset_a_or_d = 0;
+
+	/*
+	 * Ensure that the compiler orders our set_pte()
+	 * after the preceding TLB flush no matter what.
+	 */
+	barrier();
+
+	/*
+	 * Re-clear out all the PTEs into which the hardware
+	 * may have leaked Accessed or Dirty bits.
+	 */
+	for (pte = start_ptep; pte < end_ptep; pte++)
+		set_pte(pte, __pte(0));
+}
+
+/*
+ * Kinda weird to define this in here, but we only use it for
+ * an Intel-specific issue.  This will get used on all
+ * processors (even non-Intel) if CONFIG_CPU_SUP_INTEL=y.
+ */
+pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
+		       pte_t *ptep)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t pte;
+
+	pte = ptep_get_and_clear(mm, address, ptep);
+	if (pte_accessible(mm, pte)) {
+		flush_tlb_page(vma, address);
+		/*
+		 * Ensure that the compiler orders our set_pte()
+		 * after the flush_tlb_page() no matter what.
+		 */
+		barrier();
+		if (static_cpu_has_bug(X86_BUG_PTE_LEAK))
+			set_pte(ptep, __pte(0));
+	}
+	return pte;
+}
diff -puN include/asm-generic/tlb.h~knl-leak-60-actual-fix include/asm-generic/tlb.h
--- a/include/asm-generic/tlb.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.471288222 -0700
+++ b/include/asm-generic/tlb.h	2016-06-30 17:10:43.489289039 -0700
@@ -226,4 +226,8 @@ static inline void __tlb_reset_range(str
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
+#ifndef __tlb_cleanup_pte_range
+#define __tlb_cleanup_pte_range(tlb, start_ptep, end_ptep) do {} while (0)
+#endif
+
 #endif /* _ASM_GENERIC__TLB_H */
diff -puN include/linux/mm.h~knl-leak-60-actual-fix include/linux/mm.h
--- a/include/linux/mm.h~knl-leak-60-actual-fix	2016-06-30 17:10:43.472288267 -0700
+++ b/include/linux/mm.h	2016-06-30 17:10:43.492289175 -0700
@@ -2404,6 +2404,23 @@ static inline bool debug_guardpage_enabl
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#ifndef ARCH_HAS_NEEDS_SWAP_PTL
+static inline bool arch_needs_swap_ptl(void) { return false; }
+#endif
+
+#ifndef ARCH_HAS_FIX_PTE_LEAK
+static inline void arch_fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+				     pte_t *ptep) {}
+#endif
+
+#ifndef ARCH_DISABLE_DEFERRED_FLUSH
+static inline bool arch_disable_deferred_flush(void) { return false; }
+#endif
+
+#ifndef ARCH_FLUSH_CLEARED_PTE
+static inline void arch_flush_cleared_pte(struct mmu_gather *tlb, pte_t ptent) {}
+#endif
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
diff -puN mm/memory.c~knl-leak-60-actual-fix mm/memory.c
--- a/mm/memory.c~knl-leak-60-actual-fix	2016-06-30 17:10:43.474288358 -0700
+++ b/mm/memory.c	2016-06-30 17:10:43.496289356 -0700
@@ -1141,6 +1141,7 @@ again:
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
+			arch_flush_cleared_pte(tlb, ptent);
 			if (unlikely(!page))
 				continue;
 
@@ -1166,6 +1167,7 @@ again:
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
 				tlb->force_batch_flush = 1;
 				addr += PAGE_SIZE;
+				pte++;
 				break;
 			}
 			continue;
@@ -1192,8 +1194,11 @@ again:
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (tlb->force_batch_flush)
-		tlb_flush_mmu_tlbonly(tlb);
+	if (tlb->force_batch_flush) {
+		bool did_flush = tlb_flush_mmu_tlbonly(tlb);
+		if (did_flush)
+			__tlb_cleanup_pte_range(tlb, start_pte, pte);
+	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*
@@ -1965,7 +1970,8 @@ static inline int pte_unmap_same(struct
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
+	if (unlikely(arch_needs_swap_ptl() ||
+	    sizeof(pte_t) > sizeof(unsigned long))) {
 		spinlock_t *ptl = pte_lockptr(mm, pmd);
 		spin_lock(ptl);
 		same = pte_same(*page_table, orig_pte);
diff -puN mm/mremap.c~knl-leak-60-actual-fix mm/mremap.c
--- a/mm/mremap.c~knl-leak-60-actual-fix	2016-06-30 17:10:43.476288449 -0700
+++ b/mm/mremap.c	2016-06-30 17:10:43.497289401 -0700
@@ -24,6 +24,7 @@
 #include <linux/mm-arch-hooks.h>
 
 #include <asm/cacheflush.h>
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
 #include "internal.h"
@@ -144,10 +145,14 @@ static void move_ptes(struct vm_area_str
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
 				   new_pte++, new_addr += PAGE_SIZE) {
+		pte_t old_ptent;
+
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
-		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
+		old_ptent = ptep_get_and_clear(mm, old_addr, old_pte);
+		pte = move_pte(old_ptent, new_vma->vm_page_prot,
+				old_addr, new_addr);
+		arch_fix_pte_leak(mm, old_addr, old_pte);
 		pte = move_soft_dirty_pte(pte);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
diff -puN mm/rmap.c~knl-leak-60-actual-fix mm/rmap.c
--- a/mm/rmap.c~knl-leak-60-actual-fix	2016-06-30 17:10:43.478288539 -0700
+++ b/mm/rmap.c	2016-06-30 17:10:43.498289447 -0700
@@ -633,6 +633,10 @@ static bool should_defer_flush(struct mm
 {
 	bool should_defer = false;
 
+	/* x86 may need an immediate flush after a pte clear */
+	if (arch_disable_deferred_flush())
+		return false;
+
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
diff -puN mm/vmalloc.c~knl-leak-60-actual-fix mm/vmalloc.c
--- a/mm/vmalloc.c~knl-leak-60-actual-fix	2016-06-30 17:10:43.479288585 -0700
+++ b/mm/vmalloc.c	2016-06-30 17:10:43.500289538 -0700
@@ -66,6 +66,7 @@ static void vunmap_pte_range(pmd_t *pmd,
 	pte = pte_offset_kernel(pmd, addr);
 	do {
 		pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
+		arch_fix_pte_leak(&init_mm, addr, pte);
 		WARN_ON(!pte_none(ptent) && !pte_present(ptent));
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 }
_


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  0:12 ` [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Dave Hansen
@ 2016-07-01  1:50   ` Nadav Amit
  2016-07-01  1:54     ` Dave Hansen
  2016-07-01  2:55   ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Nadav Amit @ 2016-07-01  1:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen

Dave Hansen <dave@sr71.net> wrote:

> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
> +		       pte_t *ptep)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t pte;
> +
> +	pte = ptep_get_and_clear(mm, address, ptep);
> +	if (pte_accessible(mm, pte)) {
> +		flush_tlb_page(vma, address);
> +		/*
> +		 * Ensure that the compiler orders our set_pte()
> +		 * after the flush_tlb_page() no matter what.
> +		 */
> +		barrier();

I don’t think such a barrier (after a remote TLB flush) is needed.
Eventually, if a remote flush takes place, csd_lock_wait() ends up
being called, and then smp_rmb() is called (which is essentially a
barrier() call on x86).

Regards,
Nadav


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  1:50   ` Nadav Amit
@ 2016-07-01  1:54     ` Dave Hansen
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  1:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: LKML, x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen

On 06/30/2016 06:50 PM, Nadav Amit wrote:
> Dave Hansen <dave@sr71.net> wrote:
>> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>> +		       pte_t *ptep)
>> +{
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	pte_t pte;
>> +
>> +	pte = ptep_get_and_clear(mm, address, ptep);
>> +	if (pte_accessible(mm, pte)) {
>> +		flush_tlb_page(vma, address);
>> +		/*
>> +		 * Ensure that the compiler orders our set_pte()
>> +		 * after the flush_tlb_page() no matter what.
>> +		 */
>> +		barrier();
> 
> I don’t think such a barrier (after a remote TLB flush) is needed.
> Eventually, if a remote flush takes place, csd_lock_wait() ends up
> being called, and then smp_rmb() is called (which is essentially a
> barrier() call on x86).

Andi really wanted to make sure this got in here.  He said there was a
bug that bit him really badly once where a function got reordered.
Granted, a call _should_ be sufficient to keep the compiler from
reordering things, but this makes doubly sure.


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  0:12 ` [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Dave Hansen
  2016-07-01  1:50   ` Nadav Amit
@ 2016-07-01  2:55   ` Linus Torvalds
  2016-07-01  3:06     ` Brian Gerst
  2016-07-01  4:39     ` Dave Hansen
  1 sibling, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-07-01  2:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen <dave@sr71.net> wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
> Landing) has an erratum where a processor thread setting the Accessed
> or Dirty bits may not do so atomically against its checks for the
> Present bit.  This may cause a thread (which is about to page fault)
> to set A and/or D, even though the Present bit had already been
> atomically cleared.

So I don't think your approach is wrong, but I suspect this is
overkill, and what we should instead just do is to not use the A/D
bits at all in the swap representation.

The swap-entry representation was a bit tight on 32-bit page table
entries, but in 64-bit ones, I think we have tons of bits, don't we?
So we could decide just to not use those two bits on x86.

It's not like anybody will ever care about 32-bit page tables on
Knights Landing anyway.

So rather than add this kind of complexity and worry, how about just
simplifying the problem?

Or was there some discussion or implication I missed?

             Linus


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  2:55   ` Linus Torvalds
@ 2016-07-01  3:06     ` Brian Gerst
  2016-07-03 17:10       ` Dave Hansen
  2016-07-01  4:39     ` Dave Hansen
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Gerst @ 2016-07-01  3:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Linux Kernel Mailing List, the arch/x86 maintainers,
	linux-mm, Andrew Morton, Borislav Petkov, Andi Kleen,
	Michal Hocko, Dave Hansen

On Thu, Jun 30, 2016 at 10:55 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen <dave@sr71.net> wrote:
>>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
>> Landing) has an erratum where a processor thread setting the Accessed
>> or Dirty bits may not do so atomically against its checks for the
>> Present bit.  This may cause a thread (which is about to page fault)
>> to set A and/or D, even though the Present bit had already been
>> atomically cleared.
>
> So I don't think your approach is wrong, but I suspect this is
> overkill, and what we should instead just do is to not use the A/D
> bits at all in the swap representation.
>
> The swap-entry representation was a bit tight on 32-bit page table
> entries, but in 64-bit ones, I think we have tons of bits, don't we?
> So we could decide just to not use those two bits on x86.
>
> It's not like anybody will ever care about 32-bit page tables on
> Knights Landing anyway.

Could this affect a 32-bit guest VM?

--
Brian Gerst


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  2:55   ` Linus Torvalds
  2016-07-01  3:06     ` Brian Gerst
@ 2016-07-01  4:39     ` Dave Hansen
  2016-07-01  5:43       ` Linus Torvalds
  2016-07-01 16:07       ` Linus Torvalds
  1 sibling, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-01  4:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On 06/30/2016 07:55 PM, Linus Torvalds wrote:
> On Thu, Jun 30, 2016 at 5:12 PM, Dave Hansen <dave@sr71.net> wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
>> Landing) has an erratum where a processor thread setting the Accessed
>> or Dirty bits may not do so atomically against its checks for the
>> Present bit.  This may cause a thread (which is about to page fault)
>> to set A and/or D, even though the Present bit had already been
>> atomically cleared.
> 
> So I don't think your approach is wrong, but I suspect this is
> overkill, and what we should instead just do is to not use the A/D
> bits at all in the swap representation.

We actually don't even use Dirty today.  It's (implicitly) used to
determine pte_none(), but it ends up being masked out for the
swp_offset/type() calculations entirely, much to my surprise.

I think what you suggest will work if we don't consider A/D in
pte_none().  I think there are a bunch of code paths which assume that
!pte_present() && !pte_none() means swap.

> The swap-entry representation was a bit tight on 32-bit page table
> entries, but in 64-bit ones, I think we have tons of bits, don't we?
> So we could decide just to not use those two bits on x86.

Yeah, we've definitely got space.  I'll go poke around and make sure
that this works everywhere.  I agree that throwing 32-bit non-PAE under
the bus is definitely worth it here.  Nobody will care about that in a
million years.
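
For the sake of argument, one shape that could take (purely
illustrative -- the shift values here are made up, not from any
real layout) is to start the whole swap payload above both bits:

	/*
	 * Hypothetical sketch, not real kernel code: keep the swap
	 * type and offset entirely above the Accessed (5) and Dirty
	 * (6) bit positions so stray hardware sets can never touch
	 * them.
	 */
	#define SWP_TYPE_BITS		5
	#define SWP_PAYLOAD_FIRST_BIT	9	/* above A=5, D=6, PSE=7, G=8 */

	static inline unsigned long mk_swp_pte_val(unsigned long type,
						   unsigned long offset)
	{
		return (type << SWP_PAYLOAD_FIRST_BIT) |
		       (offset << (SWP_PAYLOAD_FIRST_BIT + SWP_TYPE_BITS));
	}

	static inline unsigned long swp_pte_val_type(unsigned long val)
	{
		return (val >> SWP_PAYLOAD_FIRST_BIT) &
		       ((1UL << SWP_TYPE_BITS) - 1);
	}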


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  4:39     ` Dave Hansen
@ 2016-07-01  5:43       ` Linus Torvalds
  2016-07-01 14:25         ` Eric W. Biederman
  2016-07-01 16:07       ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-07-01  5:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen <dave@sr71.net> wrote:
>
> I think what you suggest will work if we don't consider A/D in
> pte_none().  I think there are a bunch of code paths which assume that
> !pte_present() && !pte_none() means swap.

Yeah, we would need to change pte_none() to mask off D/A, but I think
that might be the only real change needed (other than making sure that
we don't use the bits in the swap entries, I didn't look at that part
at all)

           Linus


* Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro
  2016-07-01  0:12 ` [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Dave Hansen
@ 2016-07-01  9:23   ` Borislav Petkov
  2016-07-01 16:30     ` Andy Lutomirski
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2016-07-01  9:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, ak, mhocko,
	dave.hansen, luto, stable

On Thu, Jun 30, 2016 at 05:12:10PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> epufeatures.h currently defines X86_BUG(9) twice on 32-bit:
> 
> 	#define X86_BUG_NULL_SEG        X86_BUG(9) /* Nulling a selector preserves the base */
> 	...
> 	#ifdef CONFIG_X86_32
> 	#define X86_BUG_ESPFIX          X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */
> 	#endif
> 
> I think what happened was that this added the X86_BUG_ESPFIX, but
> in an #ifdef below most of the bugs:
> 
> 	[58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled
> 
> Then this came along and added X86_BUG_NULL_SEG, but collided
> with the earlier one that did the bug below the main block
> defining all the X86_BUG()s.
> 
> 	[7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Acked-by: Andy Lutomirski <luto@kernel.org>
> Cc: stable@vger.kernel.org
> ---
> 
>  b/arch/x86/include/asm/cpufeatures.h |    6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h
> --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros	2016-06-30 17:10:41.215185869 -0700
> +++ b/arch/x86/include/asm/cpufeatures.h	2016-06-30 17:10:41.218186005 -0700
> @@ -301,10 +301,6 @@
>  #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
>  #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
>  #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
> -#define X86_BUG_NULL_SEG	X86_BUG(9) /* Nulling a selector preserves the base */
> -#define X86_BUG_SWAPGS_FENCE	X86_BUG(10) /* SWAPGS without input dep on GS */
> -
> -
>  #ifdef CONFIG_X86_32
>  /*
>   * 64-bit kernels don't use X86_BUG_ESPFIX.  Make the define conditional

So I'd remove the "#ifdef CONFIG_X86_32" ifdeffery too and make that bit
unconditional - so what, we have enough free bits. But I'd leave the
comment to still avoid the confusion :)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  5:43       ` Linus Torvalds
@ 2016-07-01 14:25         ` Eric W. Biederman
  2016-07-01 15:51           ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Eric W. Biederman @ 2016-07-01 14:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Linux Kernel Mailing List, the arch/x86 maintainers,
	linux-mm, Andrew Morton, Borislav Petkov, Andi Kleen,
	Michal Hocko, Dave Hansen

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen <dave@sr71.net> wrote:
>>
>> I think what you suggest will work if we don't consider A/D in
>> pte_none().  I think there are a bunch of code paths which assume that
>> !pte_present() && !pte_none() means swap.
>
> Yeah, we would need to change pte_none() to mask off D/A, but I think
> that might be the only real change needed (other than making sure that
> we don't use the bits in the swap entries, I didn't look at that part
> at all)

It looks like __pte_to_swp_entry also needs to be changed to mask out
those bits when the swap code reads pte entries.  For all of the same
reasons as pte_none.

Eric


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01 14:25         ` Eric W. Biederman
@ 2016-07-01 15:51           ` Dave Hansen
  2016-07-01 18:12             ` Eric W. Biederman
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2016-07-01 15:51 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On 07/01/2016 07:25 AM, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen <dave@sr71.net> wrote:
>>> >>
>>> >> I think what you suggest will work if we don't consider A/D in
>>> >> pte_none().  I think there are a bunch of code paths which assume that
>>> >> !pte_present() && !pte_none() means swap.
>> >
>> > Yeah, we would need to change pte_none() to mask off D/A, but I think
>> > that might be the only real change needed (other than making sure that
>> > we don't use the bits in the swap entries, I didn't look at that part
>> > at all)
> It looks like __pte_to_swp_entry also needs to be changed to mask out
> those bits when the swap code reads pte entries.  For all of the same
> reasons as pte_none.

I guess that would be nice, but isn't it redundant?

static inline swp_entry_t pte_to_swp_entry(pte_t pte)
{
	...
        arch_entry = __pte_to_swp_entry(pte);
	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}

As long as __swp_type() and __swp_offset() don't let A/D through,
we should be OK.  This site is the only call to __pte_to_swp_entry()
that I can find in the entire codebase.

Or am I missing something?


* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  4:39     ` Dave Hansen
  2016-07-01  5:43       ` Linus Torvalds
@ 2016-07-01 16:07       ` Linus Torvalds
  2016-07-01 16:14         ` Dave Hansen
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-07-01 16:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen <dave@sr71.net> wrote:
>
> I think what you suggest will work if we don't consider A/D in
> pte_none().  I think there are a bunch of code paths which assume that
> !pte_present() && !pte_none() means swap.

Hmm.

Thinking about it some more, I still think it's a good idea to avoid
A/D bits in the swap entries and in pte_none() and friends, and it
might simplify some of this all.
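
Something like this, say (completely untested sketch, with the mask
spelled out inline just for illustration):

	/* sketch: ignore stray A/D when deciding that a pte is empty */
	static inline int pte_none(pte_t pte)
	{
		return !(pte_val(pte) & ~(_PAGE_ACCESSED | _PAGE_DIRTY));
	}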

But I also started worrying about us just losing sight of the dirty
bit in particular. It's not enough that we ignore the dirty bit - we'd
still want to make sure that the underlying backing page gets marked
dirty, even if the CPU is buggy and ends up doing it "delayed" after
we've already unmapped the page.

So I get this feeling that we may need a fair chunk of your patch-series anyway.

                Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01 16:07       ` Linus Torvalds
@ 2016-07-01 16:14         ` Dave Hansen
  2016-07-01 16:25           ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2016-07-01 16:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On 07/01/2016 09:07 AM, Linus Torvalds wrote:
> But I also started worrying about us just losing sight of the dirty
> bit in particular. It's not enough that we ignore the dirty bit - we'd
> still want to make sure that the underlying backing page gets marked
> dirty, even if the CPU is buggy and ends up doing it "delayed" after
> we've already unmapped the page.
> 
> So I get this feeling that we may need a fair chunk of your
> patch-series anyway.

As I understand it, the erratum only affects a thread which is about to
page fault.  The write associated with the dirty bit being set never
actually gets executed.  So, the bit really *is* stray and isn't
something we need to preserve.

Otherwise, we'd be really screwed because we couldn't ever simply clear it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01 16:14         ` Dave Hansen
@ 2016-07-01 16:25           ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-07-01 16:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On Fri, Jul 1, 2016 at 9:14 AM, Dave Hansen <dave@sr71.net> wrote:
>
> As I understand it, the erratum only affects a thread which is about to
> page fault.  The write associated with the dirty bit being set never
> actually gets executed.  So, the bit really *is* stray and isn't
> something we need to preserve.

Ok, good.

> Otherwise, we'd be really screwed because we couldn't ever simply clear it.

Oh, we could do the whole "clear the pte, then flush the tlb, then go
back and clear the stale dirty bits and move them into the backing
page".

I was afraid we might have to do something like that.

            Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro
  2016-07-01  9:23   ` Borislav Petkov
@ 2016-07-01 16:30     ` Andy Lutomirski
  2016-07-01 16:46       ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Lutomirski @ 2016-07-01 16:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Hocko, linux-kernel, stable, Andi Kleen, linux-mm,
	Linus Torvalds, Andrew Morton, X86 ML, Dave Hansen, Dave Hansen

On Jul 1, 2016 2:23 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Thu, Jun 30, 2016 at 05:12:10PM -0700, Dave Hansen wrote:
> >
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > cpufeatures.h currently defines X86_BUG(9) twice on 32-bit:
> >
> >       #define X86_BUG_NULL_SEG        X86_BUG(9) /* Nulling a selector preserves the base */
> >       ...
> >       #ifdef CONFIG_X86_32
> >       #define X86_BUG_ESPFIX          X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */
> >       #endif
> >
> > I think what happened was that this added the X86_BUG_ESPFIX, but
> > in an #ifdef below most of the bugs:
> >
> >       [58a5aac5] x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled
> >
> > Then this came along and added X86_BUG_NULL_SEG, but collided
> > with the earlier one, which had defined its bug below the main block
> > defining all the X86_BUG()s.
> >
> >       [7a5d6704] x86/cpu: Probe the behavior of nulling out a segment at boot time
> >
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Acked-by: Andy Lutomirski <luto@kernel.org>
> > Cc: stable@vger.kernel.org
> > ---
> >
> >  b/arch/x86/include/asm/cpufeatures.h |    6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff -puN arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros arch/x86/include/asm/cpufeatures.h
> > --- a/arch/x86/include/asm/cpufeatures.h~knl-leak-10-fix-x86-bugs-macros      2016-06-30 17:10:41.215185869 -0700
> > +++ b/arch/x86/include/asm/cpufeatures.h      2016-06-30 17:10:41.218186005 -0700
> > @@ -301,10 +301,6 @@
> >  #define X86_BUG_FXSAVE_LEAK  X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
> >  #define X86_BUG_CLFLUSH_MONITOR      X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
> >  #define X86_BUG_SYSRET_SS_ATTRS      X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
> > -#define X86_BUG_NULL_SEG     X86_BUG(9) /* Nulling a selector preserves the base */
> > -#define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */
> > -
> > -
> >  #ifdef CONFIG_X86_32
> >  /*
> >   * 64-bit kernels don't use X86_BUG_ESPFIX.  Make the define conditional
>
> So I'd remove the "#ifdef CONFIG_X86_32" ifdeffery too and make that bit
> unconditional - so what, we have enough free bits. But I'd leave the
> comment to still avoid the confusion :)
>

I put the ifdef there to prevent anyone from accidentally using it in
a 64-bit code path, not to save a bit.  We could put it in the middle of
the list to make the mistake much less likely to be repeated, I
suppose.
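
I.e., roughly this, with the bit numbers purely illustrative:

	#define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
	#ifdef CONFIG_X86_32
	#define X86_BUG_ESPFIX		X86_BUG(9) /* IRET to 16-bit SS corrupts ESP/RSP high bits */
	#endif
	#define X86_BUG_NULL_SEG	X86_BUG(10) /* Nulling a selector preserves the base */
	#define X86_BUG_SWAPGS_FENCE	X86_BUG(11) /* SWAPGS without input dep on GS */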

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro
  2016-07-01 16:30     ` Andy Lutomirski
@ 2016-07-01 16:46       ` Borislav Petkov
  2016-07-03 14:36         ` Andy Lutomirski
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2016-07-01 16:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Michal Hocko, linux-kernel, stable, Andi Kleen, linux-mm,
	Linus Torvalds, Andrew Morton, X86 ML, Dave Hansen, Dave Hansen

On Fri, Jul 01, 2016 at 09:30:37AM -0700, Andy Lutomirski wrote:
> I put the ifdef there to prevent anyone from accidentally using it in
> a 64-bit code path, not to save a bit.  We could put it in the middle of
> the list to make the mistake much less likely to be repeated, I
> suppose.

Well, if someone does, someone will notice pretty soon, no?

I just don't see the reason to worry but maybe I'm missing it.

And we can call it X86_BUG_ESPFIX_X86_32 or so too...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01 15:51           ` Dave Hansen
@ 2016-07-01 18:12             ` Eric W. Biederman
  0 siblings, 0 replies; 25+ messages in thread
From: Eric W. Biederman @ 2016-07-01 18:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	the arch/x86 maintainers, linux-mm, Andrew Morton,
	Borislav Petkov, Andi Kleen, Michal Hocko, Dave Hansen

Dave Hansen <dave@sr71.net> writes:

> On 07/01/2016 07:25 AM, Eric W. Biederman wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>>> > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen <dave@sr71.net> wrote:
>>>> >>
>>>> >> I think what you suggest will work if we don't consider A/D in
>>>> >> pte_none().  I think there are a bunch of code paths which assume that
>>>> >> !pte_present() && !pte_none() means swap.
>>> >
>>> > Yeah, we would need to change pte_none() to mask off D/A, but I think
>>> > that might be the only real change needed (other than making sure that
>>> > we don't use the bits in the swap entries, I didn't look at that part
>>> > at all)
>> It looks like __pte_to_swp_entry also needs to be changed to mask out
>> those bits when the swap code reads pte entries.  For all of the same
>> reasons as pte_none.
>
> I guess that would be nice, but isn't it redundant?
>
> static inline swp_entry_t pte_to_swp_entry(pte_t pte)
> {
> 	...
>         arch_entry = __pte_to_swp_entry(pte);
> 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> }
>
> As long as __swp_type() and __swp_offset() don't let A/D through, then
> we should be OK.  This site is the only call to __pte_to_swp_entry()
> that I can find in the entire codebase.
>
> Or am I missing something?

Given that __pte_to_swp_entry on x86_64 is just pte_val or pte.pte, it
does no filtering.  Similarly, __swp_type(arch_entry) is a >> and
swp_entry(type, ...) is a << of what appears to be the same amount
for the swap type.

So any corruption in the upper bits of the pte will be preserved as a
swap type.

In fact I strongly suspect that the compiler can optimize out all of the
work done by "swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry))".
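
To make that concrete, a simplified model of the decode side (not the
literal macros, which live in arch/x86/include/asm/pgtable_64.h):

	#define __pte_to_swp_entry(pte)	((swp_entry_t) { pte_val(pte) }) /* no masking */
	#define __swp_type(x)		(((x).val >> 1) & 0x1f)	/* pte bits 1..5 */

With the type field sitting at pte bits 1..5 like that, a stray
Accessed bit (pte bit 5) lands squarely inside the swap type.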

Eric

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro
  2016-07-01 16:46       ` Borislav Petkov
@ 2016-07-03 14:36         ` Andy Lutomirski
  2016-07-03 18:44           ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Lutomirski @ 2016-07-03 14:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Michal Hocko, X86 ML, linux-mm, stable,
	Andrew Morton, linux-kernel, Dave Hansen, Linus Torvalds,
	Andi Kleen

On Jul 1, 2016 9:47 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Fri, Jul 01, 2016 at 09:30:37AM -0700, Andy Lutomirski wrote:
> > I put the ifdef there to prevent anyone from accidentally using it in
> > a 64-bit code path, not to save a bit.  We could put it in the middle of
> > the list to make the mistake much less likely to be repeated, I
> > suppose.
>
> Well, if someone does, someone will notice pretty soon, no?

Dunno.  ESPFIX was broken under KVM for years and no one noticed.

>
> I just don't see the reason to worry but maybe I'm missing it.
>
> And we can call it X86_BUG_ESPFIX_X86_32 or so too...

We could do that, too, I guess.  But the current solution is only two
extra lines of code.  We could reorder things so that it's in the
middle instead of at the end, I suppose.

--Andy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
  2016-07-01  3:06     ` Brian Gerst
@ 2016-07-03 17:10       ` Dave Hansen
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2016-07-03 17:10 UTC (permalink / raw)
  To: Brian Gerst, Linus Torvalds
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers, linux-mm,
	Andrew Morton, Borislav Petkov, Andi Kleen, Michal Hocko,
	Dave Hansen

On 06/30/2016 08:06 PM, Brian Gerst wrote:
>> > It's not like anybody will ever care about 32-bit page tables on
>> > Knights Landing anyway.
> Could this affect a 32-bit guest VM?

This isn't about 32-bit *mode*.  It's about using the 32-bit 2-level
_paging_ mode, which supports only 4GB of virtual and 4GB of physical
addresses.  That mode also doesn't support the No-eXecute (NX) bit,
which basically
everyone needs today for its security benefits.

Even the little Quark CPU supports PAE (64-bit page tables).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro
  2016-07-03 14:36         ` Andy Lutomirski
@ 2016-07-03 18:44           ` Borislav Petkov
  0 siblings, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2016-07-03 18:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Michal Hocko, X86 ML, linux-mm, stable,
	Andrew Morton, linux-kernel, Dave Hansen, Linus Torvalds,
	Andi Kleen

On Sun, Jul 03, 2016 at 07:36:30AM -0700, Andy Lutomirski wrote:
> Dunno.  ESPFIX was broken under KVM for years and no one noticed.

Ah, so this really was the case already. :-\

> We could do that, too, I guess.  But the current solution is only two
> extra lines of code.  We could reorder things so that it's in the
> middle instead of at the end, I suppose.

Yeah, sounds good. Especially if it was already broken - I missed that
fact.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-07-03 18:44 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2016-07-01  0:12 [PATCH 0/6] [v3] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
2016-07-01  0:12 ` [PATCH 1/6] x86: fix duplicated X86_BUG(9) macro Dave Hansen
2016-07-01  9:23   ` Borislav Petkov
2016-07-01 16:30     ` Andy Lutomirski
2016-07-01 16:46       ` Borislav Petkov
2016-07-03 14:36         ` Andy Lutomirski
2016-07-03 18:44           ` Borislav Petkov
2016-07-01  0:12 ` [PATCH 2/6] mm, tlb: add mmu_gather->saw_unset_a_or_d Dave Hansen
2016-07-01  0:12 ` [PATCH 3/6] mm: add force_batch_flush to mmu_gather Dave Hansen
2016-07-01  0:12 ` [PATCH 4/6] mm: move flush in madvise_free_pte_range() Dave Hansen
2016-07-01  0:12 ` [PATCH 5/6] mm: make tlb_flush_mmu_tlbonly() return whether it flushed Dave Hansen
2016-07-01  0:12 ` [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs Dave Hansen
2016-07-01  1:50   ` Nadav Amit
2016-07-01  1:54     ` Dave Hansen
2016-07-01  2:55   ` Linus Torvalds
2016-07-01  3:06     ` Brian Gerst
2016-07-03 17:10       ` Dave Hansen
2016-07-01  4:39     ` Dave Hansen
2016-07-01  5:43       ` Linus Torvalds
2016-07-01 14:25         ` Eric W. Biederman
2016-07-01 15:51           ` Dave Hansen
2016-07-01 18:12             ` Eric W. Biederman
2016-07-01 16:07       ` Linus Torvalds
2016-07-01 16:14         ` Dave Hansen
2016-07-01 16:25           ` Linus Torvalds
