linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
@ 2016-07-01 17:46 Dave Hansen
  2016-07-01 17:47 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Dave Hansen @ 2016-07-01 17:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen

This is very lightly tested.  I haven't even run it on the affected
hardware.  Just sending it quickly in case someone can easily see
something fatally wrong with it.

This seems a lot less fragile than the previous patches that relied
on TLB flushing.  Those seemed like it would be easy to add new code
that hit this issue.

The new approach seems like it'll be harder to break.  The most
likely thing to break would be someone looking for a zero pte_val()
and seeing a stray bit.  But that seems like a better alternative
than the nastiness that could happen if one of these bits *is*
considered when reading a swap PTE.

--

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

This patch changes the kernel to attempt to ignore those stray bits
when they get set.  We do this by making our swap PTE format
completely ignore the A/D bits, and also by ignoring them in our
pte_none() checks.

Andi Kleen wrote the original version of this patch.  Dave Hansen
wrote the later ones.

v4: complete rework: let the bad bits stay around, but try to
    ignore them
v3: huge rework to keep batching working in unmap case
v2: out of line. avoid single thread flush. cover more clear
    cases

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum
  2016-07-01 17:46 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
@ 2016-07-01 17:47 ` Dave Hansen
  2016-07-01 17:47 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2016-07-01 17:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen


This erratum can result in Accessed/Dirty getting set by the hardware
when we do not expect them to be (on !Present PTEs).

Instead of trying to fix them up after this happens, we just
allow the bits to get set and try to ignore them.  We do this by
shifting the layout of the bits we use for swap offset/type in
our 64-bit PTEs.

It looks like this:

bitnrs: |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
names:  |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
 after: | OFFSET (14-63)  |  TYPE (9-13) |0|X|X|X| X| X|X|X|0|

Note that D was already a don't care (X) even before.  We just
move TYPE up and turn its old spot (which could be hit by the
A bit) into all don't cares.

We take 5 bits away from the offset, but that still leaves us
with 50 bits which lets us index into a 62-bit swapfile (4 EiB).
I think that's probably fine for the moment.  We could
theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it
doesn't gain us anything.

---

 b/arch/x86/include/asm/pgtable_64.h |   26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/pgtable_64.h~knl-strays-10-move-swp-pte-bits arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~knl-strays-10-move-swp-pte-bits	2016-07-01 10:42:06.511754330 -0700
+++ b/arch/x86/include/asm/pgtable_64.h	2016-07-01 10:42:06.514754467 -0700
@@ -140,18 +140,32 @@ static inline int pgd_large(pgd_t pgd) {
 #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
 #define pte_unmap(pte) ((void)(pte))/* NOP */
 
-/* Encode and de-code a swap entry */
+/*
+ * Encode and de-code a swap entry
+ *
+ * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
+ * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (10-13) |0|X|X|X| X| X|X|X|0| <- swp entry
+ *
+ * G (8) is aliased and used as a PROT_NONE indicator for
+ * !present ptes.  We need to start storing swap entries above
+ * there.  We also need to avoid using A and D because of an
+ * erratum where they can be incorrectly set by hardware on
+ * non-present PTEs.
+ */
+#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
 #define SWP_TYPE_BITS 5
-#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
+/* Place the offset above the type: */
+#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS + 1)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)			(((x).val >> (_PAGE_BIT_PRESENT + 1)) \
+#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
 					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_SHIFT)
+#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
-					 ((type) << (_PAGE_BIT_PRESENT + 1)) \
-					 | ((offset) << SWP_OFFSET_SHIFT) })
+					 ((type) << (SWP_TYPE_FIRST_BIT)) \
+					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
_

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none()
  2016-07-01 17:46 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
  2016-07-01 17:47 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
@ 2016-07-01 17:47 ` Dave Hansen
  2016-07-01 17:47 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2016-07-01 17:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen


The erratum we are fixing here can lead to stray setting of the
A and D bits.  That means that a pte that we cleared might
suddenly have A/D set.  So, stop considering those bits when
determining if a pte is pte_none().  The same goes for the
other pmd_none() and pud_none().  pgd_none() can be skipped
because it is not affected; we do not use PGD entries for
anything other than pagetables on affected configurations.

This adds a tiny amount of overhead to all pte_none() checks.
I doubt we'll be able to measure it anywhere.

---

 b/arch/x86/include/asm/pgtable.h       |   13 ++++++++++---
 b/arch/x86/include/asm/pgtable_types.h |    6 ++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/pgtable.h~knl-strays-20-mod-pte-none arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~knl-strays-20-mod-pte-none	2016-07-01 10:42:06.895771764 -0700
+++ b/arch/x86/include/asm/pgtable.h	2016-07-01 10:42:06.900771991 -0700
@@ -480,7 +480,7 @@ pte_t *populate_extra_pte(unsigned long
 
 static inline int pte_none(pte_t pte)
 {
-	return !pte.pte;
+	return !(pte.pte & ~(_PAGE_KNL_ERRATUM_MASK));
 }
 
 #define __HAVE_ARCH_PTE_SAME
@@ -552,7 +552,8 @@ static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
 	   out of sync with upper half. */
-	return (unsigned long)native_pmd_val(pmd) == 0;
+	unsigned long val = native_pmd_val(pmd);
+	return (val & ~_PAGE_KNL_ERRATUM_MASK) == 0;
 }
 
 static inline unsigned long pmd_page_vaddr(pmd_t pmd)
@@ -616,7 +617,7 @@ static inline unsigned long pages_to_mb(
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline int pud_none(pud_t pud)
 {
-	return native_pud_val(pud) == 0;
+	return (native_pud_val(pud) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
 }
 
 static inline int pud_present(pud_t pud)
@@ -694,6 +695,12 @@ static inline int pgd_bad(pgd_t pgd)
 
 static inline int pgd_none(pgd_t pgd)
 {
+	/*
+	 * There is no need to do a workaround for the KNL stray
+	 * A/D bit erratum here.  PGDs only point to page tables
+	 * except on 32-bit non-PAE which is not supported on
+	 * KNL.
+	 */
 	return !native_pgd_val(pgd);
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
diff -puN arch/x86/include/asm/pgtable_types.h~knl-strays-20-mod-pte-none arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~knl-strays-20-mod-pte-none	2016-07-01 10:42:06.896771809 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2016-07-01 10:42:06.900771991 -0700
@@ -70,6 +70,12 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#else
+#define _PAGE_KNL_ERRATUM_MASK 0
+#endif
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
_

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum
  2016-07-01 17:46 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
  2016-07-01 17:47 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
  2016-07-01 17:47 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
@ 2016-07-01 17:47 ` Dave Hansen
  2016-07-01 17:47 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
  2016-07-01 22:28 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Benjamin Herrenschmidt
  4 siblings, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2016-07-01 17:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen


The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

When we have 64-bit PTEs (64-bit mode or 32-bit PAE), we were able
to move the swap PTE format around to avoid these troublesome bits.
But, 32-bit non-PAE is tight on bits.  So, disallow it from running
on this hardware.  I can't imagine anyone wanting to run 32-bit
on this hardware, but this is the safe thing to do.

---

 b/arch/x86/boot/boot.h     |    1 +
 b/arch/x86/boot/cpu.c      |    2 ++
 b/arch/x86/boot/cpucheck.c |   32 ++++++++++++++++++++++++++++++++
 b/arch/x86/boot/cpuflags.c |    1 +
 b/arch/x86/boot/cpuflags.h |    1 +
 5 files changed, 37 insertions(+)

diff -puN arch/x86/boot/boot.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/boot.h
--- a/arch/x86/boot/boot.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-01 10:42:07.302790241 -0700
+++ b/arch/x86/boot/boot.h	2016-07-01 10:42:07.311790650 -0700
@@ -295,6 +295,7 @@ static inline int cmdline_find_option_bo
 
 /* cpu.c, cpucheck.c */
 int check_cpu(int *cpu_level_ptr, int *req_level_ptr, u32 **err_flags_ptr);
+int check_knl_erratum(void);
 int validate_cpu(void);
 
 /* early_serial_console.c */
diff -puN arch/x86/boot/cpu.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpu.c
--- a/arch/x86/boot/cpu.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-01 10:42:07.303790286 -0700
+++ b/arch/x86/boot/cpu.c	2016-07-01 10:42:07.312790695 -0700
@@ -93,6 +93,8 @@ int validate_cpu(void)
 		show_cap_strs(err_flags);
 		putchar('\n');
 		return -1;
+	} else if (check_knl_erratum()) {
+		return -1;
 	} else {
 		return 0;
 	}
diff -puN arch/x86/boot/cpucheck.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpucheck.c
--- a/arch/x86/boot/cpucheck.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-01 10:42:07.305790377 -0700
+++ b/arch/x86/boot/cpucheck.c	2016-07-01 10:42:07.312790695 -0700
@@ -24,6 +24,7 @@
 # include "boot.h"
 #endif
 #include <linux/types.h>
+#include <asm/intel-family.h>
 #include <asm/processor-flags.h>
 #include <asm/required-features.h>
 #include <asm/msr-index.h>
@@ -175,6 +176,8 @@ int check_cpu(int *cpu_level_ptr, int *r
 			puts("WARNING: PAE disabled. Use parameter 'forcepae' to enable at your own risk!\n");
 		}
 	}
+	if (!err)
+		err = check_knl_erratum();
 
 	if (err_flags_ptr)
 		*err_flags_ptr = err ? err_flags : NULL;
@@ -185,3 +188,32 @@ int check_cpu(int *cpu_level_ptr, int *r
 
 	return (cpu.level < req_level || err) ? -1 : 0;
 }
+
+int check_knl_erratum(void)
+{
+	/*
+	 * First check for the affected model/family:
+	 */
+	if (!is_intel() ||
+	    cpu.family != 6 ||
+	    cpu.model != INTEL_FAM6_XEON_PHI_KNL)
+		return 0;
+
+	/*
+	 * This erratum affects the Accessed/Dirty bits, and can
+	 * cause stray bits to be set in !Present PTEs.  We have
+	 * enough bits in our 64-bit PTEs (which we have on real
+	 * 64-bit mode or PAE) to avoid using these troublesome
+	 * bits.  But, we do not have enough soace in our 32-bit
+	 * PTEs.  So, refuse to run on 32-bit non-PAE kernels.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+		return 0;
+
+	puts("This 32-bit kernel can not run on this processor due\n"
+	     "to a processor erratum.  Use a 64-bit kernel, or PAE.\n\n");
+
+	return -1;
+}
+
+
diff -puN arch/x86/boot/cpuflags.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpuflags.c
--- a/arch/x86/boot/cpuflags.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-01 10:42:07.307790468 -0700
+++ b/arch/x86/boot/cpuflags.c	2016-07-01 10:42:07.312790695 -0700
@@ -102,6 +102,7 @@ void get_cpuflags(void)
 			cpuid(0x1, &tfms, &ignored, &cpu.flags[4],
 			      &cpu.flags[0]);
 			cpu.level = (tfms >> 8) & 15;
+			cpu.family = cpu.level;
 			cpu.model = (tfms >> 4) & 15;
 			if (cpu.level >= 6)
 				cpu.model += ((tfms >> 16) & 0xf) << 4;
diff -puN arch/x86/boot/cpuflags.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpuflags.h
--- a/arch/x86/boot/cpuflags.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-01 10:42:07.308790514 -0700
+++ b/arch/x86/boot/cpuflags.h	2016-07-01 10:42:07.313790740 -0700
@@ -6,6 +6,7 @@
 
 struct cpu_features {
 	int level;		/* Family, or 64 for x86-64 */
+	int family;		/* Family, always */
 	int model;
 	u32 flags[NCAPINTS];
 };
_

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-01 17:46 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (2 preceding siblings ...)
  2016-07-01 17:47 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
@ 2016-07-01 17:47 ` Dave Hansen
  2016-07-01 22:28 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Benjamin Herrenschmidt
  4 siblings, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2016-07-01 17:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen


The page table manipulation code seems to have grown a couple of
sites that are looking for empty PTEs.  Just in case one of these
entries got a stray bit set, use pte_none() instead of checking
for a zero pte_val().

The use pte_same() makes me a bit nervous.  If we were doing a
pte_same() check against two cleared entries and one of them had
a stray bit set, it might fail the pte_same() check.  But, I
don't think we ever _do_ pte_same() for cleared entries.  It is
almost entirely used for checking for races in fault-in paths.

---

 b/arch/x86/mm/init_64.c    |   12 ++++++------
 b/arch/x86/mm/pageattr.c   |    2 +-
 b/arch/x86/mm/pgtable_32.c |    2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff -puN arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups	2016-07-01 10:42:07.781811987 -0700
+++ b/arch/x86/mm/init_64.c	2016-07-01 10:42:07.788812305 -0700
@@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned
 		 * pagetable pages as RO. So assume someone who pre-setup
 		 * these mappings are more intelligent.
 		 */
-		if (pte_val(*pte)) {
+		if (!pte_none(*pte)) {
 			if (!after_bootmem)
 				pages++;
 			continue;
@@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
 			continue;
 		}
 
-		if (pmd_val(*pmd)) {
+		if (!pmd_none(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
 				pte = (pte_t *)pmd_page_vaddr(*pmd);
@@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned
 			continue;
 		}
 
-		if (pud_val(*pud)) {
+		if (!pud_none(*pud)) {
 			if (!pud_large(*pud)) {
 				pmd = pmd_offset(pud, 0);
 				last_map_addr = phys_pmd_init(pmd, addr, end,
@@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte
 
 	for (i = 0; i < PTRS_PER_PTE; i++) {
 		pte = pte_start + i;
-		if (pte_val(*pte))
+		if (!pte_none(*pte))
 			return;
 	}
 
@@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		pmd = pmd_start + i;
-		if (pmd_val(*pmd))
+		if (!pmd_none(*pmd))
 			return;
 	}
 
@@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud
 
 	for (i = 0; i < PTRS_PER_PUD; i++) {
 		pud = pud_start + i;
-		if (pud_val(*pud))
+		if (!pud_none(*pud))
 			return false;
 	}
 
diff -puN arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups	2016-07-01 10:42:07.783812078 -0700
+++ b/arch/x86/mm/pageattr.c	2016-07-01 10:42:07.789812350 -0700
@@ -1185,7 +1185,7 @@ repeat:
 		return __cpa_process_fault(cpa, address, primary);
 
 	old_pte = *kpte;
-	if (!pte_val(old_pte))
+	if (pte_none(old_pte))
 		return __cpa_process_fault(cpa, address, primary);
 
 	if (level == PG_LEVEL_4K) {
diff -puN arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups	2016-07-01 10:42:07.785812169 -0700
+++ b/arch/x86/mm/pgtable_32.c	2016-07-01 10:42:07.789812350 -0700
@@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr,
 		return;
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (pte_val(pteval))
+	if (!pte_none(pteval))
 		set_pte_at(&init_mm, vaddr, pte, pteval);
 	else
 		pte_clear(&init_mm, vaddr, pte);
_

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
  2016-07-01 17:46 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (3 preceding siblings ...)
  2016-07-01 17:47 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
@ 2016-07-01 22:28 ` Benjamin Herrenschmidt
  2016-07-13 11:37   ` Vlastimil Babka
  4 siblings, 1 reply; 11+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-01 22:28 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko

On Fri, 2016-07-01 at 10:46 -0700, Dave Hansen wrote:
> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
> Landing) has an erratum where a processor thread setting the Accessed
> or Dirty bits may not do so atomically against its checks for the
> Present bit.  This may cause a thread (which is about to page fault)
> to set A and/or D, even though the Present bit had already been
> atomically cleared.

Interesting.... I always wondered where in the Intel docs did it specify
that present was tested atomically with setting of A and D ... I couldn't
find it.

Isn't there a more fundamental issue however that you may actually lose
those bits ? For example if we do an munmap, in zap_pte_range()

We first exchange all the PTEs with 0 with ptep_get_and_clear_full()
and we then transfer D that we just read into the struct page.

We rely on the fact that D will never be set again, what we go it a
"final" D bit. IE. We rely on the fact that a processor either:

   - Has a cached PTE in its TLB with D set, in which case it can still
write to the page until we flush the TLB or

   - Doesn't have a cached PTE in its TLB with D set and so will fail
to do so due to the atomic P check, thus never writing.

With the errata, don't you have a situation where a processor in the second
category will write and set D despite P having been cleared (due to the
race) and thus causing us to miss the transfer of that D to the struct
page and essentially completely miss that the physical page is dirty ?

(Leading to memory corruption).

> If the PTE is used for storing a swap index or a NUMA migration index,
> the A bit could be misinterpreted as part of the swap type.  The stray
> bits being set cause a software-cleared PTE to be interpreted as a
> swap entry.  In some cases (like when the swap index ends up being
> for a non-existent swapfile), the kernel detects the stray value
> and WARN()s about it, but there is no guarantee that the kernel can
> always detect it.
> 
> This patch changes the kernel to attempt to ignore those stray bits
> when they get set.  We do this by making our swap PTE format
> completely ignore the A/D bits, and also by ignoring them in our
> pte_none() checks.
> 
> Andi Kleen wrote the original version of this patch.  Dave Hansen
> wrote the later ones.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
  2016-07-01 22:28 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Benjamin Herrenschmidt
@ 2016-07-13 11:37   ` Vlastimil Babka
  2016-07-13 12:10     ` Vlastimil Babka
  2016-07-13 14:04     ` Dave Hansen
  0 siblings, 2 replies; 11+ messages in thread
From: Vlastimil Babka @ 2016-07-13 11:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko

On 07/02/2016 12:28 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2016-07-01 at 10:46 -0700, Dave Hansen wrote:
>> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
>> Landing) has an erratum where a processor thread setting the Accessed
>> or Dirty bits may not do so atomically against its checks for the
>> Present bit.  This may cause a thread (which is about to page fault)
>> to set A and/or D, even though the Present bit had already been
>> atomically cleared.
>
> Interesting.... I always wondered where in the Intel docs did it specify
> that present was tested atomically with setting of A and D ... I couldn't
> find it.
>
> Isn't there a more fundamental issue however that you may actually lose
> those bits ? For example if we do an munmap, in zap_pte_range()
>
> We first exchange all the PTEs with 0 with ptep_get_and_clear_full()
> and we then transfer D that we just read into the struct page.
>
> We rely on the fact that D will never be set again, what we go it a
> "final" D bit. IE. We rely on the fact that a processor either:
>
>    - Has a cached PTE in its TLB with D set, in which case it can still
> write to the page until we flush the TLB or
>
>    - Doesn't have a cached PTE in its TLB with D set and so will fail
> to do so due to the atomic P check, thus never writing.
>
> With the errata, don't you have a situation where a processor in the second
> category will write and set D despite P having been cleared (due to the
> race) and thus causing us to miss the transfer of that D to the struct
> page and essentially completely miss that the physical page is dirty ?

Seems to me like this is indeed possible, but...

> (Leading to memory corruption).

... what memory corruption, exactly? If a process is writing to its 
memory from one thread and unmapping it from other thread at the same 
time, there are no guarantees anyway? Would anything sensible rely on 
the guarantee that if the write in such racy scenario didn't end up as a 
segfault (i.e. unmapping was faster), then it must hit the disk? Or are 
there any other scenarios where zap_pte_range() is called? Hmm, but how 
does this affect the page migration scenario, can we lose the D bit there?

And maybe related thing that just occured to me, what if page is made 
non-writable during fork() to catch COW? Any race in that one, or just 
the P bit? But maybe the argument would be the same as above...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
  2016-07-13 11:37   ` Vlastimil Babka
@ 2016-07-13 12:10     ` Vlastimil Babka
  2016-07-13 14:04     ` Dave Hansen
  1 sibling, 0 replies; 11+ messages in thread
From: Vlastimil Babka @ 2016-07-13 12:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko

On 07/13/2016 01:37 PM, Vlastimil Babka wrote:
>> > With the errata, don't you have a situation where a processor in the second
>> > category will write and set D despite P having been cleared (due to the
>> > race) and thus causing us to miss the transfer of that D to the struct
>> > page and essentially completely miss that the physical page is dirty ?
> Seems to me like this is indeed possible, but...

Nevermind, I have read the v3 thread now, where Dave says [1] that 
setting the D bit due to the erratum doesn't mean that the page is 
really actually written to (it's not). So there shouldn't be any true 
dirty bit to leave behind.

[1] http://marc.info/?l=linux-mm&m=146738965614826&w=2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
  2016-07-13 11:37   ` Vlastimil Babka
  2016-07-13 12:10     ` Vlastimil Babka
@ 2016-07-13 14:04     ` Dave Hansen
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2016-07-13 14:04 UTC (permalink / raw)
  To: Vlastimil Babka, Benjamin Herrenschmidt, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko

On 07/13/2016 04:37 AM, Vlastimil Babka wrote:
> On 07/02/2016 12:28 AM, Benjamin Herrenschmidt wrote:
>> With the errata, don't you have a situation where a processor in
>> the second category will write and set D despite P having been
>> cleared (due to the race) and thus causing us to miss the transfer
>> of that D to the struct
>> page and essentially completely miss that the physical page is dirty ?
> 
> Seems to me like this is indeed possible, but...

No, this isn't possible with the erratum.

I had some off-list follow up with Ben, and included this description in
the later post of the patch:
> These bits are truly "stray".  In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page.  This means that we do not have to launder the bit(s); we
> can simply ignore them.


>> (Leading to memory corruption).
> 
> ... what memory corruption, exactly?

In this (non-existent) scenario, we would lose writes to mmap()'d files
because we did not see the dirty bit during the "get" part of
ptep_get_and_clear().

> If a process is writing to its
> memory from one thread and unmapping it from other thread at the same
> time, there are no guarantees anyway?

It's not just unmapping, it's also swap, NUMA migration, etc...  We
clear the PTE, flush, then re-populate it.

> Would anything sensible rely on
> the guarantee that if the write in such racy scenario didn't end up as a
> segfault (i.e. unmapping was faster), then it must hit the disk? Or are
> there any other scenarios where zap_pte_range() is called? Hmm, but how
> does this affect the page migration scenario, can we lose the D bit there?

Yeah, it's not just zap_pte_range(), it's everywhere that we change a
present PTE.

> And maybe related thing that just occured to me, what if page is made
> non-writable during fork() to catch COW? Any race in that one, or just
> the P bit? But maybe the argument would be the same as above...

Yeah, the argument is the same.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
  2016-07-08  0:19 Dave Hansen
@ 2016-07-13  9:54 ` Vlastimil Babka
  0 siblings, 0 replies; 11+ messages in thread
From: Vlastimil Babka @ 2016-07-13  9:54 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Benjamin Herrenschmidt

On 07/08/2016 02:19 AM, Dave Hansen wrote:
> This patch survived a bunch of testing over the past week, including
> on hardware affected by the issue.  A debugging patch showed the
> "stray" bits being set, and no ill effects were noticed.
>
> Barring any heartburn from folks, I think this is ready for the tip
> tree.

I don't see any answer to Benjamin's question on the previous version?
https://lkml.org/lkml/2016/7/1/703

> --
>
> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
> Landing) has an erratum where a processor thread setting the Accessed
> or Dirty bits may not do so atomically against its checks for the
> Present bit.  This may cause a thread (which is about to page fault)
> to set A and/or D, even though the Present bit had already been
> atomically cleared.
>
> These bits are truly "stray".  In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page.  This means that we do not have to launder the bit(s); we
> can simply ignore them.
>
> More details can be found in the "Specification Update" under "KNL4":
>
> 	http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-phi-processor-specification-update.pdf
>
> If the PTE is used for storing a swap index or a NUMA migration index,
> the A bit could be misinterpreted as part of the swap type.  The stray
> bits being set cause a software-cleared PTE to be interpreted as a
> swap entry.  In some cases (like when the swap index ends up being
> for a non-existent swapfile), the kernel detects the stray value
> and WARN()s about it, but there is no guarantee that the kernel can
> always detect it.
>
> This patch changes the kernel to attempt to ignore those stray bits
> when they get set.  We do this by making our swap PTE format
> completely ignore the A/D bits, and also by ignoring them in our
> pte_none() checks.
>
> Andi Kleen wrote the original version of this patch.  Dave Hansen
> wrote the later ones.
>
> v4: complete rework: let the bad bits stay around, but try to
>     ignore them
> v3: huge rework to keep batching working in unmap case
> v2: out of line. avoid single thread flush. cover more clear
>     cases
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
@ 2016-07-08  0:19 Dave Hansen
  2016-07-13  9:54 ` Vlastimil Babka
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2016-07-08  0:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen, Dave Hansen

This patch survived a bunch of testing over the past week, including
on hardware affected by the issue.  A debugging patch showed the
"stray" bits being set, and no ill effects were noticed.

Barring any heartburn from folks, I think this is ready for the tip
tree.

--

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

These bits are truly "stray".  In the case of the Dirty bit, the
thread associated with the stray set was *not* allowed to write to
the page.  This means that we do not have to launder the bit(s); we
can simply ignore them.

More details can be found in the "Specification Update" under "KNL4":

	http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-phi-processor-specification-update.pdf

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

This patch changes the kernel to attempt to ignore those stray bits
when they get set.  We do this by making our swap PTE format
completely ignore the A/D bits, and also by ignoring them in our
pte_none() checks.

Andi Kleen wrote the original version of this patch.  Dave Hansen
wrote the later ones.

v4: complete rework: let the bad bits stay around, but try to
    ignore them
v3: huge rework to keep batching working in unmap case
v2: out of line. avoid single thread flush. cover more clear
    cases

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-07-13 14:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-01 17:46 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
2016-07-01 17:47 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
2016-07-01 17:47 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
2016-07-01 17:47 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
2016-07-01 17:47 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
2016-07-01 22:28 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Benjamin Herrenschmidt
2016-07-13 11:37   ` Vlastimil Babka
2016-07-13 12:10     ` Vlastimil Babka
2016-07-13 14:04     ` Dave Hansen
2016-07-08  0:19 Dave Hansen
2016-07-13  9:54 ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).