linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
@ 2016-07-08  0:19 Dave Hansen
  2016-07-08  0:19 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Dave Hansen @ 2016-07-08  0:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen, Dave Hansen

This patch survived a bunch of testing over the past week, including
on hardware affected by the issue.  A debugging patch showed the
"stray" bits being set, and no ill effects were noticed.

Barring any heartburn from folks, I think this is ready for the tip
tree.

--

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

These bits are truly "stray".  In the case of the Dirty bit, the
thread associated with the stray set was *not* allowed to write to
the page.  This means that we do not have to launder the bit(s); we
can simply ignore them.

More details can be found in the "Specification Update" under "KNL4":

	http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-phi-processor-specification-update.pdf

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

This patch changes the kernel to attempt to ignore those stray bits
when they get set.  We do this by making our swap PTE format
completely ignore the A/D bits, and also by ignoring them in our
pte_none() checks.

Andi Kleen wrote the original version of this patch.  Dave Hansen
wrote the later ones.

v4: complete rework: let the bad bits stay around, but try to
    ignore them
v3: huge rework to keep batching working in unmap case
v2: out of line. avoid single thread flush. cover more clear
    cases

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum
  2016-07-08  0:19 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
@ 2016-07-08  0:19 ` Dave Hansen
  2016-07-13  8:03   ` [tip:x86/mm] x86/mm: Move " tip-bot for Dave Hansen
  2016-07-13 15:19   ` [PATCH 1/4] x86, swap: move " Michal Hocko
  2016-07-08  0:19 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2016-07-08  0:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This erratum can result in Accessed/Dirty getting set by the hardware
when we do not expect them to be (on !Present PTEs).

Instead of trying to fix them up after this happens, we just
allow the bits to get set and try to ignore them.  We do this by
shifting the layout of the bits we use for swap offset/type in
our 64-bit PTEs.

It looks like this:

bitnrs: |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
names:  |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
 after: | OFFSET (14-63)  |  TYPE (9-13) |0|X|X|X| X| X|X|X|0|

Note that D was already a don't care (X) even before.  We just
move TYPE up and turn its old spot (which could be hit by the
A bit) into all don't cares.

We take 5 bits away from the offset, but that still leaves us
with 50 bits which lets us index into a 62-bit swapfile (4 EiB).
I think that's probably fine for the moment.  We could
theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it
doesn't gain us anything.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pgtable_64.h |   26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/pgtable_64.h~knl-strays-10-move-swp-pte-bits arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~knl-strays-10-move-swp-pte-bits	2016-07-07 17:17:43.556746185 -0700
+++ b/arch/x86/include/asm/pgtable_64.h	2016-07-07 17:17:43.559746319 -0700
@@ -140,18 +140,32 @@ static inline int pgd_large(pgd_t pgd) {
 #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
 #define pte_unmap(pte) ((void)(pte))/* NOP */
 
-/* Encode and de-code a swap entry */
+/*
+ * Encode and de-code a swap entry
+ *
+ * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
+ * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (10-13) |0|X|X|X| X| X|X|X|0| <- swp entry
+ *
+ * G (8) is aliased and used as a PROT_NONE indicator for
+ * !present ptes.  We need to start storing swap entries above
+ * there.  We also need to avoid using A and D because of an
+ * erratum where they can be incorrectly set by hardware on
+ * non-present PTEs.
+ */
+#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
 #define SWP_TYPE_BITS 5
-#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
+/* Place the offset above the type: */
+#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS + 1)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)			(((x).val >> (_PAGE_BIT_PRESENT + 1)) \
+#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
 					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_SHIFT)
+#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
-					 ((type) << (_PAGE_BIT_PRESENT + 1)) \
-					 | ((offset) << SWP_OFFSET_SHIFT) })
+					 ((type) << (SWP_TYPE_FIRST_BIT)) \
+					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none()
  2016-07-08  0:19 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
  2016-07-08  0:19 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
@ 2016-07-08  0:19 ` Dave Hansen
  2016-07-13  8:03   ` [tip:x86/mm] x86/mm: Ignore " tip-bot for Dave Hansen
  2016-07-13 15:21   ` [PATCH 2/4] x86, pagetable: ignore " Michal Hocko
  2016-07-08  0:19 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2016-07-08  0:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The erratum we are fixing here can lead to stray setting of the
A and D bits.  That means that a pte that we cleared might
suddenly have A/D set.  So, stop considering those bits when
determining if a pte is pte_none().  The same goes for the
other pmd_none() and pud_none().  pgd_none() can be skipped
because it is not affected; we do not use PGD entries for
anything other than pagetables on affected configurations.

This adds a tiny amount of overhead to all pte_none() checks.
I doubt we'll be able to measure it anywhere.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pgtable.h       |   13 ++++++++++---
 b/arch/x86/include/asm/pgtable_types.h |    6 ++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/pgtable.h~knl-strays-20-mod-pte-none arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~knl-strays-20-mod-pte-none	2016-07-07 17:17:43.974764976 -0700
+++ b/arch/x86/include/asm/pgtable.h	2016-07-07 17:17:43.980765246 -0700
@@ -480,7 +480,7 @@ pte_t *populate_extra_pte(unsigned long
 
 static inline int pte_none(pte_t pte)
 {
-	return !pte.pte;
+	return !(pte.pte & ~(_PAGE_KNL_ERRATUM_MASK));
 }
 
 #define __HAVE_ARCH_PTE_SAME
@@ -552,7 +552,8 @@ static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
 	   out of sync with upper half. */
-	return (unsigned long)native_pmd_val(pmd) == 0;
+	unsigned long val = native_pmd_val(pmd);
+	return (val & ~_PAGE_KNL_ERRATUM_MASK) == 0;
 }
 
 static inline unsigned long pmd_page_vaddr(pmd_t pmd)
@@ -616,7 +617,7 @@ static inline unsigned long pages_to_mb(
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline int pud_none(pud_t pud)
 {
-	return native_pud_val(pud) == 0;
+	return (native_pud_val(pud) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
 }
 
 static inline int pud_present(pud_t pud)
@@ -694,6 +695,12 @@ static inline int pgd_bad(pgd_t pgd)
 
 static inline int pgd_none(pgd_t pgd)
 {
+	/*
+	 * There is no need to do a workaround for the KNL stray
+	 * A/D bit erratum here.  PGDs only point to page tables
+	 * except on 32-bit non-PAE which is not supported on
+	 * KNL.
+	 */
 	return !native_pgd_val(pgd);
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
diff -puN arch/x86/include/asm/pgtable_types.h~knl-strays-20-mod-pte-none arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~knl-strays-20-mod-pte-none	2016-07-07 17:17:43.976765066 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2016-07-07 17:17:43.980765246 -0700
@@ -70,6 +70,12 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#else
+#define _PAGE_KNL_ERRATUM_MASK 0
+#endif
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum
  2016-07-08  0:19 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
  2016-07-08  0:19 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
  2016-07-08  0:19 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
@ 2016-07-08  0:19 ` Dave Hansen
  2016-07-13  8:04   ` [tip:x86/mm] x86/mm: Disallow " tip-bot for Dave Hansen
  2016-07-08  0:19 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
  2016-07-13  9:54 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Vlastimil Babka
  4 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2016-07-08  0:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

These bits are truly "stray".  In the case of the Dirty bit, the
thread associated with the stray set was *not* allowed to write to
the page.  This means that we do not have to launder the bit(s); we
can simply ignore them.

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

When we have 64-bit PTEs (64-bit mode or 32-bit PAE), we were able
to move the swap PTE format around to avoid these troublesome bits.
But, 32-bit non-PAE is tight on bits.  So, disallow it from running
on this hardware.  I can't imagine anyone wanting to run 32-bit
non-highmem kernels on this hardware, but disallowing them from
running entirely is surely the safe thing to do.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/boot/boot.h     |    1 +
 b/arch/x86/boot/cpu.c      |    2 ++
 b/arch/x86/boot/cpucheck.c |   33 +++++++++++++++++++++++++++++++++
 b/arch/x86/boot/cpuflags.c |    1 +
 b/arch/x86/boot/cpuflags.h |    1 +
 5 files changed, 38 insertions(+)

diff -puN arch/x86/boot/boot.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/boot.h
--- a/arch/x86/boot/boot.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-07 17:17:44.420785026 -0700
+++ b/arch/x86/boot/boot.h	2016-07-07 17:17:44.430785476 -0700
@@ -295,6 +295,7 @@ static inline int cmdline_find_option_bo
 
 /* cpu.c, cpucheck.c */
 int check_cpu(int *cpu_level_ptr, int *req_level_ptr, u32 **err_flags_ptr);
+int check_knl_erratum(void);
 int validate_cpu(void);
 
 /* early_serial_console.c */
diff -puN arch/x86/boot/cpu.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpu.c
--- a/arch/x86/boot/cpu.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-07 17:17:44.422785116 -0700
+++ b/arch/x86/boot/cpu.c	2016-07-07 17:17:44.430785476 -0700
@@ -93,6 +93,8 @@ int validate_cpu(void)
 		show_cap_strs(err_flags);
 		putchar('\n');
 		return -1;
+	} else if (check_knl_erratum()) {
+		return -1;
 	} else {
 		return 0;
 	}
diff -puN arch/x86/boot/cpucheck.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpucheck.c
--- a/arch/x86/boot/cpucheck.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-07 17:17:44.423785161 -0700
+++ b/arch/x86/boot/cpucheck.c	2016-07-07 17:17:44.431785520 -0700
@@ -24,6 +24,7 @@
 # include "boot.h"
 #endif
 #include <linux/types.h>
+#include <asm/intel-family.h>
 #include <asm/processor-flags.h>
 #include <asm/required-features.h>
 #include <asm/msr-index.h>
@@ -175,6 +176,8 @@ int check_cpu(int *cpu_level_ptr, int *r
 			puts("WARNING: PAE disabled. Use parameter 'forcepae' to enable at your own risk!\n");
 		}
 	}
+	if (!err)
+		err = check_knl_erratum();
 
 	if (err_flags_ptr)
 		*err_flags_ptr = err ? err_flags : NULL;
@@ -185,3 +188,33 @@ int check_cpu(int *cpu_level_ptr, int *r
 
 	return (cpu.level < req_level || err) ? -1 : 0;
 }
+
+int check_knl_erratum(void)
+{
+	/*
+	 * First check for the affected model/family:
+	 */
+	if (!is_intel() ||
+	    cpu.family != 6 ||
+	    cpu.model != INTEL_FAM6_XEON_PHI_KNL)
+		return 0;
+
+	/*
+	 * This erratum affects the Accessed/Dirty bits, and can
+	 * cause stray bits to be set in !Present PTEs.  We have
+	 * enough bits in our 64-bit PTEs (which we have on real
+	 * 64-bit mode or PAE) to avoid using these troublesome
+	 * bits.  But, we do not have enough space in our 32-bit
+	 * PTEs.  So, refuse to run on 32-bit non-PAE kernels.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+		return 0;
+
+	puts("This 32-bit kernel can not run on this Xeon Phi x200\n"
+	     "processor due to a processor erratum.  Use a 64-bit\n"
+	     "kernel, or enable PAE in this 32-bit kernel.\n\n");
+
+	return -1;
+}
+
+
diff -puN arch/x86/boot/cpuflags.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpuflags.c
--- a/arch/x86/boot/cpuflags.c~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-07 17:17:44.425785251 -0700
+++ b/arch/x86/boot/cpuflags.c	2016-07-07 17:17:44.431785520 -0700
@@ -102,6 +102,7 @@ void get_cpuflags(void)
 			cpuid(0x1, &tfms, &ignored, &cpu.flags[4],
 			      &cpu.flags[0]);
 			cpu.level = (tfms >> 8) & 15;
+			cpu.family = cpu.level;
 			cpu.model = (tfms >> 4) & 15;
 			if (cpu.level >= 6)
 				cpu.model += ((tfms >> 16) & 0xf) << 4;
diff -puN arch/x86/boot/cpuflags.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL arch/x86/boot/cpuflags.h
--- a/arch/x86/boot/cpuflags.h~knl-strays-40-disallow-non-PAE-32-bit-on-KNL	2016-07-07 17:17:44.427785341 -0700
+++ b/arch/x86/boot/cpuflags.h	2016-07-07 17:17:44.431785520 -0700
@@ -6,6 +6,7 @@
 
 struct cpu_features {
 	int level;		/* Family, or 64 for x86-64 */
+	int family;		/* Family, always */
 	int model;
 	u32 flags[NCAPINTS];
 };
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-08  0:19 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (2 preceding siblings ...)
  2016-07-08  0:19 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
@ 2016-07-08  0:19 ` Dave Hansen
  2016-07-13  8:04   ` [tip:x86/mm] x86/mm: Use " tip-bot for Dave Hansen
                     ` (2 more replies)
  2016-07-13  9:54 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Vlastimil Babka
  4 siblings, 3 replies; 22+ messages in thread
From: Dave Hansen @ 2016-07-08  0:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The page table manipulation code seems to have grown a couple of
sites that are looking for empty PTEs.  Just in case one of these
entries got a stray bit set, use pte_none() instead of checking
for a zero pte_val().

The use pte_same() makes me a bit nervous.  If we were doing a
pte_same() check against two cleared entries and one of them had
a stray bit set, it might fail the pte_same() check.  But, I
don't think we ever _do_ pte_same() for cleared entries.  It is
almost entirely used for checking for races in fault-in paths.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/init_64.c    |   12 ++++++------
 b/arch/x86/mm/pageattr.c   |    2 +-
 b/arch/x86/mm/pgtable_32.c |    2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff -puN arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.942808493 -0700
+++ b/arch/x86/mm/init_64.c	2016-07-07 17:17:44.949808807 -0700
@@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned
 		 * pagetable pages as RO. So assume someone who pre-setup
 		 * these mappings are more intelligent.
 		 */
-		if (pte_val(*pte)) {
+		if (!pte_none(*pte)) {
 			if (!after_bootmem)
 				pages++;
 			continue;
@@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
 			continue;
 		}
 
-		if (pmd_val(*pmd)) {
+		if (!pmd_none(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
 				pte = (pte_t *)pmd_page_vaddr(*pmd);
@@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned
 			continue;
 		}
 
-		if (pud_val(*pud)) {
+		if (!pud_none(*pud)) {
 			if (!pud_large(*pud)) {
 				pmd = pmd_offset(pud, 0);
 				last_map_addr = phys_pmd_init(pmd, addr, end,
@@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte
 
 	for (i = 0; i < PTRS_PER_PTE; i++) {
 		pte = pte_start + i;
-		if (pte_val(*pte))
+		if (!pte_none(*pte))
 			return;
 	}
 
@@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		pmd = pmd_start + i;
-		if (pmd_val(*pmd))
+		if (!pmd_none(*pmd))
 			return;
 	}
 
@@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud
 
 	for (i = 0; i < PTRS_PER_PUD; i++) {
 		pud = pud_start + i;
-		if (pud_val(*pud))
+		if (!pud_none(*pud))
 			return false;
 	}
 
diff -puN arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.944808582 -0700
+++ b/arch/x86/mm/pageattr.c	2016-07-07 17:17:44.950808852 -0700
@@ -1185,7 +1185,7 @@ repeat:
 		return __cpa_process_fault(cpa, address, primary);
 
 	old_pte = *kpte;
-	if (!pte_val(old_pte))
+	if (pte_none(old_pte))
 		return __cpa_process_fault(cpa, address, primary);
 
 	if (level == PG_LEVEL_4K) {
diff -puN arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.946808672 -0700
+++ b/arch/x86/mm/pgtable_32.c	2016-07-07 17:17:44.950808852 -0700
@@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr,
 		return;
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (pte_val(pteval))
+	if (!pte_none(pteval))
 		set_pte_at(&init_mm, vaddr, pte, pteval);
 	else
 		pte_clear(&init_mm, vaddr, pte);
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:x86/mm] x86/mm: Move swap offset/type up in PTE to work around erratum
  2016-07-08  0:19 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
@ 2016-07-13  8:03   ` tip-bot for Dave Hansen
  2016-07-13 15:19   ` [PATCH 1/4] x86, swap: move " Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Dave Hansen @ 2016-07-13  8:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, dave.hansen, torvalds, hpa, peterz, toshi.kani, akpm, bp,
	dave, jpoimboe, dvlasenk, mcgrof, brgerst, tglx, mingo,
	linux-kernel

Commit-ID:  00839ee3b299303c6a5e26a0a2485427a3afcbbf
Gitweb:     http://git.kernel.org/tip/00839ee3b299303c6a5e26a0a2485427a3afcbbf
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Thu, 7 Jul 2016 17:19:11 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Jul 2016 09:43:25 +0200

x86/mm: Move swap offset/type up in PTE to work around erratum

This erratum can result in Accessed/Dirty getting set by the hardware
when we do not expect them to be (on !Present PTEs).

Instead of trying to fix them up after this happens, we just
allow the bits to get set and try to ignore them.  We do this by
shifting the layout of the bits we use for swap offset/type in
our 64-bit PTEs.

It looks like this:

 bitnrs: |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
 names:  |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
 before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
  after: | OFFSET (14-63)  |  TYPE (9-13) |0|X|X|X| X| X|X|X|0|

Note that D was already a don't care (X) even before.  We just
move TYPE up and turn its old spot (which could be hit by the
A bit) into all don't cares.

We take 5 bits away from the offset, but that still leaves us
with 50 bits which lets us index into a 62-bit swapfile (4 EiB).
I think that's probably fine for the moment.  We could
theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it
doesn't gain us anything.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dave.hansen@intel.com
Cc: linux-mm@kvack.org
Cc: mhocko@suse.com
Link: http://lkml.kernel.org/r/20160708001911.9A3FD2B6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee7811..7e8ec7a 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -140,18 +140,32 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
 #define pte_unmap(pte) ((void)(pte))/* NOP */
 
-/* Encode and de-code a swap entry */
+/*
+ * Encode and de-code a swap entry
+ *
+ * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
+ * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (10-13) |0|X|X|X| X| X|X|X|0| <- swp entry
+ *
+ * G (8) is aliased and used as a PROT_NONE indicator for
+ * !present ptes.  We need to start storing swap entries above
+ * there.  We also need to avoid using A and D because of an
+ * erratum where they can be incorrectly set by hardware on
+ * non-present PTEs.
+ */
+#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
 #define SWP_TYPE_BITS 5
-#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
+/* Place the offset above the type: */
+#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS + 1)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)			(((x).val >> (_PAGE_BIT_PRESENT + 1)) \
+#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
 					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_SHIFT)
+#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
-					 ((type) << (_PAGE_BIT_PRESENT + 1)) \
-					 | ((offset) << SWP_OFFSET_SHIFT) })
+					 ((type) << (SWP_TYPE_FIRST_BIT)) \
+					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:x86/mm] x86/mm: Ignore A/D bits in pte/pmd/pud_none()
  2016-07-08  0:19 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
@ 2016-07-13  8:03   ` tip-bot for Dave Hansen
  2016-07-13 15:21   ` [PATCH 2/4] x86, pagetable: ignore " Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Dave Hansen @ 2016-07-13  8:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, dvlasenk, mingo, dave, dave.hansen, peterz, torvalds,
	jpoimboe, brgerst, toshi.kani, luto, tglx, linux-kernel, mcgrof,
	bp, hpa

Commit-ID:  97e3c602ccbdd7db54e92fe05675c664c052a466
Gitweb:     http://git.kernel.org/tip/97e3c602ccbdd7db54e92fe05675c664c052a466
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Thu, 7 Jul 2016 17:19:12 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Jul 2016 09:43:25 +0200

x86/mm: Ignore A/D bits in pte/pmd/pud_none()

The erratum we are fixing here can lead to stray setting of the
A and D bits.  That means that a pte that we cleared might
suddenly have A/D set.  So, stop considering those bits when
determining if a pte is pte_none().  The same goes for the
other pmd_none() and pud_none().  pgd_none() can be skipped
because it is not affected; we do not use PGD entries for
anything other than pagetables on affected configurations.

This adds a tiny amount of overhead to all pte_none() checks.
I doubt we'll be able to measure it anywhere.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dave.hansen@intel.com
Cc: linux-mm@kvack.org
Cc: mhocko@suse.com
Link: http://lkml.kernel.org/r/20160708001912.5216F89C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h       | 13 ++++++++++---
 arch/x86/include/asm/pgtable_types.h |  6 ++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1a27396..2815d26 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -480,7 +480,7 @@ pte_t *populate_extra_pte(unsigned long vaddr);
 
 static inline int pte_none(pte_t pte)
 {
-	return !pte.pte;
+	return !(pte.pte & ~(_PAGE_KNL_ERRATUM_MASK));
 }
 
 #define __HAVE_ARCH_PTE_SAME
@@ -552,7 +552,8 @@ static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
 	   out of sync with upper half. */
-	return (unsigned long)native_pmd_val(pmd) == 0;
+	unsigned long val = native_pmd_val(pmd);
+	return (val & ~_PAGE_KNL_ERRATUM_MASK) == 0;
 }
 
 static inline unsigned long pmd_page_vaddr(pmd_t pmd)
@@ -616,7 +617,7 @@ static inline unsigned long pages_to_mb(unsigned long npg)
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline int pud_none(pud_t pud)
 {
-	return native_pud_val(pud) == 0;
+	return (native_pud_val(pud) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
 }
 
 static inline int pud_present(pud_t pud)
@@ -694,6 +695,12 @@ static inline int pgd_bad(pgd_t pgd)
 
 static inline int pgd_none(pgd_t pgd)
 {
+	/*
+	 * There is no need to do a workaround for the KNL stray
+	 * A/D bit erratum here.  PGDs only point to page tables
+	 * except on 32-bit non-PAE which is not supported on
+	 * KNL.
+	 */
 	return !native_pgd_val(pgd);
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7b5efe2..d14d0a5 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -70,6 +70,12 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#else
+#define _PAGE_KNL_ERRATUM_MASK 0
+#endif
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:x86/mm] x86/mm: Disallow running with 32-bit PTEs to work around erratum
  2016-07-08  0:19 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
@ 2016-07-13  8:04   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Dave Hansen @ 2016-07-13  8:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jpoimboe, bp, torvalds, brgerst, luto, hpa, dave.hansen,
	linux-kernel, akpm, peterz, mcgrof, tglx, dave, toshi.kani,
	dvlasenk, mingo

Commit-ID:  e4a84be6f05eab4778732d799f63b3cd15427885
Gitweb:     http://git.kernel.org/tip/e4a84be6f05eab4778732d799f63b3cd15427885
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Thu, 7 Jul 2016 17:19:14 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Jul 2016 09:43:25 +0200

x86/mm: Disallow running with 32-bit PTEs to work around erratum

The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit.  This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

These bits are truly "stray".  In the case of the Dirty bit, the
thread associated with the stray set was *not* allowed to write to
the page.  This means that we do not have to launder the bit(s); we
can simply ignore them.

If the PTE is used for storing a swap index or a NUMA migration index,
the A bit could be misinterpreted as part of the swap type.  The stray
bits being set cause a software-cleared PTE to be interpreted as a
swap entry.  In some cases (like when the swap index ends up being
for a non-existent swapfile), the kernel detects the stray value
and WARN()s about it, but there is no guarantee that the kernel can
always detect it.

When we have 64-bit PTEs (64-bit mode or 32-bit PAE), we were able
to move the swap PTE format around to avoid these troublesome bits.
But, 32-bit non-PAE is tight on bits.  So, disallow it from running
on this hardware.  I can't imagine anyone wanting to run 32-bit
non-highmem kernels on this hardware, but disallowing them from
running entirely is surely the safe thing to do.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dave.hansen@intel.com
Cc: linux-mm@kvack.org
Cc: mhocko@suse.com
Link: http://lkml.kernel.org/r/20160708001914.D0B50110@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/boot/boot.h     |  1 +
 arch/x86/boot/cpu.c      |  2 ++
 arch/x86/boot/cpucheck.c | 33 +++++++++++++++++++++++++++++++++
 arch/x86/boot/cpuflags.c |  1 +
 arch/x86/boot/cpuflags.h |  1 +
 5 files changed, 38 insertions(+)

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 9011a88..a5ce666 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -294,6 +294,7 @@ static inline int cmdline_find_option_bool(const char *option)
 
 /* cpu.c, cpucheck.c */
 int check_cpu(int *cpu_level_ptr, int *req_level_ptr, u32 **err_flags_ptr);
+int check_knl_erratum(void);
 int validate_cpu(void);
 
 /* early_serial_console.c */
diff --git a/arch/x86/boot/cpu.c b/arch/x86/boot/cpu.c
index 29207f6..26240dd 100644
--- a/arch/x86/boot/cpu.c
+++ b/arch/x86/boot/cpu.c
@@ -93,6 +93,8 @@ int validate_cpu(void)
 		show_cap_strs(err_flags);
 		putchar('\n');
 		return -1;
+	} else if (check_knl_erratum()) {
+		return -1;
 	} else {
 		return 0;
 	}
diff --git a/arch/x86/boot/cpucheck.c b/arch/x86/boot/cpucheck.c
index 1fd7d57..4ad7d70 100644
--- a/arch/x86/boot/cpucheck.c
+++ b/arch/x86/boot/cpucheck.c
@@ -24,6 +24,7 @@
 # include "boot.h"
 #endif
 #include <linux/types.h>
+#include <asm/intel-family.h>
 #include <asm/processor-flags.h>
 #include <asm/required-features.h>
 #include <asm/msr-index.h>
@@ -175,6 +176,8 @@ int check_cpu(int *cpu_level_ptr, int *req_level_ptr, u32 **err_flags_ptr)
 			puts("WARNING: PAE disabled. Use parameter 'forcepae' to enable at your own risk!\n");
 		}
 	}
+	if (!err)
+		err = check_knl_erratum();
 
 	if (err_flags_ptr)
 		*err_flags_ptr = err ? err_flags : NULL;
@@ -185,3 +188,33 @@ int check_cpu(int *cpu_level_ptr, int *req_level_ptr, u32 **err_flags_ptr)
 
 	return (cpu.level < req_level || err) ? -1 : 0;
 }
+
+int check_knl_erratum(void)
+{
+	/*
+	 * First check for the affected model/family:
+	 */
+	if (!is_intel() ||
+	    cpu.family != 6 ||
+	    cpu.model != INTEL_FAM6_XEON_PHI_KNL)
+		return 0;
+
+	/*
+	 * This erratum affects the Accessed/Dirty bits, and can
+	 * cause stray bits to be set in !Present PTEs.  We have
+	 * enough bits in our 64-bit PTEs (which we have on real
+	 * 64-bit mode or PAE) to avoid using these troublesome
+	 * bits.  But, we do not have enough space in our 32-bit
+	 * PTEs.  So, refuse to run on 32-bit non-PAE kernels.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+		return 0;
+
+	puts("This 32-bit kernel can not run on this Xeon Phi x200\n"
+	     "processor due to a processor erratum.  Use a 64-bit\n"
+	     "kernel, or enable PAE in this 32-bit kernel.\n\n");
+
+	return -1;
+}
+
+
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 431fa5f..6687ab9 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -102,6 +102,7 @@ void get_cpuflags(void)
 			cpuid(0x1, &tfms, &ignored, &cpu.flags[4],
 			      &cpu.flags[0]);
 			cpu.level = (tfms >> 8) & 15;
+			cpu.family = cpu.level;
 			cpu.model = (tfms >> 4) & 15;
 			if (cpu.level >= 6)
 				cpu.model += ((tfms >> 16) & 0xf) << 4;
diff --git a/arch/x86/boot/cpuflags.h b/arch/x86/boot/cpuflags.h
index 4cb404f..15ad56a 100644
--- a/arch/x86/boot/cpuflags.h
+++ b/arch/x86/boot/cpuflags.h
@@ -6,6 +6,7 @@
 
 struct cpu_features {
 	int level;		/* Family, or 64 for x86-64 */
+	int family;		/* Family, always */
 	int model;
 	u32 flags[NCAPINTS];
 };

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:x86/mm] x86/mm: Use pte_none() to test for empty PTE
  2016-07-08  0:19 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
@ 2016-07-13  8:04   ` tip-bot for Dave Hansen
  2016-07-13 15:18   ` [PATCH 4/4] x86: use " Michal Hocko
  2016-07-14 13:47   ` Vlastimil Babka
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Dave Hansen @ 2016-07-13  8:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave.hansen, mingo, mcgrof, hpa, dave, bp, luto, akpm, jpoimboe,
	peterz, torvalds, toshi.kani, dvlasenk, brgerst, tglx,
	linux-kernel

Commit-ID:  dcb32d9913b7ed527b135a7e221f8d14b67bb952
Gitweb:     http://git.kernel.org/tip/dcb32d9913b7ed527b135a7e221f8d14b67bb952
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Thu, 7 Jul 2016 17:19:15 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 13 Jul 2016 09:43:25 +0200

x86/mm: Use pte_none() to test for empty PTE

The page table manipulation code seems to have grown a couple of
sites that are looking for empty PTEs.  Just in case one of these
entries got a stray bit set, use pte_none() instead of checking
for a zero pte_val().

The use pte_same() makes me a bit nervous.  If we were doing a
pte_same() check against two cleared entries and one of them had
a stray bit set, it might fail the pte_same() check.  But, I
don't think we ever _do_ pte_same() for cleared entries.  It is
almost entirely used for checking for races in fault-in paths.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dave.hansen@intel.com
Cc: linux-mm@kvack.org
Cc: mhocko@suse.com
Link: http://lkml.kernel.org/r/20160708001915.813703D9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c    | 12 ++++++------
 arch/x86/mm/pageattr.c   |  2 +-
 arch/x86/mm/pgtable_32.c |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bce2e5d..bb88fbc 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
 		 * pagetable pages as RO. So assume someone who pre-setup
 		 * these mappings are more intelligent.
 		 */
-		if (pte_val(*pte)) {
+		if (!pte_none(*pte)) {
 			if (!after_bootmem)
 				pages++;
 			continue;
@@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
 			continue;
 		}
 
-		if (pmd_val(*pmd)) {
+		if (!pmd_none(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
 				pte = (pte_t *)pmd_page_vaddr(*pmd);
@@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
 			continue;
 		}
 
-		if (pud_val(*pud)) {
+		if (!pud_none(*pud)) {
 			if (!pud_large(*pud)) {
 				pmd = pmd_offset(pud, 0);
 				last_map_addr = phys_pmd_init(pmd, addr, end,
@@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
 
 	for (i = 0; i < PTRS_PER_PTE; i++) {
 		pte = pte_start + i;
-		if (pte_val(*pte))
+		if (!pte_none(*pte))
 			return;
 	}
 
@@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		pmd = pmd_start + i;
-		if (pmd_val(*pmd))
+		if (!pmd_none(*pmd))
 			return;
 	}
 
@@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
 
 	for (i = 0; i < PTRS_PER_PUD; i++) {
 		pud = pud_start + i;
-		if (pud_val(*pud))
+		if (!pud_none(*pud))
 			return false;
 	}
 
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7a1f7bb..7514215 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1185,7 +1185,7 @@ repeat:
 		return __cpa_process_fault(cpa, address, primary);
 
 	old_pte = *kpte;
-	if (!pte_val(old_pte))
+	if (pte_none(old_pte))
 		return __cpa_process_fault(cpa, address, primary);
 
 	if (level == PG_LEVEL_4K) {
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index 75cc097..e67ae0e6 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 		return;
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (pte_val(pteval))
+	if (!pte_none(pteval))
 		set_pte_at(&init_mm, vaddr, pte, pteval);
 	else
 		pte_clear(&init_mm, vaddr, pte);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum
  2016-07-08  0:19 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
                   ` (3 preceding siblings ...)
  2016-07-08  0:19 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
@ 2016-07-13  9:54 ` Vlastimil Babka
  4 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2016-07-13  9:54 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Benjamin Herrenschmidt

On 07/08/2016 02:19 AM, Dave Hansen wrote:
> This patch survived a bunch of testing over the past week, including
> on hardware affected by the issue.  A debugging patch showed the
> "stray" bits being set, and no ill effects were noticed.
>
> Barring any heartburn from folks, I think this is ready for the tip
> tree.

I don't see any answer to Benjamin's question on the previous version?
https://lkml.org/lkml/2016/7/1/703

> --
>
> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
> Landing) has an erratum where a processor thread setting the Accessed
> or Dirty bits may not do so atomically against its checks for the
> Present bit.  This may cause a thread (which is about to page fault)
> to set A and/or D, even though the Present bit had already been
> atomically cleared.
>
> These bits are truly "stray".  In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page.  This means that we do not have to launder the bit(s); we
> can simply ignore them.
>
> More details can be found in the "Specification Update" under "KNL4":
>
> 	http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-phi-processor-specification-update.pdf
>
> If the PTE is used for storing a swap index or a NUMA migration index,
> the A bit could be misinterpreted as part of the swap type.  The stray
> bits being set cause a software-cleared PTE to be interpreted as a
> swap entry.  In some cases (like when the swap index ends up being
> for a non-existent swapfile), the kernel detects the stray value
> and WARN()s about it, but there is no guarantee that the kernel can
> always detect it.
>
> This patch changes the kernel to attempt to ignore those stray bits
> when they get set.  We do this by making our swap PTE format
> completely ignore the A/D bits, and also by ignoring them in our
> pte_none() checks.
>
> Andi Kleen wrote the original version of this patch.  Dave Hansen
> wrote the later ones.
>
> v4: complete rework: let the bad bits stay around, but try to
>     ignore them
> v3: huge rework to keep batching working in unmap case
> v2: out of line. avoid single thread flush. cover more clear
>     cases
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-08  0:19 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
  2016-07-13  8:04   ` [tip:x86/mm] x86/mm: Use " tip-bot for Dave Hansen
@ 2016-07-13 15:18   ` Michal Hocko
  2016-07-13 15:23     ` Julia Lawall
  2016-07-13 15:49     ` Julia Lawall
  2016-07-14 13:47   ` Vlastimil Babka
  2 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2016-07-13 15:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak, dave.hansen,
	dave.hansen, Julia Lawall

[CCing Julia]

On Thu 07-07-16 17:19:15, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The page table manipulation code seems to have grown a couple of
> sites that are looking for empty PTEs.  Just in case one of these
> entries got a stray bit set, use pte_none() instead of checking
> for a zero pte_val().

This looks like something that coccinelle could help with and automate.
Especially when the patch seems interesting for applying to older kernel
code streams.

Julia would it be hard to generate a metapatch which would check the
{pte,pmd}_val() usage in conditions and replace them with {pte,pmd}_none
equivalents?

> The use pte_same() makes me a bit nervous.  If we were doing a
> pte_same() check against two cleared entries and one of them had
> a stray bit set, it might fail the pte_same() check.  But, I
> don't think we ever _do_ pte_same() for cleared entries.  It is
> almost entirely used for checking for races in fault-in paths.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Other than that looks good to me. Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 
>  b/arch/x86/mm/init_64.c    |   12 ++++++------
>  b/arch/x86/mm/pageattr.c   |    2 +-
>  b/arch/x86/mm/pgtable_32.c |    2 +-
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff -puN arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups arch/x86/mm/init_64.c
> --- a/arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.942808493 -0700
> +++ b/arch/x86/mm/init_64.c	2016-07-07 17:17:44.949808807 -0700
> @@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned
>  		 * pagetable pages as RO. So assume someone who pre-setup
>  		 * these mappings are more intelligent.
>  		 */
> -		if (pte_val(*pte)) {
> +		if (!pte_none(*pte)) {
>  			if (!after_bootmem)
>  				pages++;
>  			continue;
> @@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
>  			continue;
>  		}
>  
> -		if (pmd_val(*pmd)) {
> +		if (!pmd_none(*pmd)) {
>  			if (!pmd_large(*pmd)) {
>  				spin_lock(&init_mm.page_table_lock);
>  				pte = (pte_t *)pmd_page_vaddr(*pmd);
> @@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned
>  			continue;
>  		}
>  
> -		if (pud_val(*pud)) {
> +		if (!pud_none(*pud)) {
>  			if (!pud_large(*pud)) {
>  				pmd = pmd_offset(pud, 0);
>  				last_map_addr = phys_pmd_init(pmd, addr, end,
> @@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte
>  
>  	for (i = 0; i < PTRS_PER_PTE; i++) {
>  		pte = pte_start + i;
> -		if (pte_val(*pte))
> +		if (!pte_none(*pte))
>  			return;
>  	}
>  
> @@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd
>  
>  	for (i = 0; i < PTRS_PER_PMD; i++) {
>  		pmd = pmd_start + i;
> -		if (pmd_val(*pmd))
> +		if (!pmd_none(*pmd))
>  			return;
>  	}
>  
> @@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud
>  
>  	for (i = 0; i < PTRS_PER_PUD; i++) {
>  		pud = pud_start + i;
> -		if (pud_val(*pud))
> +		if (!pud_none(*pud))
>  			return false;
>  	}
>  
> diff -puN arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pageattr.c
> --- a/arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.944808582 -0700
> +++ b/arch/x86/mm/pageattr.c	2016-07-07 17:17:44.950808852 -0700
> @@ -1185,7 +1185,7 @@ repeat:
>  		return __cpa_process_fault(cpa, address, primary);
>  
>  	old_pte = *kpte;
> -	if (!pte_val(old_pte))
> +	if (pte_none(old_pte))
>  		return __cpa_process_fault(cpa, address, primary);
>  
>  	if (level == PG_LEVEL_4K) {
> diff -puN arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pgtable_32.c
> --- a/arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.946808672 -0700
> +++ b/arch/x86/mm/pgtable_32.c	2016-07-07 17:17:44.950808852 -0700
> @@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr,
>  		return;
>  	}
>  	pte = pte_offset_kernel(pmd, vaddr);
> -	if (pte_val(pteval))
> +	if (!pte_none(pteval))
>  		set_pte_at(&init_mm, vaddr, pte, pteval);
>  	else
>  		pte_clear(&init_mm, vaddr, pte);
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum
  2016-07-08  0:19 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
  2016-07-13  8:03   ` [tip:x86/mm] x86/mm: Move " tip-bot for Dave Hansen
@ 2016-07-13 15:19   ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-07-13 15:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak, dave.hansen,
	dave.hansen

On Thu 07-07-16 17:19:11, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This erratum can result in Accessed/Dirty getting set by the hardware
> when we do not expect them to be (on !Present PTEs).
> 
> Instead of trying to fix them up after this happens, we just
> allow the bits to get set and try to ignore them.  We do this by
> shifting the layout of the bits we use for swap offset/type in
> our 64-bit PTEs.
> 
> It looks like this:
> 
> bitnrs: |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
> names:  |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
> before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
>  after: | OFFSET (14-63)  |  TYPE (9-13) |0|X|X|X| X| X|X|X|0|
> 
> Note that D was already a don't care (X) even before.  We just
> move TYPE up and turn its old spot (which could be hit by the
> A bit) into all don't cares.
> 
> We take 5 bits away from the offset, but that still leaves us
> with 50 bits which lets us index into a 62-bit swapfile (4 EiB).
> I think that's probably fine for the moment.  We could
> theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it
> doesn't gain us anything.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Yes, this seems like a safest option. Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 
>  b/arch/x86/include/asm/pgtable_64.h |   26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
> 
> diff -puN arch/x86/include/asm/pgtable_64.h~knl-strays-10-move-swp-pte-bits arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h~knl-strays-10-move-swp-pte-bits	2016-07-07 17:17:43.556746185 -0700
> +++ b/arch/x86/include/asm/pgtable_64.h	2016-07-07 17:17:43.559746319 -0700
> @@ -140,18 +140,32 @@ static inline int pgd_large(pgd_t pgd) {
>  #define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
>  #define pte_unmap(pte) ((void)(pte))/* NOP */
>  
> -/* Encode and de-code a swap entry */
> +/*
> + * Encode and de-code a swap entry
> + *
> + * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
> + * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
> + * | OFFSET (14->63) | TYPE (10-13) |0|X|X|X| X| X|X|X|0| <- swp entry
> + *
> + * G (8) is aliased and used as a PROT_NONE indicator for
> + * !present ptes.  We need to start storing swap entries above
> + * there.  We also need to avoid using A and D because of an
> + * erratum where they can be incorrectly set by hardware on
> + * non-present PTEs.
> + */
> +#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
>  #define SWP_TYPE_BITS 5
> -#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
> +/* Place the offset above the type: */
> +#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS + 1)
>  
>  #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
>  
> -#define __swp_type(x)			(((x).val >> (_PAGE_BIT_PRESENT + 1)) \
> +#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
>  					 & ((1U << SWP_TYPE_BITS) - 1))
> -#define __swp_offset(x)			((x).val >> SWP_OFFSET_SHIFT)
> +#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
>  #define __swp_entry(type, offset)	((swp_entry_t) { \
> -					 ((type) << (_PAGE_BIT_PRESENT + 1)) \
> -					 | ((offset) << SWP_OFFSET_SHIFT) })
> +					 ((type) << (SWP_TYPE_FIRST_BIT)) \
> +					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
>  
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none()
  2016-07-08  0:19 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
  2016-07-13  8:03   ` [tip:x86/mm] x86/mm: Ignore " tip-bot for Dave Hansen
@ 2016-07-13 15:21   ` Michal Hocko
  2016-07-13 15:47     ` Dave Hansen
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2016-07-13 15:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak, dave.hansen,
	dave.hansen

On Thu 07-07-16 17:19:12, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The erratum we are fixing here can lead to stray setting of the
> A and D bits.  That means that a pte that we cleared might
> suddenly have A/D set.  So, stop considering those bits when
> determining if a pte is pte_none().  The same goes for the
> other pmd_none() and pud_none().  pgd_none() can be skipped
> because it is not affected; we do not use PGD entries for
> anything other than pagetables on affected configurations.
> 
> This adds a tiny amount of overhead to all pte_none() checks.
> I doubt we'll be able to measure it anywhere.

It would be better to introduce the overhead only for the affected
cpu models but I guess this is also acceptable. Would it be too
complicated to use alternatives for that?

> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Anyway
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> 
>  b/arch/x86/include/asm/pgtable.h       |   13 ++++++++++---
>  b/arch/x86/include/asm/pgtable_types.h |    6 ++++++
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff -puN arch/x86/include/asm/pgtable.h~knl-strays-20-mod-pte-none arch/x86/include/asm/pgtable.h
> --- a/arch/x86/include/asm/pgtable.h~knl-strays-20-mod-pte-none	2016-07-07 17:17:43.974764976 -0700
> +++ b/arch/x86/include/asm/pgtable.h	2016-07-07 17:17:43.980765246 -0700
> @@ -480,7 +480,7 @@ pte_t *populate_extra_pte(unsigned long
>  
>  static inline int pte_none(pte_t pte)
>  {
> -	return !pte.pte;
> +	return !(pte.pte & ~(_PAGE_KNL_ERRATUM_MASK));
>  }
>  
>  #define __HAVE_ARCH_PTE_SAME
> @@ -552,7 +552,8 @@ static inline int pmd_none(pmd_t pmd)
>  {
>  	/* Only check low word on 32-bit platforms, since it might be
>  	   out of sync with upper half. */
> -	return (unsigned long)native_pmd_val(pmd) == 0;
> +	unsigned long val = native_pmd_val(pmd);
> +	return (val & ~_PAGE_KNL_ERRATUM_MASK) == 0;
>  }
>  
>  static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> @@ -616,7 +617,7 @@ static inline unsigned long pages_to_mb(
>  #if CONFIG_PGTABLE_LEVELS > 2
>  static inline int pud_none(pud_t pud)
>  {
> -	return native_pud_val(pud) == 0;
> +	return (native_pud_val(pud) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
>  }
>  
>  static inline int pud_present(pud_t pud)
> @@ -694,6 +695,12 @@ static inline int pgd_bad(pgd_t pgd)
>  
>  static inline int pgd_none(pgd_t pgd)
>  {
> +	/*
> +	 * There is no need to do a workaround for the KNL stray
> +	 * A/D bit erratum here.  PGDs only point to page tables
> +	 * except on 32-bit non-PAE which is not supported on
> +	 * KNL.
> +	 */
>  	return !native_pgd_val(pgd);
>  }
>  #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
> diff -puN arch/x86/include/asm/pgtable_types.h~knl-strays-20-mod-pte-none arch/x86/include/asm/pgtable_types.h
> --- a/arch/x86/include/asm/pgtable_types.h~knl-strays-20-mod-pte-none	2016-07-07 17:17:43.976765066 -0700
> +++ b/arch/x86/include/asm/pgtable_types.h	2016-07-07 17:17:43.980765246 -0700
> @@ -70,6 +70,12 @@
>  			 _PAGE_PKEY_BIT2 | \
>  			 _PAGE_PKEY_BIT3)
>  
> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> +#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
> +#else
> +#define _PAGE_KNL_ERRATUM_MASK 0
> +#endif
> +
>  #ifdef CONFIG_KMEMCHECK
>  #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
>  #else
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-13 15:18   ` [PATCH 4/4] x86: use " Michal Hocko
@ 2016-07-13 15:23     ` Julia Lawall
  2016-07-13 15:49     ` Julia Lawall
  1 sibling, 0 replies; 22+ messages in thread
From: Julia Lawall @ 2016-07-13 15:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak,
	dave.hansen, dave.hansen



On Wed, 13 Jul 2016, Michal Hocko wrote:

> [CCing Julia]
>
> On Thu 07-07-16 17:19:15, Dave Hansen wrote:
> >
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > The page table manipulation code seems to have grown a couple of
> > sites that are looking for empty PTEs.  Just in case one of these
> > entries got a stray bit set, use pte_none() instead of checking
> > for a zero pte_val().
>
> This looks like something that coccinelle could help with and automate.
> Especially when the patch seems interesting for applying to older kernel
> code streams.
>
> Julia would it be hard to generate a metapatch which would check the
> {pte,pmd}_val() usage in conditions and replace them with {pte,pmd}_none
> equivalents?

Thanks for forwarding.  A priori, it looks quite trivial.  I will do the
obvious thing and send the results for verification.

julia

> > The use pte_same() makes me a bit nervous.  If we were doing a
> > pte_same() check against two cleared entries and one of them had
> > a stray bit set, it might fail the pte_same() check.  But, I
> > don't think we ever _do_ pte_same() for cleared entries.  It is
> > almost entirely used for checking for races in fault-in paths.
> >
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>
> Other than that looks good to me. Feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> > ---
> >
> >  b/arch/x86/mm/init_64.c    |   12 ++++++------
> >  b/arch/x86/mm/pageattr.c   |    2 +-
> >  b/arch/x86/mm/pgtable_32.c |    2 +-
> >  3 files changed, 8 insertions(+), 8 deletions(-)
> >
> > diff -puN arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups arch/x86/mm/init_64.c
> > --- a/arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.942808493 -0700
> > +++ b/arch/x86/mm/init_64.c	2016-07-07 17:17:44.949808807 -0700
> > @@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned
> >  		 * pagetable pages as RO. So assume someone who pre-setup
> >  		 * these mappings are more intelligent.
> >  		 */
> > -		if (pte_val(*pte)) {
> > +		if (!pte_none(*pte)) {
> >  			if (!after_bootmem)
> >  				pages++;
> >  			continue;
> > @@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
> >  			continue;
> >  		}
> >
> > -		if (pmd_val(*pmd)) {
> > +		if (!pmd_none(*pmd)) {
> >  			if (!pmd_large(*pmd)) {
> >  				spin_lock(&init_mm.page_table_lock);
> >  				pte = (pte_t *)pmd_page_vaddr(*pmd);
> > @@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned
> >  			continue;
> >  		}
> >
> > -		if (pud_val(*pud)) {
> > +		if (!pud_none(*pud)) {
> >  			if (!pud_large(*pud)) {
> >  				pmd = pmd_offset(pud, 0);
> >  				last_map_addr = phys_pmd_init(pmd, addr, end,
> > @@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte
> >
> >  	for (i = 0; i < PTRS_PER_PTE; i++) {
> >  		pte = pte_start + i;
> > -		if (pte_val(*pte))
> > +		if (!pte_none(*pte))
> >  			return;
> >  	}
> >
> > @@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd
> >
> >  	for (i = 0; i < PTRS_PER_PMD; i++) {
> >  		pmd = pmd_start + i;
> > -		if (pmd_val(*pmd))
> > +		if (!pmd_none(*pmd))
> >  			return;
> >  	}
> >
> > @@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud
> >
> >  	for (i = 0; i < PTRS_PER_PUD; i++) {
> >  		pud = pud_start + i;
> > -		if (pud_val(*pud))
> > +		if (!pud_none(*pud))
> >  			return false;
> >  	}
> >
> > diff -puN arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pageattr.c
> > --- a/arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.944808582 -0700
> > +++ b/arch/x86/mm/pageattr.c	2016-07-07 17:17:44.950808852 -0700
> > @@ -1185,7 +1185,7 @@ repeat:
> >  		return __cpa_process_fault(cpa, address, primary);
> >
> >  	old_pte = *kpte;
> > -	if (!pte_val(old_pte))
> > +	if (pte_none(old_pte))
> >  		return __cpa_process_fault(cpa, address, primary);
> >
> >  	if (level == PG_LEVEL_4K) {
> > diff -puN arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pgtable_32.c
> > --- a/arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups	2016-07-07 17:17:44.946808672 -0700
> > +++ b/arch/x86/mm/pgtable_32.c	2016-07-07 17:17:44.950808852 -0700
> > @@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr,
> >  		return;
> >  	}
> >  	pte = pte_offset_kernel(pmd, vaddr);
> > -	if (pte_val(pteval))
> > +	if (!pte_none(pteval))
> >  		set_pte_at(&init_mm, vaddr, pte, pteval);
> >  	else
> >  		pte_clear(&init_mm, vaddr, pte);
> > _
>
> --
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none()
  2016-07-13 15:21   ` [PATCH 2/4] x86, pagetable: ignore " Michal Hocko
@ 2016-07-13 15:47     ` Dave Hansen
  2016-07-14  6:13       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2016-07-13 15:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak, dave.hansen,
	dave.hansen

On 07/13/2016 08:21 AM, Michal Hocko wrote:
>> > This adds a tiny amount of overhead to all pte_none() checks.
>> > I doubt we'll be able to measure it anywhere.
> It would be better to introduce the overhead only for the affected
> cpu models but I guess this is also acceptable. Would it be too
> complicated to use alternatives for that?

The patch as it stands ends up doing a one-instruction change in
pte_none().  It goes from

    64c8:       48 85 ff                test   %rdi,%rdi

to

    64a8:       48 f7 c7 9f ff ff ff    test   $0xffffffffffffff9f,%rdi

So it essentially eats 4 bytes of icache more than it did before.  But,
it's the same number of instructions, and I can't imagine that the CPU
will have any more trouble with a test against an immediate than a test
against 0.

We could theoretically do alternatives for this, but we would at *best*
end up with 4 bytes of noops.  So, unless the processor likes decoding 4
noops better than 4 bytes of immediate as part of an instruction, we'll
not win anything.  *Plus* the ugliness of the assembly that we'll need
to have the compiler guarantee that the PTE ends up in %rdi.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-13 15:18   ` [PATCH 4/4] x86: use " Michal Hocko
  2016-07-13 15:23     ` Julia Lawall
@ 2016-07-13 15:49     ` Julia Lawall
  2016-07-13 16:28       ` Dave Hansen
  1 sibling, 1 reply; 22+ messages in thread
From: Julia Lawall @ 2016-07-13 15:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak,
	dave.hansen, dave.hansen, Julia Lawall

My results are below.  There are a couple of cases in arch/mn10300/mm that
were not in the original patch.

julia

diff -u -p a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1185,7 +1185,7 @@ repeat:
 		return __cpa_process_fault(cpa, address, primary);

 	old_pte = *kpte;
-	if (!pte_val(old_pte))
+	if (pte_none(old_pte))
 		return __cpa_process_fault(cpa, address, primary);

 	if (level == PG_LEVEL_4K) {
diff -u -p a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned
 		 * pagetable pages as RO. So assume someone who pre-setup
 		 * these mappings are more intelligent.
 		 */
-		if (pte_val(*pte)) {
+		if (!pte_none(*pte)) {
 			if (!after_bootmem)
 				pages++;
 			continue;
@@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
 			continue;
 		}

-		if (pmd_val(*pmd)) {
+		if (!pmd_none(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
 				pte = (pte_t *)pmd_page_vaddr(*pmd);
@@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned
 			continue;
 		}

-		if (pud_val(*pud)) {
+		if (!pud_none(*pud)) {
 			if (!pud_large(*pud)) {
 				pmd = pmd_offset(pud, 0);
 				last_map_addr = phys_pmd_init(pmd, addr, end,
@@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte

 	for (i = 0; i < PTRS_PER_PTE; i++) {
 		pte = pte_start + i;
-		if (pte_val(*pte))
+		if (!pte_none(*pte))
 			return;
 	}

@@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd

 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		pmd = pmd_start + i;
-		if (pmd_val(*pmd))
+		if (!pmd_none(*pmd))
 			return;
 	}

@@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud

 	for (i = 0; i < PTRS_PER_PUD; i++) {
 		pud = pud_start + i;
-		if (pud_val(*pud))
+		if (!pud_none(*pud))
 			return false;
 	}

diff -u -p a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr,
 		return;
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (pte_val(pteval))
+	if (!pte_none(pteval))
 		set_pte_at(&init_mm, vaddr, pte, pteval);
 	else
 		pte_clear(&init_mm, vaddr, pte);
diff -u -p a/arch/mn10300/mm/cache-flush-icache.c b/arch/mn10300/mm/cache-flush-icache.c
--- a/arch/mn10300/mm/cache-flush-icache.c
+++ b/arch/mn10300/mm/cache-flush-icache.c
@@ -67,11 +67,11 @@ static void flush_icache_page_range(unsi
 		return;

 	pud = pud_offset(pgd, start);
-	if (!pud || !pud_val(*pud))
+	if (!pud || pud_none(*pud))
 		return;

 	pmd = pmd_offset(pud, start);
-	if (!pmd || !pmd_val(*pmd))
+	if (!pmd || pmd_none(*pmd))
 		return;

 	ppte = pte_offset_map(pmd, start);
diff -u -p a/arch/mn10300/mm/cache-inv-icache.c b/arch/mn10300/mm/cache-inv-icache.c
--- a/arch/mn10300/mm/cache-inv-icache.c
+++ b/arch/mn10300/mm/cache-inv-icache.c
@@ -45,11 +45,11 @@ static void flush_icache_page_range(unsi
 		return;

 	pud = pud_offset(pgd, start);
-	if (!pud || !pud_val(*pud))
+	if (!pud || pud_none(*pud))
 		return;

 	pmd = pmd_offset(pud, start);
-	if (!pmd || !pmd_val(*pmd))
+	if (!pmd || pmd_none(*pmd))
 		return;

 	ppte = pte_offset_map(pmd, start);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-13 15:49     ` Julia Lawall
@ 2016-07-13 16:28       ` Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2016-07-13 16:28 UTC (permalink / raw)
  To: Julia Lawall, Michal Hocko
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak, dave.hansen,
	David Howells

On 07/13/2016 08:49 AM, Julia Lawall wrote:
> My results are below.  There are a couple of cases in arch/mn10300/mm that
> were not in the original patch.

Yeah, so mn10300 is obviously unaffected by the erratum in question, and
I didn't look for non-x86 architectures for this patch.

But, this code definitely _looks_ like it should be using pte_none(),
especially since mn10300 defines it the same way as x86 (well, as x86
_did_ before this series).

	#define pte_none(x)		(!pte_val(x))

> diff -u -p a/arch/mn10300/mm/cache-inv-icache.c b/arch/mn10300/mm/cache-inv-icache.c
> --- a/arch/mn10300/mm/cache-inv-icache.c
> +++ b/arch/mn10300/mm/cache-inv-icache.c
> @@ -45,11 +45,11 @@ static void flush_icache_page_range(unsi
>  		return;
> 
>  	pud = pud_offset(pgd, start);
> -	if (!pud || !pud_val(*pud))
> +	if (!pud || pud_none(*pud))
>  		return;
> 
>  	pmd = pmd_offset(pud, start);
> -	if (!pmd || !pmd_val(*pmd))
> +	if (!pmd || pmd_none(*pmd))
>  		return;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none()
  2016-07-13 15:47     ` Dave Hansen
@ 2016-07-14  6:13       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2016-07-14  6:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-mm, torvalds, akpm, bp, ak, dave.hansen,
	dave.hansen

On Wed 13-07-16 08:47:51, Dave Hansen wrote:
> On 07/13/2016 08:21 AM, Michal Hocko wrote:
> >> > This adds a tiny amount of overhead to all pte_none() checks.
> >> > I doubt we'll be able to measure it anywhere.
> > It would be better to introduce the overhead only for the affected
> > cpu models but I guess this is also acceptable. Would it be too
> > complicated to use alternatives for that?
> 
> The patch as it stands ends up doing a one-instruction change in
> pte_none().  It goes from
> 
>     64c8:       48 85 ff                test   %rdi,%rdi
> 
> to
> 
>     64a8:       48 f7 c7 9f ff ff ff    test   $0xffffffffffffff9f,%rdi
> 
> So it essentially eats 4 bytes of icache more than it did before.  But,
> it's the same number of instructions, and I can't imagine that the CPU
> will have any more trouble with a test against an immediate than a test
> against 0.

I see. Thanks for the clarification.

> We could theoretically do alternatives for this, but we would at *best*
> end up with 4 bytes of noops.  So, unless the processor likes decoding 4
> noops better than 4 bytes of immediate as part of an instruction, we'll
> not win anything.  *Plus* the ugliness of the assembly that we'll need
> to have the compiler guarantee that the PTE ends up in %rdi.

Agreed!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-08  0:19 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
  2016-07-13  8:04   ` [tip:x86/mm] x86/mm: Use " tip-bot for Dave Hansen
  2016-07-13 15:18   ` [PATCH 4/4] x86: use " Michal Hocko
@ 2016-07-14 13:47   ` Vlastimil Babka
  2016-07-14 14:24     ` Dave Hansen
  2 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2016-07-14 13:47 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen, dave.hansen

On 07/08/2016 02:19 AM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The page table manipulation code seems to have grown a couple of
> sites that are looking for empty PTEs.  Just in case one of these
> entries got a stray bit set, use pte_none() instead of checking
> for a zero pte_val().
>
> The use pte_same() makes me a bit nervous.  If we were doing a
> pte_same() check against two cleared entries and one of them had
> a stray bit set, it might fail the pte_same() check.  But, I
> don't think we ever _do_ pte_same() for cleared entries.  It is
> almost entirely used for checking for races in fault-in paths.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

So, this might be just because I know next to nothing about (para)virt, 
but...

in arch/x86/include/asm/paravirt.h, pte_val is implemented via some 
pvops, which suggests that obtaining a pte value is different than just 
reading it from memory. But I don't see pte_none() defined to be using 
this on paravirt, and it shares (before patch 2/4) the "return !pte.pte" 
implementation, AFAICS?

So that itself is suspicious to me. And now that this patches does 
things like this:

-              if (pte_val(*pte)) {
+              if (!pte_none(*pte)) {

So previously on paravirt these tests would read pte via the pvops, and 
now they won't. Is that OK?

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-14 13:47   ` Vlastimil Babka
@ 2016-07-14 14:24     ` Dave Hansen
  2016-07-14 14:50       ` David Vrabel
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2016-07-14 14:24 UTC (permalink / raw)
  To: Vlastimil Babka, Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Boris Ostrovsky, David Vrabel, Juergen Gross

On 07/14/2016 06:47 AM, Vlastimil Babka wrote:
> So, this might be just because I know next to nothing about (para)virt,
> but...
> 
> in arch/x86/include/asm/paravirt.h, pte_val is implemented via some
> pvops, which suggests that obtaining a pte value is different than just
> reading it from memory. But I don't see pte_none() defined to be using
> this on paravirt, and it shares (before patch 2/4) the "return !pte.pte"
> implementation, AFAICS?
> 
> So that itself is suspicious to me. And now that this patches does
> things like this:
> 
> -              if (pte_val(*pte)) {
> +              if (!pte_none(*pte)) {
> 
> So previously on paravirt these tests would read pte via the pvops, and
> now they won't. Is that OK?

I've cc'd a few Xen guys.  I think they're the only ones that would care.

But, as far as I can tell, the Xen pte_val() will take a _PAGE_PRESENT
PTE and muck with it.  But its answer will never differ for an all 0 PTE
from !pte_none() because that PTE does not have _PAGE_PRESENT set.

It does seem fragile that Xen is doing it this way, but I guess it works.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-14 14:24     ` Dave Hansen
@ 2016-07-14 14:50       ` David Vrabel
  0 siblings, 0 replies; 22+ messages in thread
From: David Vrabel @ 2016-07-14 14:50 UTC (permalink / raw)
  To: Dave Hansen, Vlastimil Babka, Dave Hansen, linux-kernel
  Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, dave.hansen,
	Boris Ostrovsky, Juergen Gross

On 14/07/16 15:24, Dave Hansen wrote:
> On 07/14/2016 06:47 AM, Vlastimil Babka wrote:
>> So, this might be just because I know next to nothing about (para)virt,
>> but...
>>
>> in arch/x86/include/asm/paravirt.h, pte_val is implemented via some
>> pvops, which suggests that obtaining a pte value is different than just
>> reading it from memory. But I don't see pte_none() defined to be using
>> this on paravirt, and it shares (before patch 2/4) the "return !pte.pte"
>> implementation, AFAICS?
>>
>> So that itself is suspicious to me. And now that this patches does
>> things like this:
>>
>> -              if (pte_val(*pte)) {
>> +              if (!pte_none(*pte)) {
>>
>> So previously on paravirt these tests would read pte via the pvops, and
>> now they won't. Is that OK?
> 
> I've cc'd a few Xen guys.  I think they're the only ones that would care.
> 
> But, as far as I can tell, the Xen pte_val() will take a _PAGE_PRESENT
> PTE and muck with it.  But its answer will never differ for an all 0 PTE
> from !pte_none() because that PTE does not have _PAGE_PRESENT set.
> 
> It does seem fragile that Xen is doing it this way, but I guess it works.

Xen PV guests never plays games with non-present PTEs so, for the
series, wrt Xen:

Acked-by: David Vrabel <david.vrabel@citrix.com>

FWIW, present PTEs have a hardware-specified meaning where-as
non-present PTEs do not, so I'm not sure I'd view Xen PV guests making
this distinct as "fragile".


David

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/4] x86: use pte_none() to test for empty PTE
  2016-07-01 17:46 Dave Hansen
@ 2016-07-01 17:47 ` Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2016-07-01 17:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, linux-mm, torvalds, akpm, bp, ak, mhocko, Dave Hansen


The page table manipulation code seems to have grown a couple of
sites that are looking for empty PTEs.  Just in case one of these
entries got a stray bit set, use pte_none() instead of checking
for a zero pte_val().

The use pte_same() makes me a bit nervous.  If we were doing a
pte_same() check against two cleared entries and one of them had
a stray bit set, it might fail the pte_same() check.  But, I
don't think we ever _do_ pte_same() for cleared entries.  It is
almost entirely used for checking for races in fault-in paths.

---

 b/arch/x86/mm/init_64.c    |   12 ++++++------
 b/arch/x86/mm/pageattr.c   |    2 +-
 b/arch/x86/mm/pgtable_32.c |    2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff -puN arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c~knl-strays-50-pte_val-cleanups	2016-07-01 10:42:07.781811987 -0700
+++ b/arch/x86/mm/init_64.c	2016-07-01 10:42:07.788812305 -0700
@@ -354,7 +354,7 @@ phys_pte_init(pte_t *pte_page, unsigned
 		 * pagetable pages as RO. So assume someone who pre-setup
 		 * these mappings are more intelligent.
 		 */
-		if (pte_val(*pte)) {
+		if (!pte_none(*pte)) {
 			if (!after_bootmem)
 				pages++;
 			continue;
@@ -396,7 +396,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
 			continue;
 		}
 
-		if (pmd_val(*pmd)) {
+		if (!pmd_none(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
 				pte = (pte_t *)pmd_page_vaddr(*pmd);
@@ -470,7 +470,7 @@ phys_pud_init(pud_t *pud_page, unsigned
 			continue;
 		}
 
-		if (pud_val(*pud)) {
+		if (!pud_none(*pud)) {
 			if (!pud_large(*pud)) {
 				pmd = pmd_offset(pud, 0);
 				last_map_addr = phys_pmd_init(pmd, addr, end,
@@ -673,7 +673,7 @@ static void __meminit free_pte_table(pte
 
 	for (i = 0; i < PTRS_PER_PTE; i++) {
 		pte = pte_start + i;
-		if (pte_val(*pte))
+		if (!pte_none(*pte))
 			return;
 	}
 
@@ -691,7 +691,7 @@ static void __meminit free_pmd_table(pmd
 
 	for (i = 0; i < PTRS_PER_PMD; i++) {
 		pmd = pmd_start + i;
-		if (pmd_val(*pmd))
+		if (!pmd_none(*pmd))
 			return;
 	}
 
@@ -710,7 +710,7 @@ static bool __meminit free_pud_table(pud
 
 	for (i = 0; i < PTRS_PER_PUD; i++) {
 		pud = pud_start + i;
-		if (pud_val(*pud))
+		if (!pud_none(*pud))
 			return false;
 	}
 
diff -puN arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~knl-strays-50-pte_val-cleanups	2016-07-01 10:42:07.783812078 -0700
+++ b/arch/x86/mm/pageattr.c	2016-07-01 10:42:07.789812350 -0700
@@ -1185,7 +1185,7 @@ repeat:
 		return __cpa_process_fault(cpa, address, primary);
 
 	old_pte = *kpte;
-	if (!pte_val(old_pte))
+	if (pte_none(old_pte))
 		return __cpa_process_fault(cpa, address, primary);
 
 	if (level == PG_LEVEL_4K) {
diff -puN arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c~knl-strays-50-pte_val-cleanups	2016-07-01 10:42:07.785812169 -0700
+++ b/arch/x86/mm/pgtable_32.c	2016-07-01 10:42:07.789812350 -0700
@@ -47,7 +47,7 @@ void set_pte_vaddr(unsigned long vaddr,
 		return;
 	}
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (pte_val(pteval))
+	if (!pte_none(pteval))
 		set_pte_at(&init_mm, vaddr, pte, pteval);
 	else
 		pte_clear(&init_mm, vaddr, pte);
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-07-14 14:50 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-08  0:19 [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Dave Hansen
2016-07-08  0:19 ` [PATCH 1/4] x86, swap: move swap offset/type up in PTE to work around erratum Dave Hansen
2016-07-13  8:03   ` [tip:x86/mm] x86/mm: Move " tip-bot for Dave Hansen
2016-07-13 15:19   ` [PATCH 1/4] x86, swap: move " Michal Hocko
2016-07-08  0:19 ` [PATCH 2/4] x86, pagetable: ignore A/D bits in pte/pmd/pud_none() Dave Hansen
2016-07-13  8:03   ` [tip:x86/mm] x86/mm: Ignore " tip-bot for Dave Hansen
2016-07-13 15:21   ` [PATCH 2/4] x86, pagetable: ignore " Michal Hocko
2016-07-13 15:47     ` Dave Hansen
2016-07-14  6:13       ` Michal Hocko
2016-07-08  0:19 ` [PATCH 3/4] x86: disallow running with 32-bit PTEs to work around erratum Dave Hansen
2016-07-13  8:04   ` [tip:x86/mm] x86/mm: Disallow " tip-bot for Dave Hansen
2016-07-08  0:19 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen
2016-07-13  8:04   ` [tip:x86/mm] x86/mm: Use " tip-bot for Dave Hansen
2016-07-13 15:18   ` [PATCH 4/4] x86: use " Michal Hocko
2016-07-13 15:23     ` Julia Lawall
2016-07-13 15:49     ` Julia Lawall
2016-07-13 16:28       ` Dave Hansen
2016-07-14 13:47   ` Vlastimil Babka
2016-07-14 14:24     ` Dave Hansen
2016-07-14 14:50       ` David Vrabel
2016-07-13  9:54 ` [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum Vlastimil Babka
  -- strict thread matches above, loose matches on Subject: below --
2016-07-01 17:46 Dave Hansen
2016-07-01 17:47 ` [PATCH 4/4] x86: use pte_none() to test for empty PTE Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).