linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Linux VM workaround for Knights Landing A/D leak
@ 2016-06-14 15:58 Lukasz Anaczkowski
  2016-06-14 16:31 ` kbuild test robot
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Lukasz Anaczkowski @ 2016-06-14 15:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa
  Cc: lukasz.anaczkowski, harish.srinivasappa, lukasz.odzioba

From: Andi Kleen <ak@linux.intel.com>

Knights Landing has a issue that a thread setting A or D bits
may not do so atomically against checking the present bit.
A thread which is going to page fault may still set those
bits, even though the present bit was already atomically cleared.

This implies that when the kernel clears present atomically,
some time later the supposed to be zero entry could be corrupted
with stray A or D bits.

Since the PTE could be already used for storing a swap index,
or a NUMA migration index, this cannot be tolerated. Most
of the time the kernel detects the problem, but in some
rare cases it may not.

This patch enforces that the page unmap path in vmscan/direct reclaim
always flushes other CPUs after clearing each page, and also
clears the PTE again after the flush.

For reclaim this brings the performance back to before Mel's
flushing changes, but for unmap it disables batching.

This makes sure any leaked A/D bits are immediately cleared before the entry
is used for something else.

Any parallel faults that check for entry is zero may loop,
but they should eventually recover after the entry is written.

Also other users may spin in the page table lock until we
"fixed" the PTE. This is ensured by always taking the page table lock
even for the swap cache case. Previously this was only done
on architectures with non atomic PTE accesses (such as 32bit PTE),
but now it is also done when this bug workaround is active.

I audited apply_pte_range and other users of arch_enter_lazy...
and they seem to all not clear the present bit.

Right now the extra flush is done in the low level
architecture code, while the higher level code still
does batched TLB flush. This means there is always one extra
unnecessary TLB flush now. As a followon optimization
this could be avoided by telling the callers that
the flush already happenend.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/hugetlb.h     |  9 ++++++++-
 arch/x86/include/asm/pgtable.h     |  5 +++++
 arch/x86/include/asm/pgtable_64.h  |  6 ++++++
 arch/x86/kernel/cpu/intel.c        | 10 ++++++++++
 arch/x86/mm/tlb.c                  | 20 ++++++++++++++++++++
 include/linux/mm.h                 |  4 ++++
 mm/memory.c                        |  3 ++-
 8 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4a41348..2c48011 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -303,6 +303,7 @@
 #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
 #define X86_BUG_NULL_SEG	X86_BUG(9) /* Nulling a selector preserves the base */
 #define X86_BUG_SWAPGS_FENCE	X86_BUG(10) /* SWAPGS without input dep on GS */
+#define X86_BUG_PTE_LEAK        X86_BUG(11) /* PTE may leak A/D bits after clear */
 
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 3a10616..58e1ca9 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -41,10 +41,17 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	set_pte_at(mm, addr, ptep, pte);
 }
 
+extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep);
+
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
 {
-	return ptep_get_and_clear(mm, addr, ptep);
+	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+
+	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
+		fix_pte_leak(mm, addr, ptep);
+	return pte;
 }
 
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1a27396..9769355 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -794,11 +794,16 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
 
+extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep);
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pte_t *ptep)
 {
 	pte_t pte = native_ptep_get_and_clear(ptep);
+	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
+		fix_pte_leak(mm, addr, ptep);
 	pte_update(mm, addr, ptep);
 	return pte;
 }
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee7811..6fa4079 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -178,6 +178,12 @@ extern void cleanup_highmap(void);
 extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
 extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
 
+#define ARCH_HAS_NEEDS_SWAP_PTL 1
+static inline bool arch_needs_swap_ptl(void)
+{
+       return boot_cpu_has_bug(X86_BUG_PTE_LEAK);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 6e2ffbe..f499513 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -181,6 +181,16 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		}
 	}
 
+	if (c->x86_model == 87) {
+		static bool printed;
+
+		if (!printed) {
+			pr_info("Enabling PTE leaking workaround\n");
+			printed = true;
+		}
+		set_cpu_bug(c, X86_BUG_PTE_LEAK);
+	}
+
 	/*
 	 * Intel Quark Core DevMan_001.pdf section 6.4.11
 	 * "The operating system also is required to invalidate (i.e., flush)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0..3d54488 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -468,4 +468,24 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
 
+/*
+ * Workaround for KNL issue:
+ *
+ * A thread that is going to page fault due to P=0, may still
+ * non atomically set A or D bits, which could corrupt swap entries.
+ * Always flush the other CPUs and clear the PTE again to avoid
+ * this leakage. We are excluded using the pagetable lock.
+ */
+
+void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
+		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+		flush_tlb_others(mm_cpumask(mm), mm, addr,
+				 addr + PAGE_SIZE);
+		mb();
+		set_pte(ptep, __pte(0));
+	}
+}
+
 #endif /* CONFIG_SMP */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5df5feb..5c80fe09 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2404,6 +2404,10 @@ static inline bool debug_guardpage_enabled(void) { return false; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#ifndef ARCH_HAS_NEEDS_SWAP_PTL
+static inline bool arch_needs_swap_ptl(void) { return false; }
+#endif
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
diff --git a/mm/memory.c b/mm/memory.c
index 15322b7..0d6ef39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1960,7 +1960,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
+	if (arch_needs_swap_ptl() ||
+	    sizeof(pte_t) > sizeof(unsigned long)) {
 		spinlock_t *ptl = pte_lockptr(mm, pmd);
 		spin_lock(ptl);
 		same = pte_same(*page_table, orig_pte);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 15:58 [PATCH] Linux VM workaround for Knights Landing A/D leak Lukasz Anaczkowski
@ 2016-06-14 16:31 ` kbuild test robot
  2016-06-14 16:47 ` Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2016-06-14 16:31 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: kbuild-all, linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa, lukasz.anaczkowski,
	harish.srinivasappa, lukasz.odzioba

[-- Attachment #1: Type: text/plain, Size: 1398 bytes --]

Hi,

[auto build test ERROR on v4.7-rc3]
[also build test ERROR on next-20160614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Lukasz-Anaczkowski/Linux-VM-workaround-for-Knights-Landing-A-D-leak/20160615-000610
config: i386-alldefconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/built-in.o: In function `unmap_page_range':
   (.text+0x1e9e8): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `change_protection_range':
>> mprotect.c:(.text+0x25578): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `move_page_tables':
   (.text+0x25d81): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `vunmap_page_range':
   vmalloc.c:(.text+0x28419): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `ptep_clear_flush':
   (.text+0x2a7b3): undefined reference to `fix_pte_leak'
   mm/built-in.o:madvise.c:(.text+0x2b4b0): more undefined references to `fix_pte_leak' follow

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 9935 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 15:58 [PATCH] Linux VM workaround for Knights Landing A/D leak Lukasz Anaczkowski
  2016-06-14 16:31 ` kbuild test robot
@ 2016-06-14 16:47 ` Nadav Amit
  2016-06-14 16:54   ` Anaczkowski, Lukasz
                     ` (2 more replies)
  2016-06-14 16:58 ` kbuild test robot
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 38+ messages in thread
From: Nadav Amit @ 2016-06-14 16:47 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, Dave Hansen, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	harish.srinivasappa, lukasz.odzioba


Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:

> From: Andi Kleen <ak@linux.intel.com>
> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> +{
Here there should be a call to smp_mb__after_atomic() to synchronize with
switch_mm. I submitted a similar patch, which is still pending (hint).

> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
> +				 addr + PAGE_SIZE);
> +		mb();
> +		set_pte(ptep, __pte(0));
> +	}
> +}

Regards,
Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 16:47 ` Nadav Amit
@ 2016-06-14 16:54   ` Anaczkowski, Lukasz
  2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
  2016-06-14 17:18   ` [PATCH] " Dave Hansen
  2 siblings, 0 replies; 38+ messages in thread
From: Anaczkowski, Lukasz @ 2016-06-14 16:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, Dave Hansen, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	Srinivasappa, Harish, Odzioba, Lukasz

From: Nadav Amit [mailto:nadav.amit@gmail.com] 
Sent: Tuesday, June 14, 2016 6:48 PM

> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:

>> From: Andi Kleen <ak@linux.intel.com>
>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> +{
> Here there should be a call to smp_mb__after_atomic() to synchronize with
> switch_mm. I submitted a similar patch, which is still pending (hint).

Thanks, Nadav!
I'll add this and re-submit the patch.

Cheers,
Lukasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 15:58 [PATCH] Linux VM workaround for Knights Landing A/D leak Lukasz Anaczkowski
  2016-06-14 16:31 ` kbuild test robot
  2016-06-14 16:47 ` Nadav Amit
@ 2016-06-14 16:58 ` kbuild test robot
  2016-06-14 17:19 ` Dave Hansen
  2016-06-14 17:47 ` kbuild test robot
  4 siblings, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2016-06-14 16:58 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: kbuild-all, linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa, lukasz.anaczkowski,
	harish.srinivasappa, lukasz.odzioba

[-- Attachment #1: Type: text/plain, Size: 1403 bytes --]

Hi,

[auto build test ERROR on v4.7-rc3]
[also build test ERROR on next-20160614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Lukasz-Anaczkowski/Linux-VM-workaround-for-Knights-Landing-A-D-leak/20160615-000610
config: i386-randconfig-s0-201624 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/built-in.o: In function `__ptep_modify_prot_start':
>> paravirt.c:(.text+0x346f4): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `unmap_page_range':
>> (.text+0x1dd4c): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `move_page_tables':
   (.text+0x274f2): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `vunmap_page_range':
>> vmalloc.c:(.text+0x2aa78): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `ptep_clear_flush':
   (.text+0x2cfa7): undefined reference to `fix_pte_leak'
   mm/built-in.o:(.text+0x30dff): more undefined references to `fix_pte_leak' follow

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 26996 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 16:47 ` Nadav Amit
  2016-06-14 16:54   ` Anaczkowski, Lukasz
@ 2016-06-14 17:01   ` Lukasz Anaczkowski
  2016-06-14 17:24     ` Dave Hansen
                       ` (3 more replies)
  2016-06-14 17:18   ` [PATCH] " Dave Hansen
  2 siblings, 4 replies; 38+ messages in thread
From: Lukasz Anaczkowski @ 2016-06-14 17:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa
  Cc: lukasz.anaczkowski, harish.srinivasappa, lukasz.odzioba,
	grzegorz.andrejczuk, lukasz.daniluk

From: Andi Kleen <ak@linux.intel.com>

Knights Landing has a issue that a thread setting A or D bits
may not do so atomically against checking the present bit.
A thread which is going to page fault may still set those
bits, even though the present bit was already atomically cleared.

This implies that when the kernel clears present atomically,
some time later the supposed to be zero entry could be corrupted
with stray A or D bits.

Since the PTE could be already used for storing a swap index,
or a NUMA migration index, this cannot be tolerated. Most
of the time the kernel detects the problem, but in some
rare cases it may not.

This patch enforces that the page unmap path in vmscan/direct reclaim
always flushes other CPUs after clearing each page, and also
clears the PTE again after the flush.

For reclaim this brings the performance back to before Mel's
flushing changes, but for unmap it disables batching.

This makes sure any leaked A/D bits are immediately cleared before the entry
is used for something else.

Any parallel faults that check for entry is zero may loop,
but they should eventually recover after the entry is written.

Also other users may spin in the page table lock until we
"fixed" the PTE. This is ensured by always taking the page table lock
even for the swap cache case. Previously this was only done
on architectures with non atomic PTE accesses (such as 32bit PTE),
but now it is also done when this bug workaround is active.

I audited apply_pte_range and other users of arch_enter_lazy...
and they seem to all not clear the present bit.

Right now the extra flush is done in the low level
architecture code, while the higher level code still
does batched TLB flush. This means there is always one extra
unnecessary TLB flush now. As a followon optimization
this could be avoided by telling the callers that
the flush already happenend.

v2 (Lukasz Anaczkowski):
    () added call to smp_mb__after_atomic() to synchornize with
       switch_mm, based on Nadav's comment
    () fixed compilation breakage

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/hugetlb.h     |  9 ++++++++-
 arch/x86/include/asm/pgtable.h     |  5 +++++
 arch/x86/include/asm/pgtable_64.h  |  6 ++++++
 arch/x86/kernel/cpu/intel.c        | 10 ++++++++++
 arch/x86/mm/tlb.c                  | 22 ++++++++++++++++++++++
 include/linux/mm.h                 |  4 ++++
 mm/memory.c                        |  3 ++-
 8 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4a41348..2c48011 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -303,6 +303,7 @@
 #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
 #define X86_BUG_NULL_SEG	X86_BUG(9) /* Nulling a selector preserves the base */
 #define X86_BUG_SWAPGS_FENCE	X86_BUG(10) /* SWAPGS without input dep on GS */
+#define X86_BUG_PTE_LEAK        X86_BUG(11) /* PTE may leak A/D bits after clear */
 
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index 3a10616..58e1ca9 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -41,10 +41,17 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	set_pte_at(mm, addr, ptep, pte);
 }
 
+extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep);
+
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
 {
-	return ptep_get_and_clear(mm, addr, ptep);
+	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+
+	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
+		fix_pte_leak(mm, addr, ptep);
+	return pte;
 }
 
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1a27396..9769355 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -794,11 +794,16 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
 
+extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep);
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pte_t *ptep)
 {
 	pte_t pte = native_ptep_get_and_clear(ptep);
+	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
+		fix_pte_leak(mm, addr, ptep);
 	pte_update(mm, addr, ptep);
 	return pte;
 }
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee7811..6fa4079 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -178,6 +178,12 @@ extern void cleanup_highmap(void);
 extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
 extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
 
+#define ARCH_HAS_NEEDS_SWAP_PTL 1
+static inline bool arch_needs_swap_ptl(void)
+{
+       return boot_cpu_has_bug(X86_BUG_PTE_LEAK);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 6e2ffbe..f499513 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -181,6 +181,16 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		}
 	}
 
+	if (c->x86_model == 87) {
+		static bool printed;
+
+		if (!printed) {
+			pr_info("Enabling PTE leaking workaround\n");
+			printed = true;
+		}
+		set_cpu_bug(c, X86_BUG_PTE_LEAK);
+	}
+
 	/*
 	 * Intel Quark Core DevMan_001.pdf section 6.4.11
 	 * "The operating system also is required to invalidate (i.e., flush)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0..9b4c575 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -469,3 +469,25 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 late_initcall(create_tlb_single_page_flush_ceiling);
 
 #endif /* CONFIG_SMP */
+
+/*
+ * Workaround for KNL issue:
+ *
+ * A thread that is going to page fault due to P=0, may still
+ * non atomically set A or D bits, which could corrupt swap entries.
+ * Always flush the other CPUs and clear the PTE again to avoid
+ * this leakage. We are excluded using the pagetable lock.
+ */
+
+void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	smp_mb__after_atomic();
+	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
+		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+		flush_tlb_others(mm_cpumask(mm), mm, addr,
+				 addr + PAGE_SIZE);
+		mb();
+		set_pte(ptep, __pte(0));
+	}
+}
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5df5feb..5c80fe09 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2404,6 +2404,10 @@ static inline bool debug_guardpage_enabled(void) { return false; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#ifndef ARCH_HAS_NEEDS_SWAP_PTL
+static inline bool arch_needs_swap_ptl(void) { return false; }
+#endif
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
diff --git a/mm/memory.c b/mm/memory.c
index 15322b7..0d6ef39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1960,7 +1960,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
+	if (arch_needs_swap_ptl() ||
+	    sizeof(pte_t) > sizeof(unsigned long)) {
 		spinlock_t *ptl = pte_lockptr(mm, pmd);
 		spin_lock(ptl);
 		same = pte_same(*page_table, orig_pte);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 16:47 ` Nadav Amit
  2016-06-14 16:54   ` Anaczkowski, Lukasz
  2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
@ 2016-06-14 17:18   ` Dave Hansen
  2016-06-14 20:16     ` Nadav Amit
  2 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2016-06-14 17:18 UTC (permalink / raw)
  To: Nadav Amit, Lukasz Anaczkowski
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	harish.srinivasappa, lukasz.odzioba

On 06/14/2016 09:47 AM, Nadav Amit wrote:
> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
> 
>> > From: Andi Kleen <ak@linux.intel.com>
>> > +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> > +{
> Here there should be a call to smp_mb__after_atomic() to synchronize with
> switch_mm. I submitted a similar patch, which is still pending (hint).
> 
>> > +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>> > +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>> > +		flush_tlb_others(mm_cpumask(mm), mm, addr,
>> > +				 addr + PAGE_SIZE);
>> > +		mb();
>> > +		set_pte(ptep, __pte(0));
>> > +	}
>> > +}

Shouldn't that barrier be incorporated in the TLB flush code itself and
not every single caller (like this code is)?

It is insane to require individual TLB flushers to be concerned with the
barriers.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 15:58 [PATCH] Linux VM workaround for Knights Landing A/D leak Lukasz Anaczkowski
                   ` (2 preceding siblings ...)
  2016-06-14 16:58 ` kbuild test robot
@ 2016-06-14 17:19 ` Dave Hansen
  2016-06-15 13:06   ` Anaczkowski, Lukasz
  2016-06-14 17:47 ` kbuild test robot
  4 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2016-06-14 17:19 UTC (permalink / raw)
  To: Lukasz Anaczkowski, linux-kernel, linux-mm, tglx, mingo, ak,
	kirill.shutemov, mhocko, akpm, hpa
  Cc: harish.srinivasappa, lukasz.odzioba

...
> +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
> +			 pte_t *ptep);
> +
>  static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>  					    unsigned long addr, pte_t *ptep)
>  {
> -	return ptep_get_and_clear(mm, addr, ptep);
> +	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
> +
> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
> +		fix_pte_leak(mm, addr, ptep);
> +	return pte;
>  }
>  
>  static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 1a27396..9769355 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -794,11 +794,16 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  extern int ptep_clear_flush_young(struct vm_area_struct *vma,
>  				  unsigned long address, pte_t *ptep);
>  
> +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
> +			 pte_t *ptep);
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>  				       pte_t *ptep)
>  {
>  	pte_t pte = native_ptep_get_and_clear(ptep);
> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
> +		fix_pte_leak(mm, addr, ptep);
>  	pte_update(mm, addr, ptep);
>  	return pte;
>  }

Doesn't hugetlb.h somehow #include pgtable.h?  So why double-define
fix_pte_leak()?

> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 2ee7811..6fa4079 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -178,6 +178,12 @@ extern void cleanup_highmap(void);
>  extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
>  extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
>  
> +#define ARCH_HAS_NEEDS_SWAP_PTL 1
> +static inline bool arch_needs_swap_ptl(void)
> +{
> +       return boot_cpu_has_bug(X86_BUG_PTE_LEAK);
> +}
> +
>  #endif /* !__ASSEMBLY__ */

I think we need a comment on this sucker.  I'm not sure we should lean
solely on the commit message to record why we need this until the end of
time.

Or, refer over to fix_pte_leak() for a full description of what is going on.

>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 6e2ffbe..f499513 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -181,6 +181,16 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  		}
>  	}
>  
> +	if (c->x86_model == 87) {
> +		static bool printed;
> +
> +		if (!printed) {
> +			pr_info("Enabling PTE leaking workaround\n");
> +			printed = true;
> +		}
> +		set_cpu_bug(c, X86_BUG_PTE_LEAK);
> +	}

Please use the macros in here for the model id:

> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/arch/x86/include/asm/intel-family.h

We also probably want to prefix the pr_info() with something like
"x86/intel:".

> +/*
> + * Workaround for KNL issue:

Please be specific about what this "KNL issue" *is*.  Refer to the
public documentation of the erratum, please.

> + * A thread that is going to page fault due to P=0, may still
> + * non atomically set A or D bits, which could corrupt swap entries.
> + * Always flush the other CPUs and clear the PTE again to avoid
> + * this leakage. We are excluded using the pagetable lock.
> + */
> +
> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> +{
> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
> +				 addr + PAGE_SIZE);
> +		mb();
> +		set_pte(ptep, __pte(0));
> +	}
> +}

I think the comment here is a bit sparse.  Can we add some more details,
like:

	Entering here, the current CPU just cleared the PTE.  But,
	another thread may have raced and set the A or D bits, or be
	_about_ to set the bits.  Shooting their TLB entry down will
	ensure they see the cleared PTE and will not set A or D.

and by the set_pte():

	Clear the PTE one more time, in case the other thread set A/D
	before we sent the TLB flush.

>  #endif /* CONFIG_SMP */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5df5feb..5c80fe09 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2404,6 +2404,10 @@ static inline bool debug_guardpage_enabled(void) { return false; }
>  static inline bool page_is_guard(struct page *page) { return false; }
>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>  
> +#ifndef ARCH_HAS_NEEDS_SWAP_PTL
> +static inline bool arch_needs_swap_ptl(void) { return false; }
> +#endif
> +
>  #if MAX_NUMNODES > 1
>  void __init setup_nr_node_ids(void);
>  #else
> diff --git a/mm/memory.c b/mm/memory.c
> index 15322b7..0d6ef39 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1960,7 +1960,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
>  {
>  	int same = 1;
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
> -	if (sizeof(pte_t) > sizeof(unsigned long)) {
> +	if (arch_needs_swap_ptl() ||
> +	    sizeof(pte_t) > sizeof(unsigned long)) {
>  		spinlock_t *ptl = pte_lockptr(mm, pmd);
>  		spin_lock(ptl);
>  		same = pte_same(*page_table, orig_pte);
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
@ 2016-06-14 17:24     ` Dave Hansen
  2016-06-14 18:34       ` One Thousand Gnomes
  2016-06-14 18:10     ` Borislav Petkov
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2016-06-14 17:24 UTC (permalink / raw)
  To: Lukasz Anaczkowski, linux-kernel, linux-mm, tglx, mingo, ak,
	kirill.shutemov, mhocko, akpm, hpa
  Cc: harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk, lukasz.daniluk

On 06/14/2016 10:01 AM, Lukasz Anaczkowski wrote:
> v2 (Lukasz Anaczkowski):
>     () fixed compilation breakage
...

By unconditionally defining the workaround code, even on kernels where
there is no chance of ever hitting this bug.  I think that's a pretty
poor way to do it.

Can we please stick this in one of the intel.c files, so it's only
present on CPU_SUP_INTEL builds?

Which reminds me...

> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -178,6 +178,12 @@ extern void cleanup_highmap(void);
>  extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
>  extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
>  
> +#define ARCH_HAS_NEEDS_SWAP_PTL 1
> +static inline bool arch_needs_swap_ptl(void)
> +{
> +       return boot_cpu_has_bug(X86_BUG_PTE_LEAK);
> +}

Does this *REALLY* only affect 64-bit kernels?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 15:58 [PATCH] Linux VM workaround for Knights Landing A/D leak Lukasz Anaczkowski
                   ` (3 preceding siblings ...)
  2016-06-14 17:19 ` Dave Hansen
@ 2016-06-14 17:47 ` kbuild test robot
  4 siblings, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2016-06-14 17:47 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: kbuild-all, linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa, lukasz.anaczkowski,
	harish.srinivasappa, lukasz.odzioba

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

Hi,

[auto build test ERROR on v4.7-rc3]
[also build test ERROR on next-20160614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Lukasz-Anaczkowski/Linux-VM-workaround-for-Knights-Landing-A-D-leak/20160615-000610
config: i386-randconfig-r0-201624 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/built-in.o: In function `__ptep_modify_prot_start':
   paravirt.c:(.text+0xc0e3c): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `zap_pte_range':
>> memory.c:(.text+0xdb8b4): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `move_page_tables':
   (.text+0x1102ff): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `vunmap_page_range':
   vmalloc.c:(.text+0x125741): undefined reference to `fix_pte_leak'
   mm/built-in.o: In function `ptep_clear_flush':
   (.text+0x12e01f): undefined reference to `fix_pte_leak'
   mm/built-in.o:madvise.c:(.text+0x131138): more undefined references to `fix_pte_leak' follow

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 28300 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
  2016-06-14 17:24     ` Dave Hansen
@ 2016-06-14 18:10     ` Borislav Petkov
  2016-06-15 13:12       ` Anaczkowski, Lukasz
  2016-06-14 18:38     ` Nadav Amit
  2016-06-16 15:14     ` [PATCH v3] " Lukasz Anaczkowski
  3 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2016-06-14 18:10 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa, harish.srinivasappa,
	lukasz.odzioba, grzegorz.andrejczuk, lukasz.daniluk

On Tue, Jun 14, 2016 at 07:01:12PM +0200, Lukasz Anaczkowski wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Knights Landing has a issue that a thread setting A or D bits
> may not do so atomically against checking the present bit.
> A thread which is going to page fault may still set those
> bits, even though the present bit was already atomically cleared.
> 
> This implies that when the kernel clears present atomically,
> some time later the supposed to be zero entry could be corrupted
> with stray A or D bits.
> 
> Since the PTE could be already used for storing a swap index,
> or a NUMA migration index, this cannot be tolerated. Most
> of the time the kernel detects the problem, but in some
> rare cases it may not.
> 
> This patch enforces that the page unmap path in vmscan/direct reclaim
> always flushes other CPUs after clearing each page, and also
> clears the PTE again after the flush.
> 
> For reclaim this brings the performance back to before Mel's
> flushing changes, but for unmap it disables batching.
> 
> This makes sure any leaked A/D bits are immediately cleared before the entry
> is used for something else.
> 
> Any parallel faults that check for entry is zero may loop,
> but they should eventually recover after the entry is written.
> 
> Also other users may spin in the page table lock until we
> "fixed" the PTE. This is ensured by always taking the page table lock
> even for the swap cache case. Previously this was only done
> on architectures with non atomic PTE accesses (such as 32bit PTE),
> but now it is also done when this bug workaround is active.
> 
> I audited apply_pte_range and other users of arch_enter_lazy...
> and they seem to all not clear the present bit.
> 
> Right now the extra flush is done in the low level
> architecture code, while the higher level code still
> does batched TLB flush. This means there is always one extra
> unnecessary TLB flush now. As a followon optimization
> this could be avoided by telling the callers that
> the flush already happenend.
> 
> v2 (Lukasz Anaczkowski):
>     () added call to smp_mb__after_atomic() to synchornize with
>        switch_mm, based on Nadav's comment
>     () fixed compilation breakage
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
> ---
>  arch/x86/include/asm/cpufeatures.h |  1 +
>  arch/x86/include/asm/hugetlb.h     |  9 ++++++++-
>  arch/x86/include/asm/pgtable.h     |  5 +++++
>  arch/x86/include/asm/pgtable_64.h  |  6 ++++++
>  arch/x86/kernel/cpu/intel.c        | 10 ++++++++++
>  arch/x86/mm/tlb.c                  | 22 ++++++++++++++++++++++
>  include/linux/mm.h                 |  4 ++++
>  mm/memory.c                        |  3 ++-
>  8 files changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 4a41348..2c48011 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -303,6 +303,7 @@
>  #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
>  #define X86_BUG_NULL_SEG	X86_BUG(9) /* Nulling a selector preserves the base */
>  #define X86_BUG_SWAPGS_FENCE	X86_BUG(10) /* SWAPGS without input dep on GS */
> +#define X86_BUG_PTE_LEAK        X86_BUG(11) /* PTE may leak A/D bits after clear */
>  
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> index 3a10616..58e1ca9 100644
> --- a/arch/x86/include/asm/hugetlb.h
> +++ b/arch/x86/include/asm/hugetlb.h
> @@ -41,10 +41,17 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>  	set_pte_at(mm, addr, ptep, pte);
>  }
>  
> +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
> +			 pte_t *ptep);
> +
>  static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>  					    unsigned long addr, pte_t *ptep)
>  {
> -	return ptep_get_and_clear(mm, addr, ptep);
> +	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
> +
> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))

static_cpu_has_bug()

> +		fix_pte_leak(mm, addr, ptep);
> +	return pte;
>  }
>  
>  static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 1a27396..9769355 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -794,11 +794,16 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  extern int ptep_clear_flush_young(struct vm_area_struct *vma,
>  				  unsigned long address, pte_t *ptep);
>  
> +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
> +			 pte_t *ptep);
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>  				       pte_t *ptep)
>  {
>  	pte_t pte = native_ptep_get_and_clear(ptep);
> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))

static_cpu_has_bug()

> +		fix_pte_leak(mm, addr, ptep);
>  	pte_update(mm, addr, ptep);
>  	return pte;
>  }
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 2ee7811..6fa4079 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -178,6 +178,12 @@ extern void cleanup_highmap(void);
>  extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
>  extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
>  
> +#define ARCH_HAS_NEEDS_SWAP_PTL 1
> +static inline bool arch_needs_swap_ptl(void)
> +{
> +       return boot_cpu_has_bug(X86_BUG_PTE_LEAK);

static_cpu_has_bug()

> +}
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 6e2ffbe..f499513 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -181,6 +181,16 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  		}
>  	}
>  
> +	if (c->x86_model == 87) {

That should be INTEL_FAM6_XEON_PHI_KNL, AFAICT.

> +		static bool printed;
> +
> +		if (!printed) {
> +			pr_info("Enabling PTE leaking workaround\n");
> +			printed = true;
> +		}

pr_info_once

> +		set_cpu_bug(c, X86_BUG_PTE_LEAK);
> +	}
> +
>  	/*
>  	 * Intel Quark Core DevMan_001.pdf section 6.4.11
>  	 * "The operating system also is required to invalidate (i.e., flush)
-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:24     ` Dave Hansen
@ 2016-06-14 18:34       ` One Thousand Gnomes
  2016-06-14 18:54         ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: One Thousand Gnomes @ 2016-06-14 18:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Lukasz Anaczkowski, linux-kernel, linux-mm, tglx, mingo, ak,
	kirill.shutemov, mhocko, akpm, hpa, harish.srinivasappa,
	lukasz.odzioba, grzegorz.andrejczuk, lukasz.daniluk

On Tue, 14 Jun 2016 10:24:16 -0700
Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 06/14/2016 10:01 AM, Lukasz Anaczkowski wrote:
> > v2 (Lukasz Anaczkowski):
> >     () fixed compilation breakage  
> ...
> 
> By unconditionally defining the workaround code, even on kernels where
> there is no chance of ever hitting this bug.  I think that's a pretty
> poor way to do it.
> 
> Can we please stick this in one of the intel.c files, so it's only
> present on CPU_SUP_INTEL builds?

Can we please make it use alternatives or somesuch so that it just goes
away at boot if its not a Knights Landing box ?

Alan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
  2016-06-14 17:24     ` Dave Hansen
  2016-06-14 18:10     ` Borislav Petkov
@ 2016-06-14 18:38     ` Nadav Amit
  2016-06-15 13:12       ` Anaczkowski, Lukasz
  2016-06-16 15:14     ` [PATCH v3] " Lukasz Anaczkowski
  3 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2016-06-14 18:38 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, Dave Hansen, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:

> From: Andi Kleen <ak@linux.intel.com>

> static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> 					    unsigned long addr, pte_t *ptep)
> {
> -	return ptep_get_and_clear(mm, addr, ptep);
> +	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
> +
> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
> +		fix_pte_leak(mm, addr, ptep);
> +	return pte;
> }

I missed it on the previous iteration: ptep_get_and_clear already calls 
fix_pte_leak when needed. So do you need to call it again here?

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 18:34       ` One Thousand Gnomes
@ 2016-06-14 18:54         ` Dave Hansen
  2016-06-14 19:19           ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2016-06-14 18:54 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Lukasz Anaczkowski, linux-kernel, linux-mm, tglx, mingo, ak,
	kirill.shutemov, mhocko, akpm, hpa, harish.srinivasappa,
	lukasz.odzioba, grzegorz.andrejczuk, lukasz.daniluk,
	Borislav Petkov

On 06/14/2016 11:34 AM, One Thousand Gnomes wrote:
> On Tue, 14 Jun 2016 10:24:16 -0700
> Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
>> On 06/14/2016 10:01 AM, Lukasz Anaczkowski wrote:
>>> v2 (Lukasz Anaczkowski):
>>>     () fixed compilation breakage  
>> ...
>>
>> By unconditionally defining the workaround code, even on kernels where
>> there is no chance of ever hitting this bug.  I think that's a pretty
>> poor way to do it.
>>
>> Can we please stick this in one of the intel.c files, so it's only
>> present on CPU_SUP_INTEL builds?
> 
> Can we please make it use alternatives or somesuch so that it just goes
> away at boot if its not a Knights Landing box ?

Lukasz, Borislav suggested using static_cpu_has_bug(), which will do the
alternatives patching.  It's definitely the right thing to use here.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 18:54         ` Dave Hansen
@ 2016-06-14 19:19           ` Borislav Petkov
  2016-06-14 20:20             ` H. Peter Anvin
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2016-06-14 19:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: One Thousand Gnomes, Lukasz Anaczkowski, linux-kernel, linux-mm,
	tglx, mingo, ak, kirill.shutemov, mhocko, akpm, hpa,
	harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On Tue, Jun 14, 2016 at 11:54:24AM -0700, Dave Hansen wrote:
> Lukasz, Borislav suggested using static_cpu_has_bug(), which will do the
> alternatives patching.  It's definitely the right thing to use here.

Yeah, either that or do an

alternative_call(null_func, fix_pte_peak, X86_BUG_PTE_LEAK, ...)

or so and you'll need a dummy function to call on !X86_BUG_PTE_LEAK
CPUs.

The static_cpu_has_bug() thing should be most likely a penalty
of a single JMP (I have to look at the asm) but then since the
callers are inlined, you'll have to patch all those places where
*ptep_get_and_clear() get inlined.

Shouldn't be a big deal still but...

"debug-alternative" and a kvm guest should help you there to get a quick
idea.

HTH.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:18   ` [PATCH] " Dave Hansen
@ 2016-06-14 20:16     ` Nadav Amit
  2016-06-14 21:37       ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2016-06-14 20:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner, Ingo Molnar,
	ak, kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	harish.srinivasappa, lukasz.odzioba

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>> 
>>>> From: Andi Kleen <ak@linux.intel.com>
>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>> +{
>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>> switch_mm. I submitted a similar patch, which is still pending (hint).
>> 
>>>> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>> +				 addr + PAGE_SIZE);
>>>> +		mb();
>>>> +		set_pte(ptep, __pte(0));
>>>> +	}
>>>> +}
> 
> Shouldn't that barrier be incorporated in the TLB flush code itself and
> not every single caller (like this code is)?
> 
> It is insane to require individual TLB flushers to be concerned with the
> barriers.

IMHO it is best to use existing flushing interfaces instead of creating
new ones. 

In theory, fix_pte_leak could have used flush_tlb_page. But the problem
is that flush_tlb_page requires the vm_area_struct as an argument, which
ptep_get_and_clear (and others) do not have.

I don’t know which architecture needs the vm_area_struct, since x86 and
some others I looked at (e.g., ARM) only need the mm_struct.

Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 19:19           ` Borislav Petkov
@ 2016-06-14 20:20             ` H. Peter Anvin
  2016-06-14 20:47               ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: H. Peter Anvin @ 2016-06-14 20:20 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: One Thousand Gnomes, Lukasz Anaczkowski, linux-kernel, linux-mm,
	tglx, mingo, ak, kirill.shutemov, mhocko, akpm,
	harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On 06/14/16 12:19, Borislav Petkov wrote:
> On Tue, Jun 14, 2016 at 11:54:24AM -0700, Dave Hansen wrote:
>> Lukasz, Borislav suggested using static_cpu_has_bug(), which will do the
>> alternatives patching.  It's definitely the right thing to use here.
> 
> Yeah, either that or do an
> 
> alternative_call(null_func, fix_pte_peak, X86_BUG_PTE_LEAK, ...)
> 
> or so and you'll need a dummy function to call on !X86_BUG_PTE_LEAK
> CPUs.
> 
> The static_cpu_has_bug() thing should be most likely a penalty
> of a single JMP (I have to look at the asm) but then since the
> callers are inlined, you'll have to patch all those places where
> *ptep_get_and_clear() get inlined.
> 
> Shouldn't be a big deal still but...
> 
> "debug-alternative" and a kvm guest should help you there to get a quick
> idea.
> 

static_cpu_has_bug() should turn into 5-byte NOP in the common (bugless)
case.

	-hpa

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 20:20             ` H. Peter Anvin
@ 2016-06-14 20:47               ` Borislav Petkov
  2016-06-14 20:54                 ` H. Peter Anvin
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2016-06-14 20:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Hansen, One Thousand Gnomes, Lukasz Anaczkowski,
	linux-kernel, linux-mm, tglx, mingo, ak, kirill.shutemov, mhocko,
	akpm, harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On Tue, Jun 14, 2016 at 01:20:06PM -0700, H. Peter Anvin wrote:
> static_cpu_has_bug() should turn into 5-byte NOP in the common (bugless)
> case.

Yeah, it does. I looked at the asm.

I wasn't 100% sure because I vaguely remember gcc reordering things in
some pathological case but I'm most likely remembering wrong because if
it were doing that, then the whole nopping out won't work. F'get about
it. :)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 20:47               ` Borislav Petkov
@ 2016-06-14 20:54                 ` H. Peter Anvin
  2016-06-14 21:02                   ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: H. Peter Anvin @ 2016-06-14 20:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, One Thousand Gnomes, Lukasz Anaczkowski,
	linux-kernel, linux-mm, tglx, mingo, ak, kirill.shutemov, mhocko,
	akpm, harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On 06/14/16 13:47, Borislav Petkov wrote:
> On Tue, Jun 14, 2016 at 01:20:06PM -0700, H. Peter Anvin wrote:
>> static_cpu_has_bug() should turn into 5-byte NOP in the common (bugless)
>> case.
> 
> Yeah, it does. I looked at the asm.
> 
> I wasn't 100% sure because I vaguely remember gcc reordering things in
> some pathological case but I'm most likely remembering wrong because if
> it were doing that, then the whole nopping out won't work. F'get about
> it. :)
> 

There was that.  It is still possible that we end up with NOP a JMP
right before another JMP; we could perhaps make the patching code
smarter and see if we have a JMP immediately after.

	-hpa

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 20:54                 ` H. Peter Anvin
@ 2016-06-14 21:02                   ` Borislav Petkov
  2016-06-14 21:08                     ` H. Peter Anvin
  2016-06-14 21:13                     ` H. Peter Anvin
  0 siblings, 2 replies; 38+ messages in thread
From: Borislav Petkov @ 2016-06-14 21:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Hansen, One Thousand Gnomes, Lukasz Anaczkowski,
	linux-kernel, linux-mm, tglx, mingo, ak, kirill.shutemov, mhocko,
	akpm, harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On Tue, Jun 14, 2016 at 01:54:25PM -0700, H. Peter Anvin wrote:
> There was that.  It is still possible that we end up with NOP a JMP
> right before another JMP; we could perhaps make the patching code
> smarter and see if we have a JMP immediately after.

Yeah, I still can't get reproduce that reliably - I remember seeing it
at some point but then dismissing it for another, higher-prio thing. And
now the whole memory is hazy at best.

But, you're giving me a great idea right now - I have this kernel
disassembler tool which dumps alternative sections already and I could
teach it to look for pathological cases around the patching sites and
scream.

Something for my TODO list when I get a quiet moment.

Thanks!

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 21:02                   ` Borislav Petkov
@ 2016-06-14 21:08                     ` H. Peter Anvin
  2016-06-14 21:13                     ` H. Peter Anvin
  1 sibling, 0 replies; 38+ messages in thread
From: H. Peter Anvin @ 2016-06-14 21:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, One Thousand Gnomes, Lukasz Anaczkowski,
	linux-kernel, linux-mm, tglx, mingo, ak, kirill.shutemov, mhocko,
	akpm, harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On 06/14/16 14:02, Borislav Petkov wrote:
> On Tue, Jun 14, 2016 at 01:54:25PM -0700, H. Peter Anvin wrote:
>> There was that.  It is still possible that we end up with NOP a JMP
>> right before another JMP; we could perhaps make the patching code
>> smarter and see if we have a JMP immediately after.
> 
> Yeah, I still can't get reproduce that reliably - I remember seeing it
> at some point but then dismissing it for another, higher-prio thing. And
> now the whole memory is hazy at best.
> 
> But, you're giving me a great idea right now - I have this kernel
> disassembler tool which dumps alternative sections already and I could
> teach it to look for pathological cases around the patching sites and
> scream.
> 
> Something for my TODO list when I get a quiet moment.
> 

It's not really pathological; the issue is that asm goto() with an
unreachable clause after it doesn't tell gcc that a certain code path
ought to be linear, so we tell it to fall through.  However, if gcc then
wants to have a jump there for whatever reason (perhaps it is part of a
loop) we end up with a redundant jump, so a patch site followed by a JMP
is entirely reasonable.

	-hpa

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 21:02                   ` Borislav Petkov
  2016-06-14 21:08                     ` H. Peter Anvin
@ 2016-06-14 21:13                     ` H. Peter Anvin
  1 sibling, 0 replies; 38+ messages in thread
From: H. Peter Anvin @ 2016-06-14 21:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, One Thousand Gnomes, Lukasz Anaczkowski,
	linux-kernel, linux-mm, tglx, mingo, ak, kirill.shutemov, mhocko,
	akpm, harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk

On June 14, 2016 2:02:55 PM PDT, Borislav Petkov <bp@alien8.de> wrote:
>On Tue, Jun 14, 2016 at 01:54:25PM -0700, H. Peter Anvin wrote:
>> There was that.  It is still possible that we end up with NOP a JMP
>> right before another JMP; we could perhaps make the patching code
>> smarter and see if we have a JMP immediately after.
>
>Yeah, I still can't get reproduce that reliably - I remember seeing it
>at some point but then dismissing it for another, higher-prio thing.
>And
>now the whole memory is hazy at best.
>
>But, you're giving me a great idea right now - I have this kernel
>disassembler tool which dumps alternative sections already and I could
>teach it to look for pathological cases around the patching sites and
>scream.
>
>Something for my TODO list when I get a quiet moment.
>
>Thanks!

We talked with the GCC people about always bias asm goto toward the first label even if followed by __builtin_unreachable().  I don't know if that happened; if so we should probably insert the unreachable for those versions of gcc only.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 20:16     ` Nadav Amit
@ 2016-06-14 21:37       ` Dave Hansen
  2016-06-15  2:20         ` Andy Lutomirski
  2016-06-15  3:20         ` Nadav Amit
  0 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2016-06-14 21:37 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner, Ingo Molnar,
	ak, kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	harish.srinivasappa, lukasz.odzioba, Andy Lutomirski

On 06/14/2016 01:16 PM, Nadav Amit wrote:
> Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>
>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>> +{
>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>
>>>>> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>> +				 addr + PAGE_SIZE);
>>>>> +		mb();
>>>>> +		set_pte(ptep, __pte(0));
>>>>> +	}
>>>>> +}
>>
>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>> not every single caller (like this code is)?
>>
>> It is insane to require individual TLB flushers to be concerned with the
>> barriers.
> 
> IMHO it is best to use existing flushing interfaces instead of creating
> new ones. 

Yeah, or make these things a _little_ harder to get wrong.  That little
snippet above isn't so crazy that we should be depending on open-coded
barriers to get it right.

Should we just add a barrier to mm_cpumask() itself?  That should stop
the race.  Or maybe we need a new primitive like:

/*
 * Call this if a full barrier has been executed since the last
 * pagetable modification operation.
 */
static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
{
	/* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
	return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
		nr_cpu_ids;
}


static int other_cpus_need_tlb_flush(struct mm_struct *mm)
{
	/*
	 * Synchronizes with switch_mm.  Makes sure that we do not
	 * observe a bit having been cleared in mm_cpumask() before
 	 * the other processor has seen our pagetable update.  See
	 * switch_mm().
	 */
	smp_mb__after_atomic();

	return __other_cpus_need_tlb_flush(mm)
}

We should be able to deploy other_cpus_need_tlb_flush() in most of the
cases where we are doing "cpumask_any_but(mm_cpumask(mm),
smp_processor_id()) < nr_cpu_ids".

Right?

> In theory, fix_pte_leak could have used flush_tlb_page. But the problem
> is that flush_tlb_page requires the vm_area_struct as an argument, which
> ptep_get_and_clear (and others) do not have.

That, and we do not want/need to flush the _current_ processor's TLB.
flush_tlb_page() would have done that unnecessarily.  That's not the end
of the world here, but it is a downside.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 21:37       ` Dave Hansen
@ 2016-06-15  2:20         ` Andy Lutomirski
  2016-06-15  2:35           ` Nadav Amit
  2016-06-15  3:20         ` Nadav Amit
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-06-15  2:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nadav Amit, Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner,
	Ingo Molnar, Andi Kleen, Kirill A. Shutemov, Michal Hocko,
	Andrew Morton, H. Peter Anvin, harish.srinivasappa,
	lukasz.odzioba

On Tue, Jun 14, 2016 at 2:37 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 06/14/2016 01:16 PM, Nadav Amit wrote:
>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>
>>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>>
>>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>>> +{
>>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>>
>>>>>> + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>>> +         trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>>> +         flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>>> +                          addr + PAGE_SIZE);
>>>>>> +         mb();
>>>>>> +         set_pte(ptep, __pte(0));
>>>>>> + }
>>>>>> +}
>>>
>>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>>> not every single caller (like this code is)?
>>>
>>> It is insane to require individual TLB flushers to be concerned with the
>>> barriers.
>>
>> IMHO it is best to use existing flushing interfaces instead of creating
>> new ones.
>
> Yeah, or make these things a _little_ harder to get wrong.  That little
> snippet above isn't so crazy that we should be depending on open-coded
> barriers to get it right.
>
> Should we just add a barrier to mm_cpumask() itself?  That should stop
> the race.  Or maybe we need a new primitive like:
>
> /*
>  * Call this if a full barrier has been executed since the last
>  * pagetable modification operation.
>  */
> static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
> {
>         /* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
>         return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
>                 nr_cpu_ids;
> }
>
>
> static int other_cpus_need_tlb_flush(struct mm_struct *mm)
> {
>         /*
>          * Synchronizes with switch_mm.  Makes sure that we do not
>          * observe a bit having been cleared in mm_cpumask() before
>          * the other processor has seen our pagetable update.  See
>          * switch_mm().
>          */
>         smp_mb__after_atomic();
>
>         return __other_cpus_need_tlb_flush(mm)
> }
>
> We should be able to deploy other_cpus_need_tlb_flush() in most of the
> cases where we are doing "cpumask_any_but(mm_cpumask(mm),
> smp_processor_id()) < nr_cpu_ids".

IMO this is a bit nuts.  smp_mb__after_atomic() doesn't do anything on
x86.  And, even if it did, why should the flush code assume that the
previous store was atomic?

What's the issue being fixed / worked around here?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-15  2:20         ` Andy Lutomirski
@ 2016-06-15  2:35           ` Nadav Amit
  2016-06-15  2:36             ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2016-06-15  2:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner,
	Ingo Molnar, Andi Kleen, Kirill A. Shutemov, Michal Hocko,
	Andrew Morton, H. Peter Anvin, harish.srinivasappa,
	lukasz.odzioba

Andy Lutomirski <luto@amacapital.net> wrote:

> On Tue, Jun 14, 2016 at 2:37 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> On 06/14/2016 01:16 PM, Nadav Amit wrote:
>>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>> 
>>>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>>> 
>>>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>>>> +{
>>>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>>> 
>>>>>>> + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>>>> +         trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>>>> +         flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>>>> +                          addr + PAGE_SIZE);
>>>>>>> +         mb();
>>>>>>> +         set_pte(ptep, __pte(0));
>>>>>>> + }
>>>>>>> +}
>>>> 
>>>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>>>> not every single caller (like this code is)?
>>>> 
>>>> It is insane to require individual TLB flushers to be concerned with the
>>>> barriers.
>>> 
>>> IMHO it is best to use existing flushing interfaces instead of creating
>>> new ones.
>> 
>> Yeah, or make these things a _little_ harder to get wrong.  That little
>> snippet above isn't so crazy that we should be depending on open-coded
>> barriers to get it right.
>> 
>> Should we just add a barrier to mm_cpumask() itself?  That should stop
>> the race.  Or maybe we need a new primitive like:
>> 
>> /*
>> * Call this if a full barrier has been executed since the last
>> * pagetable modification operation.
>> */
>> static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
>> {
>>        /* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
>>        return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
>>                nr_cpu_ids;
>> }
>> 
>> 
>> static int other_cpus_need_tlb_flush(struct mm_struct *mm)
>> {
>>        /*
>>         * Synchronizes with switch_mm.  Makes sure that we do not
>>         * observe a bit having been cleared in mm_cpumask() before
>>         * the other processor has seen our pagetable update.  See
>>         * switch_mm().
>>         */
>>        smp_mb__after_atomic();
>> 
>>        return __other_cpus_need_tlb_flush(mm)
>> }
>> 
>> We should be able to deploy other_cpus_need_tlb_flush() in most of the
>> cases where we are doing "cpumask_any_but(mm_cpumask(mm),
>> smp_processor_id()) < nr_cpu_ids".
> 
> IMO this is a bit nuts.  smp_mb__after_atomic() doesn't do anything on
> x86.  And, even if it did, why should the flush code assume that the
> previous store was atomic?
> 
> What's the issue being fixed / worked around here?

It does a compiler barrier, which prevents the decision whether a
remote TLB shootdown is required to be made before the PTE is set.

I agree that PTEs may not be written atomically in certain cases
(although I am unaware of such cases, except on full-mm flush).

Having said that, I think that all the TLB flush/shootdown logic
should not be open-coded at all and be left in the arch-specific
implementation. People keep making small mistakes when they 
reimplement the flushing logic.

This patch, for example, also has a bug in the way it traces the
flush - it marks full flush, when it flushes a single page:
>>>>>>> trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);


Regards,
Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-15  2:35           ` Nadav Amit
@ 2016-06-15  2:36             ` Andy Lutomirski
  2016-06-15  2:44               ` Nadav Amit
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-06-15  2:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner,
	Ingo Molnar, Andi Kleen, Kirill A. Shutemov, Michal Hocko,
	Andrew Morton, H. Peter Anvin, harish.srinivasappa,
	lukasz.odzioba

On Tue, Jun 14, 2016 at 7:35 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Tue, Jun 14, 2016 at 2:37 PM, Dave Hansen
>> <dave.hansen@linux.intel.com> wrote:
>>> On 06/14/2016 01:16 PM, Nadav Amit wrote:
>>>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>>>
>>>>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>>>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>>>>
>>>>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>>>>> +{
>>>>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>>>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>>>>
>>>>>>>> + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>>>>> +         trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>>>>> +         flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>>>>> +                          addr + PAGE_SIZE);
>>>>>>>> +         mb();
>>>>>>>> +         set_pte(ptep, __pte(0));
>>>>>>>> + }
>>>>>>>> +}
>>>>>
>>>>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>>>>> not every single caller (like this code is)?
>>>>>
>>>>> It is insane to require individual TLB flushers to be concerned with the
>>>>> barriers.
>>>>
>>>> IMHO it is best to use existing flushing interfaces instead of creating
>>>> new ones.
>>>
>>> Yeah, or make these things a _little_ harder to get wrong.  That little
>>> snippet above isn't so crazy that we should be depending on open-coded
>>> barriers to get it right.
>>>
>>> Should we just add a barrier to mm_cpumask() itself?  That should stop
>>> the race.  Or maybe we need a new primitive like:
>>>
>>> /*
>>> * Call this if a full barrier has been executed since the last
>>> * pagetable modification operation.
>>> */
>>> static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
>>> {
>>>        /* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
>>>        return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
>>>                nr_cpu_ids;
>>> }
>>>
>>>
>>> static int other_cpus_need_tlb_flush(struct mm_struct *mm)
>>> {
>>>        /*
>>>         * Synchronizes with switch_mm.  Makes sure that we do not
>>>         * observe a bit having been cleared in mm_cpumask() before
>>>         * the other processor has seen our pagetable update.  See
>>>         * switch_mm().
>>>         */
>>>        smp_mb__after_atomic();
>>>
>>>        return __other_cpus_need_tlb_flush(mm)
>>> }
>>>
>>> We should be able to deploy other_cpus_need_tlb_flush() in most of the
>>> cases where we are doing "cpumask_any_but(mm_cpumask(mm),
>>> smp_processor_id()) < nr_cpu_ids".
>>
>> IMO this is a bit nuts.  smp_mb__after_atomic() doesn't do anything on
>> x86.  And, even if it did, why should the flush code assume that the
>> previous store was atomic?
>>
>> What's the issue being fixed / worked around here?
>
> It does a compiler barrier, which prevents the decision whether a
> remote TLB shootdown is required to be made before the PTE is set.
>
> I agree that PTEs may not be written atomically in certain cases
> (although I am unaware of such cases, except on full-mm flush).

How about plain set_pte?  It's atomic (aligned word-sized write), but
it's not atomic in the _after_atomic sense.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-15  2:36             ` Andy Lutomirski
@ 2016-06-15  2:44               ` Nadav Amit
  2016-06-15  3:09                 ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2016-06-15  2:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner,
	Ingo Molnar, Andi Kleen, Kirill A. Shutemov, Michal Hocko,
	Andrew Morton, H. Peter Anvin, harish.srinivasappa,
	lukasz.odzioba

Andy Lutomirski <luto@amacapital.net> wrote:

> On Tue, Jun 14, 2016 at 7:35 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>> On Tue, Jun 14, 2016 at 2:37 PM, Dave Hansen
>>> <dave.hansen@linux.intel.com> wrote:
>>>> On 06/14/2016 01:16 PM, Nadav Amit wrote:
>>>>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>>>> 
>>>>>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>>>>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>>>>> 
>>>>>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>>>>>> +{
>>>>>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>>>>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>>>>> 
>>>>>>>>> + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>>>>>> +         trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>>>>>> +         flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>>>>>> +                          addr + PAGE_SIZE);
>>>>>>>>> +         mb();
>>>>>>>>> +         set_pte(ptep, __pte(0));
>>>>>>>>> + }
>>>>>>>>> +}
>>>>>> 
>>>>>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>>>>>> not every single caller (like this code is)?
>>>>>> 
>>>>>> It is insane to require individual TLB flushers to be concerned with the
>>>>>> barriers.
>>>>> 
>>>>> IMHO it is best to use existing flushing interfaces instead of creating
>>>>> new ones.
>>>> 
>>>> Yeah, or make these things a _little_ harder to get wrong.  That little
>>>> snippet above isn't so crazy that we should be depending on open-coded
>>>> barriers to get it right.
>>>> 
>>>> Should we just add a barrier to mm_cpumask() itself?  That should stop
>>>> the race.  Or maybe we need a new primitive like:
>>>> 
>>>> /*
>>>> * Call this if a full barrier has been executed since the last
>>>> * pagetable modification operation.
>>>> */
>>>> static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
>>>> {
>>>>       /* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
>>>>       return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
>>>>               nr_cpu_ids;
>>>> }
>>>> 
>>>> 
>>>> static int other_cpus_need_tlb_flush(struct mm_struct *mm)
>>>> {
>>>>       /*
>>>>        * Synchronizes with switch_mm.  Makes sure that we do not
>>>>        * observe a bit having been cleared in mm_cpumask() before
>>>>        * the other processor has seen our pagetable update.  See
>>>>        * switch_mm().
>>>>        */
>>>>       smp_mb__after_atomic();
>>>> 
>>>>       return __other_cpus_need_tlb_flush(mm)
>>>> }
>>>> 
>>>> We should be able to deploy other_cpus_need_tlb_flush() in most of the
>>>> cases where we are doing "cpumask_any_but(mm_cpumask(mm),
>>>> smp_processor_id()) < nr_cpu_ids".
>>> 
>>> IMO this is a bit nuts.  smp_mb__after_atomic() doesn't do anything on
>>> x86.  And, even if it did, why should the flush code assume that the
>>> previous store was atomic?
>>> 
>>> What's the issue being fixed / worked around here?
>> 
>> It does a compiler barrier, which prevents the decision whether a
>> remote TLB shootdown is required to be made before the PTE is set.
>> 
>> I agree that PTEs may not be written atomically in certain cases
>> (although I am unaware of such cases, except on full-mm flush).
> 
> How about plain set_pte?  It's atomic (aligned word-sized write), but
> it's not atomic in the _after_atomic sense.

Can you point me to a place where set_pte is used before a TLB
invalidation/shootdown, excluding this patch and the fullmm case?

I am not claiming there is no such case, but I am unaware of such
one. PTEs are cleared on SMP using xchg, and similarly the dirty bit
is cleared with an atomic operation.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-15  2:44               ` Nadav Amit
@ 2016-06-15  3:09                 ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-06-15  3:09 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Dave Hansen, Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner,
	Ingo Molnar, Andi Kleen, Kirill A. Shutemov, Michal Hocko,
	Andrew Morton, H. Peter Anvin, harish.srinivasappa,
	lukasz.odzioba

On Tue, Jun 14, 2016 at 7:44 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Tue, Jun 14, 2016 at 7:35 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>>> On Tue, Jun 14, 2016 at 2:37 PM, Dave Hansen
>>>> <dave.hansen@linux.intel.com> wrote:
>>>>> On 06/14/2016 01:16 PM, Nadav Amit wrote:
>>>>>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>>>>>
>>>>>>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>>>>>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>>>>>>
>>>>>>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>>>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>>>>>>> +{
>>>>>>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>>>>>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>>>>>>
>>>>>>>>>> + if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>>>>>>> +         trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>>>>>>> +         flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>>>>>>> +                          addr + PAGE_SIZE);
>>>>>>>>>> +         mb();
>>>>>>>>>> +         set_pte(ptep, __pte(0));
>>>>>>>>>> + }
>>>>>>>>>> +}
>>>>>>>
>>>>>>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>>>>>>> not every single caller (like this code is)?
>>>>>>>
>>>>>>> It is insane to require individual TLB flushers to be concerned with the
>>>>>>> barriers.
>>>>>>
>>>>>> IMHO it is best to use existing flushing interfaces instead of creating
>>>>>> new ones.
>>>>>
>>>>> Yeah, or make these things a _little_ harder to get wrong.  That little
>>>>> snippet above isn't so crazy that we should be depending on open-coded
>>>>> barriers to get it right.
>>>>>
>>>>> Should we just add a barrier to mm_cpumask() itself?  That should stop
>>>>> the race.  Or maybe we need a new primitive like:
>>>>>
>>>>> /*
>>>>> * Call this if a full barrier has been executed since the last
>>>>> * pagetable modification operation.
>>>>> */
>>>>> static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
>>>>> {
>>>>>       /* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
>>>>>       return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
>>>>>               nr_cpu_ids;
>>>>> }
>>>>>
>>>>>
>>>>> static int other_cpus_need_tlb_flush(struct mm_struct *mm)
>>>>> {
>>>>>       /*
>>>>>        * Synchronizes with switch_mm.  Makes sure that we do not
>>>>>        * observe a bit having been cleared in mm_cpumask() before
>>>>>        * the other processor has seen our pagetable update.  See
>>>>>        * switch_mm().
>>>>>        */
>>>>>       smp_mb__after_atomic();
>>>>>
>>>>>       return __other_cpus_need_tlb_flush(mm)
>>>>> }
>>>>>
>>>>> We should be able to deploy other_cpus_need_tlb_flush() in most of the
>>>>> cases where we are doing "cpumask_any_but(mm_cpumask(mm),
>>>>> smp_processor_id()) < nr_cpu_ids".
>>>>
>>>> IMO this is a bit nuts.  smp_mb__after_atomic() doesn't do anything on
>>>> x86.  And, even if it did, why should the flush code assume that the
>>>> previous store was atomic?
>>>>
>>>> What's the issue being fixed / worked around here?
>>>
>>> It does a compiler barrier, which prevents the decision whether a
>>> remote TLB shootdown is required to be made before the PTE is set.
>>>
>>> I agree that PTEs may not be written atomically in certain cases
>>> (although I am unaware of such cases, except on full-mm flush).
>>
>> How about plain set_pte?  It's atomic (aligned word-sized write), but
>> it's not atomic in the _after_atomic sense.
>
> Can you point me to a place where set_pte is used before a TLB
> invalidation/shootdown, excluding this patch and the fullmm case?
>
> I am not claiming there is no such case, but I am unaware of such
> one. PTEs are cleared on SMP using xchg, and similarly the dirty bit
> is cleared with an atomic operation.
>

Hmm, you may be right.  I still think this is all disgusting, but I
don't have any better ideas.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 21:37       ` Dave Hansen
  2016-06-15  2:20         ` Andy Lutomirski
@ 2016-06-15  3:20         ` Nadav Amit
  1 sibling, 0 replies; 38+ messages in thread
From: Nadav Amit @ 2016-06-15  3:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Lukasz Anaczkowski, LKML, linux-mm, Thomas Gleixner, Ingo Molnar,
	Andi Kleen, Kirill A. Shutemov, Michal Hocko, Andrew Morton,
	H. Peter Anvin, harish.srinivasappa, lukasz.odzioba,
	Andy Lutomirski

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 06/14/2016 01:16 PM, Nadav Amit wrote:
>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> 
>>> On 06/14/2016 09:47 AM, Nadav Amit wrote:
>>>> Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:
>>>> 
>>>>>> From: Andi Kleen <ak@linux.intel.com>
>>>>>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>>>>>> +{
>>>> Here there should be a call to smp_mb__after_atomic() to synchronize with
>>>> switch_mm. I submitted a similar patch, which is still pending (hint).
>>>> 
>>>>>> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>>>>>> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>>> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
>>>>>> +				 addr + PAGE_SIZE);
>>>>>> +		mb();
>>>>>> +		set_pte(ptep, __pte(0));
>>>>>> +	}
>>>>>> +}
>>> 
>>> Shouldn't that barrier be incorporated in the TLB flush code itself and
>>> not every single caller (like this code is)?
>>> 
>>> It is insane to require individual TLB flushers to be concerned with the
>>> barriers.
>> 
>> IMHO it is best to use existing flushing interfaces instead of creating
>> new ones.
> 
> Yeah, or make these things a _little_ harder to get wrong.  That little
> snippet above isn't so crazy that we should be depending on open-coded
> barriers to get it right.
> 
> Should we just add a barrier to mm_cpumask() itself?  That should stop
> the race.  Or maybe we need a new primitive like:
> 
> /*
> * Call this if a full barrier has been executed since the last
> * pagetable modification operation.
> */
> static int __other_cpus_need_tlb_flush(struct mm_struct *mm)
> {
> 	/* cpumask_any_but() returns >= nr_cpu_ids if no cpus set. */
> 	return cpumask_any_but(mm_cpumask(mm), smp_processor_id()) <
> 		nr_cpu_ids;
> }
> 
> 
> static int other_cpus_need_tlb_flush(struct mm_struct *mm)
> {
> 	/*
> 	 * Synchronizes with switch_mm.  Makes sure that we do not
> 	 * observe a bit having been cleared in mm_cpumask() before
> 	 * the other processor has seen our pagetable update.  See
> 	 * switch_mm().
> 	 */
> 	smp_mb__after_atomic();
> 
> 	return __other_cpus_need_tlb_flush(mm)
> }
> 
> We should be able to deploy other_cpus_need_tlb_flush() in most of the
> cases where we are doing "cpumask_any_but(mm_cpumask(mm),
> smp_processor_id()) < nr_cpu_ids".
> 
> Right?
This approach may work, but I doubt other_cpus_need_tlb_flush() would
be needed by anyone, excluding this "hacky" workaround. There are already
five interfaces for invalidation of: a single page, a userspace range,
a whole task, a kernel range, and full flush including kernel (global)
entries.

> 
>> In theory, fix_pte_leak could have used flush_tlb_page. But the problem
>> is that flush_tlb_page requires the vm_area_struct as an argument, which
>> ptep_get_and_clear (and others) do not have.
> 
> That, and we do not want/need to flush the _current_ processor's TLB.
> flush_tlb_page() would have done that unnecessarily.  That's not the end
> of the world here, but it is a downside.

Oops, I missed the fact a local flush is not needed in this case.

Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:19 ` Dave Hansen
@ 2016-06-15 13:06   ` Anaczkowski, Lukasz
  0 siblings, 0 replies; 38+ messages in thread
From: Anaczkowski, Lukasz @ 2016-06-15 13:06 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm, tglx, mingo, ak,
	kirill.shutemov, mhocko, akpm, hpa
  Cc: Srinivasappa, Harish, Odzioba, Lukasz

From: Dave Hansen [mailto:dave.hansen@linux.intel.com] 
Sent: Tuesday, June 14, 2016 7:20 PM

>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
...
>> +extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
>> +			 pte_t *ptep);

> Doesn't hugetlb.h somehow #include pgtable.h?  So why double-define
> fix_pte_leak()?

It's other way round - pgtable.h somehow includes hugetlb.h. I've removed
duplicated fix_pte_leak() declaration.

>> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
>> index 2ee7811..6fa4079 100644
>> --- a/arch/x86/include/asm/pgtable_64.h
>> +++ b/arch/x86/include/asm/pgtable_64.h
>> @@ -178,6 +178,12 @@ extern void cleanup_highmap(void);
>>  extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
>>  extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
>>  
>> +#define ARCH_HAS_NEEDS_SWAP_PTL 1
>> +static inline bool arch_needs_swap_ptl(void)
>> +{
>> +       return boot_cpu_has_bug(X86_BUG_PTE_LEAK);
>> +}
>> +
>>  #endif /* !__ASSEMBLY__ */

> I think we need a comment on this sucker.  I'm not sure we should lean
> solely on the commit message to record why we need this until the end of
> time.

OK.

>> +	if (c->x86_model == 87) {

> Please use the macros in here for the model id:

OK.

> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/arch/x86/include/asm/intel-family.h

> We also probably want to prefix the pr_info() with something like
> "x86/intel:".

OK

>> +/*
>> + * Workaround for KNL issue:

> Please be specific about what this "KNL issue" *is*. 

OK

>> + * A thread that is going to page fault due to P=0, may still
>> + * non atomically set A or D bits, which could corrupt swap entries.
>> + * Always flush the other CPUs and clear the PTE again to avoid
>> + * this leakage. We are excluded using the pagetable lock.
>> + */
>> +
>> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
>> +{
>> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
>> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
>> +				 addr + PAGE_SIZE);
>> +		mb();
>> +		set_pte(ptep, __pte(0));
>> +	}
>> +}
>
> I think the comment here is a bit sparse.  Can we add some more details,
> like:
>
>	Entering here, the current CPU just cleared the PTE.  But,
>	another thread may have raced and set the A or D bits, or be
>	_about_ to set the bits.  Shooting their TLB entry down will
>	ensure they see the cleared PTE and will not set A or D.
>
> and by the set_pte():
>
>	Clear the PTE one more time, in case the other thread set A/D
>	before we sent the TLB flush.

Thanks,
Lukasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 18:10     ` Borislav Petkov
@ 2016-06-15 13:12       ` Anaczkowski, Lukasz
  0 siblings, 0 replies; 38+ messages in thread
From: Anaczkowski, Lukasz @ 2016-06-15 13:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, linux-mm, tglx, mingo, dave.hansen, ak,
	kirill.shutemov, mhocko, akpm, hpa, Srinivasappa, Harish,
	Odzioba, Lukasz, Andrejczuk, Grzegorz, Daniluk, Lukasz

From: Borislav Petkov [mailto:bp@alien8.de] 
Sent: Tuesday, June 14, 2016 8:10 PM

>> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
>
> static_cpu_has_bug()

>> +	if (c->x86_model == 87) {
>
> That should be INTEL_FAM6_XEON_PHI_KNL, AFAICT.

>> +		static bool printed;
>> +
>> +		if (!printed) {
>> +			pr_info("Enabling PTE leaking workaround\n");
>> +			printed = true;
>> +		}
>
> pr_info_once

Thanks, Boris! This is very valuable. I'll address  those comments in next version of the patch.

Cheers,
Lukasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 18:38     ` Nadav Amit
@ 2016-06-15 13:12       ` Anaczkowski, Lukasz
  2016-06-15 20:04         ` Nadav Amit
  0 siblings, 1 reply; 38+ messages in thread
From: Anaczkowski, Lukasz @ 2016-06-15 13:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, Dave Hansen, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	Srinivasappa, Harish, Odzioba, Lukasz, Andrejczuk, Grzegorz,
	Daniluk, Lukasz

From: Nadav Amit [mailto:nadav.amit@gmail.com] 
Sent: Tuesday, June 14, 2016 8:38 PM

>> +	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
>> +
>> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
>> +		fix_pte_leak(mm, addr, ptep);
>> +	return pte;
>> }
>
> I missed it on the previous iteration: ptep_get_and_clear already calls 
> fix_pte_leak when needed. So do you need to call it again here?

You're right, Nadav. Not needing this. Will be removed in next version of the patch.

Cheers,
Lukasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-15 13:12       ` Anaczkowski, Lukasz
@ 2016-06-15 20:04         ` Nadav Amit
  2016-06-15 20:10           ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2016-06-15 20:04 UTC (permalink / raw)
  To: Anaczkowski, Lukasz
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, Dave Hansen, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	Srinivasappa, Harish, Odzioba, Lukasz, Andrejczuk, Grzegorz,
	Daniluk, Lukasz

Lukasz <lukasz.anaczkowski@intel.com> wrote:

> From: Nadav Amit [mailto:nadav.amit@gmail.com] 
> Sent: Tuesday, June 14, 2016 8:38 PM
> 
>>> +	pte_t pte = ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> +	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
>>> +		fix_pte_leak(mm, addr, ptep);
>>> +	return pte;
>>> }
>> 
>> I missed it on the previous iteration: ptep_get_and_clear already calls 
>> fix_pte_leak when needed. So do you need to call it again here?
> 
> You're right, Nadav. Not needing this. Will be removed in next version of the patch.

Be careful here. According to the SDM when invalidating a huge-page,
each 4KB page needs to be invalidated separately. In practice, when
Linux invalidates 2MB/1GB pages it performs a full TLB flush. The
full flush may not be required on knights landing, and specifically
for the workaround, but you should check.  

Regards,
Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-15 20:04         ` Nadav Amit
@ 2016-06-15 20:10           ` Dave Hansen
  2016-06-15 20:26             ` Nadav Amit
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2016-06-15 20:10 UTC (permalink / raw)
  To: Nadav Amit, Anaczkowski, Lukasz
  Cc: LKML, linux-mm, Thomas Gleixner, Ingo Molnar, ak,
	kirill.shutemov, mhocko, Andrew Morton, H. Peter Anvin,
	Srinivasappa, Harish, Odzioba, Lukasz, Andrejczuk, Grzegorz,
	Daniluk, Lukasz

On 06/15/2016 01:04 PM, Nadav Amit wrote:
> Be careful here. According to the SDM when invalidating a huge-page,
> each 4KB page needs to be invalidated separately. In practice, when
> Linux invalidates 2MB/1GB pages it performs a full TLB flush. The
> full flush may not be required on knights landing, and specifically
> for the workaround, but you should check.  

Where do you get that?  The SDM says: "they (TLB invalidation operations
invalidate all TLB entries corresponding to the translation specified by
the paging structures."

Here's the full paragraph from the SDM

... some processors may choose to cache multiple smaller-page TLB
entries for a translation specified by the paging structures to use a
page larger than 4 KBytes. There is no way for software to be aware
that multiple translations for smaller pages have been used for a large
page. The INVLPG instruction and page faults provide the same assurances
that they provide when a single TLB entry is used: they invalidate all
TLB entries corresponding to the translation specified by the paging
structures.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] Linux VM workaround for Knights Landing A/D leak
  2016-06-15 20:10           ` Dave Hansen
@ 2016-06-15 20:26             ` Nadav Amit
  0 siblings, 0 replies; 38+ messages in thread
From: Nadav Amit @ 2016-06-15 20:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Anaczkowski, Lukasz, LKML, linux-mm, Thomas Gleixner,
	Ingo Molnar, ak, kirill.shutemov, mhocko, Andrew Morton,
	H. Peter Anvin, Srinivasappa, Harish, Odzioba, Lukasz,
	Andrejczuk, Grzegorz, Daniluk, Lukasz

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 06/15/2016 01:04 PM, Nadav Amit wrote:
>> Be careful here. According to the SDM when invalidating a huge-page,
>> each 4KB page needs to be invalidated separately. In practice, when
>> Linux invalidates 2MB/1GB pages it performs a full TLB flush. The
>> full flush may not be required on knights landing, and specifically
>> for the workaround, but you should check.  
> 
> Where do you get that?  The SDM says: "they (TLB invalidation operations
> invalidate all TLB entries corresponding to the translation specified by
> the paging structures.”

You are absolutely correct. Last time I write something based on my
recollection of the SDM without re-reading again. Sorry.

Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3] Linux VM workaround for Knights Landing A/D leak
  2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
                       ` (2 preceding siblings ...)
  2016-06-14 18:38     ` Nadav Amit
@ 2016-06-16 15:14     ` Lukasz Anaczkowski
  2016-06-16 16:43       ` Nadav Amit
  2016-06-16 20:23       ` Dave Hansen
  3 siblings, 2 replies; 38+ messages in thread
From: Lukasz Anaczkowski @ 2016-06-16 15:14 UTC (permalink / raw)
  To: hpa, mingo, tglx, dave.hansen, ak, kirill.shutemov, mhocko, akpm,
	linux-kernel, linux-mm
  Cc: lukasz.anaczkowski, harish.srinivasappa, lukasz.odzioba,
	grzegorz.andrejczuk, lukasz.daniluk

From: Andi Kleen <ak@linux.intel.com>

Knights Landing has an issue that a thread setting A or D bits
may not do so atomically against checking the present bit.
A thread which is going to page fault may still set those
bits, even though the present bit was already atomically cleared.

This implies that when the kernel clears present atomically,
some time later the supposed to be zero entry could be corrupted
with stray A or D bits.

Since the PTE could be already used for storing a swap index,
or a NUMA migration index, this cannot be tolerated. Most
of the time the kernel detects the problem, but in some
rare cases it may not.

This patch enforces that the page unmap path in vmscan/direct reclaim
always flushes other CPUs after clearing each page, and also
clears the PTE again after the flush. A new memory barrier may be
required, but this code is at least consistent with all
the existing uses in the kernel. If we decide that we need a new
barrier, it can added at the same time as the rest of the tree.

For reclaim this brings the performance back to before Mel's
flushing changes, but for unmap it disables batching.

This makes sure any leaked A/D bits are immediately cleared before the entry
is used for something else.

Any parallel faults that check for entry is zero may loop,
but they should eventually recover after the entry is written.

Also other users may spin in the page table lock until we
"fixed" the PTE. This is ensured by always taking the page table lock
even for the swap cache case. Previously this was only done
on architectures with non atomic PTE accesses (such as 32bit PTE),
but now it is also done when this bug workaround is active.

I audited apply_pte_range and other users of arch_enter_lazy...
and they seem to all not clear the present bit.

Right now the extra flush is done in the low level
architecture code, while the higher level code still
does batched TLB flush. This means there is always one extra
unnecessary TLB flush now. As a followon optimization
this could be avoided by telling the callers that
the flush already happenend.

The official erratum will be posted hopefully by the end of July'16.

v3 (Lukasz Anaczkowski):
    () Improved documentation
    () Removed unnecessary declaration and call to fix_pte_leak from hugetlb.h
    () Moved fix_pte_leak() definition from tlb.c to intel.c
    () Replaced boot_cpu_has_bug() with static_cpu_has_bug()
    () pr_info_once instead of pr_info
    () Fix applies only to 64-bit kernels as Knights Landing does not
       support 32-bit kernels

v2 (Lukasz Anaczkowski):
    () added call to smp_mb__after_atomic() to synchornize with
       switch_mm, based on Nadav's comment
    () fixed compilation breakage

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/pgtable.h     | 10 ++++++++++
 arch/x86/include/asm/pgtable_64.h  | 16 +++++++++++++++
 arch/x86/kernel/cpu/intel.c        | 41 ++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h                 |  4 ++++
 mm/memory.c                        |  3 ++-
 6 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4a41348..2c48011 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -303,6 +303,7 @@
 #define X86_BUG_SYSRET_SS_ATTRS	X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
 #define X86_BUG_NULL_SEG	X86_BUG(9) /* Nulling a selector preserves the base */
 #define X86_BUG_SWAPGS_FENCE	X86_BUG(10) /* SWAPGS without input dep on GS */
+#define X86_BUG_PTE_LEAK        X86_BUG(11) /* PTE may leak A/D bits after clear */
 
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1a27396..14abcb3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -794,11 +794,21 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
 
+#if defined(CONFIG_X86_64) && defined(CONFIG_CPU_SUP_INTEL)
+extern void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+			 pte_t *ptep);
+#else
+static inline void fix_pte_leak(struct mm_struct *mm, unsigned long addr,
+				pte_t *ptep) {}
+#endif
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 				       pte_t *ptep)
 {
 	pte_t pte = native_ptep_get_and_clear(ptep);
+	if (static_cpu_has_bug(X86_BUG_PTE_LEAK))
+		fix_pte_leak(mm, addr, ptep);
 	pte_update(mm, addr, ptep);
 	return pte;
 }
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2ee7811..be7d63c 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -178,6 +178,22 @@ extern void cleanup_highmap(void);
 extern void init_extra_mapping_uc(unsigned long phys, unsigned long size);
 extern void init_extra_mapping_wb(unsigned long phys, unsigned long size);
 
+/*
+ * Intel Xeon Phi x200 codenamed Knights Landing has an issue that a thread
+ * setting A or D bits may not do so atomically against checking the present
+ * bit. A thread which is going to page fault may still set those
+ * bits, even though the present bit was already atomically cleared.
+ *
+ * This implies that when the kernel clears present atomically,
+ * some time later the supposed to be zero entry could be corrupted
+ * with stray A or D bits.
+ */
+#define ARCH_HAS_NEEDS_SWAP_PTL 1
+static inline bool arch_needs_swap_ptl(void)
+{
+       return static_cpu_has_bug(X86_BUG_PTE_LEAK);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 6e2ffbe..b5f5bab 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -7,12 +7,17 @@
 #include <linux/thread_info.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
+#include <linux/mm_types.h>
 
 #include <asm/cpufeature.h>
 #include <asm/pgtable.h>
 #include <asm/msr.h>
 #include <asm/bugs.h>
 #include <asm/cpu.h>
+#include <asm/intel-family.h>
+#include <asm/tlbflush.h>
+
+#include <trace/events/tlb.h>
 
 #ifdef CONFIG_X86_64
 #include <linux/topology.h>
@@ -60,6 +65,35 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
 	}
 }
 
+#ifdef CONFIG_X86_64
+/*
+ * Knights Landing has an issue that a thread setting A or D bits
+ * may not do so atomically against checking the present bit.
+ * A thread which is going to page fault may still set those
+ * bits, even though the present bit was already atomically cleared.
+ *
+ * Entering here, the current CPU just cleared the PTE.  But,
+ * another thread may have raced and set the A or D bits, or be
+ * _about_ to set the bits.  Shooting their TLB entry down will
+ * ensure they see the cleared PTE and will not set A or D and
+ * won't corrupt swap entries.
+ */
+void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
+		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+		flush_tlb_others(mm_cpumask(mm), mm, addr,
+				 addr + PAGE_SIZE);
+		mb();
+		/*
+		 * Clear the PTE one more time, in case the other thread set A/D
+		 * before we sent the TLB flush.
+		 */
+		set_pte(ptep, __pte(0));
+	}
+}
+#endif
+
 static void early_init_intel(struct cpuinfo_x86 *c)
 {
 	u64 misc_enable;
@@ -181,6 +215,13 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		}
 	}
 
+#ifdef CONFIG_X86_64
+	if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) {
+		pr_info_once("x86/intel: Enabling PTE leaking workaround\n");
+		set_cpu_bug(c, X86_BUG_PTE_LEAK);
+	}
+#endif
+
 	/*
 	 * Intel Quark Core DevMan_001.pdf section 6.4.11
 	 * "The operating system also is required to invalidate (i.e., flush)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5df5feb..5c80fe09 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2404,6 +2404,10 @@ static inline bool debug_guardpage_enabled(void) { return false; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#ifndef ARCH_HAS_NEEDS_SWAP_PTL
+static inline bool arch_needs_swap_ptl(void) { return false; }
+#endif
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
diff --git a/mm/memory.c b/mm/memory.c
index 15322b7..0d6ef39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1960,7 +1960,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
+	if (arch_needs_swap_ptl() ||
+	    sizeof(pte_t) > sizeof(unsigned long)) {
 		spinlock_t *ptl = pte_lockptr(mm, pmd);
 		spin_lock(ptl);
 		same = pte_same(*page_table, orig_pte);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3] Linux VM workaround for Knights Landing A/D leak
  2016-06-16 15:14     ` [PATCH v3] " Lukasz Anaczkowski
@ 2016-06-16 16:43       ` Nadav Amit
  2016-06-16 20:23       ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Nadav Amit @ 2016-06-16 16:43 UTC (permalink / raw)
  To: Lukasz Anaczkowski
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Dave Hansen, ak,
	kirill.shutemov, mhocko, akpm, linux-kernel, linux-mm,
	harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk,
	lukasz.daniluk


Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> wrote:

> From: Andi Kleen <ak@linux.intel.com>
> 
> +void fix_pte_leak(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> +{
> +	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
> +		trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
This tracing seems incorrect since you don’t perform a local flush.
I don’t think you need any tracing - native_flush_tlb_others will do it for you.

> +		flush_tlb_others(mm_cpumask(mm), mm, addr,
> +				 addr + PAGE_SIZE);
> +		mb();
Why do you need the memory barrier?

Regards,
Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3] Linux VM workaround for Knights Landing A/D leak
  2016-06-16 15:14     ` [PATCH v3] " Lukasz Anaczkowski
  2016-06-16 16:43       ` Nadav Amit
@ 2016-06-16 20:23       ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2016-06-16 20:23 UTC (permalink / raw)
  To: Lukasz Anaczkowski, hpa, mingo, tglx, ak, kirill.shutemov,
	mhocko, akpm, linux-kernel, linux-mm
  Cc: harish.srinivasappa, lukasz.odzioba, grzegorz.andrejczuk, lukasz.daniluk

On 06/16/2016 08:14 AM, Lukasz Anaczkowski wrote:
> For reclaim this brings the performance back to before Mel's
> flushing changes, but for unmap it disables batching.

This turns out to be pretty catastrophic for unmap.  In a workload that
uses, say 200 hardware threads and alloc/frees() a few MB/sec, this ends
up costing hundreds of thousands of extra received IPIs.  10MB=~2500
ptes, and at with 200 threads, that's 250,000 IPIs received just to free
10MB of memory.

The initial testing we did on this was on a *bunch* of threads all doing
alloc/free.  But this is bottlenecked on other things, like mmap_sem
being held for write.

The scenario that we really needed to test here was on lots of threads
doing processing and 1 thread doing alloc/free.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-06-16 20:24 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-14 15:58 [PATCH] Linux VM workaround for Knights Landing A/D leak Lukasz Anaczkowski
2016-06-14 16:31 ` kbuild test robot
2016-06-14 16:47 ` Nadav Amit
2016-06-14 16:54   ` Anaczkowski, Lukasz
2016-06-14 17:01   ` [PATCH v2] " Lukasz Anaczkowski
2016-06-14 17:24     ` Dave Hansen
2016-06-14 18:34       ` One Thousand Gnomes
2016-06-14 18:54         ` Dave Hansen
2016-06-14 19:19           ` Borislav Petkov
2016-06-14 20:20             ` H. Peter Anvin
2016-06-14 20:47               ` Borislav Petkov
2016-06-14 20:54                 ` H. Peter Anvin
2016-06-14 21:02                   ` Borislav Petkov
2016-06-14 21:08                     ` H. Peter Anvin
2016-06-14 21:13                     ` H. Peter Anvin
2016-06-14 18:10     ` Borislav Petkov
2016-06-15 13:12       ` Anaczkowski, Lukasz
2016-06-14 18:38     ` Nadav Amit
2016-06-15 13:12       ` Anaczkowski, Lukasz
2016-06-15 20:04         ` Nadav Amit
2016-06-15 20:10           ` Dave Hansen
2016-06-15 20:26             ` Nadav Amit
2016-06-16 15:14     ` [PATCH v3] " Lukasz Anaczkowski
2016-06-16 16:43       ` Nadav Amit
2016-06-16 20:23       ` Dave Hansen
2016-06-14 17:18   ` [PATCH] " Dave Hansen
2016-06-14 20:16     ` Nadav Amit
2016-06-14 21:37       ` Dave Hansen
2016-06-15  2:20         ` Andy Lutomirski
2016-06-15  2:35           ` Nadav Amit
2016-06-15  2:36             ` Andy Lutomirski
2016-06-15  2:44               ` Nadav Amit
2016-06-15  3:09                 ` Andy Lutomirski
2016-06-15  3:20         ` Nadav Amit
2016-06-14 16:58 ` kbuild test robot
2016-06-14 17:19 ` Dave Hansen
2016-06-15 13:06   ` Anaczkowski, Lukasz
2016-06-14 17:47 ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).