* [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
@ 2017-06-16 19:02 ` Luck, Tony
  0 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-16 19:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, Dave Hansen, Naoya Horiguchi, x86, linux-mm,
	linux-kernel, stable

From: Tony Luck <tony.luck@intel.com>

Speculative processor accesses may reference any memory that has a
valid page table entry.  While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will then be set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out, otherwise we may trigger the
very problem we are trying to avoid.  We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".

Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: x86@kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Thanks to Dave Hansen for reviewing several iterations of this.

 arch/x86/include/asm/page_64.h   |  4 ++++
 arch/x86/kernel/cpu/mcheck/mce.c | 35 +++++++++++++++++++++++++++++++++++
 include/linux/mm_inline.h        |  6 ++++++
 mm/memory-failure.c              |  2 ++
 4 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index b4a0d43248cf..b50df06ad251 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -51,6 +51,10 @@ static inline void clear_page(void *page)
 
 void copy_page(void *to, void *from);
 
+#ifdef CONFIG_X86_MCE
+#define arch_unmap_kpfn arch_unmap_kpfn
+#endif
+
 #endif	/* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 5cfbaeb6529a..56563db0b2be 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -51,6 +51,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -1056,6 +1057,40 @@ static int do_memory_failure(struct mce *m)
 	return ret;
 }
 
+#ifdef CONFIG_X86_64
+
+void arch_unmap_kpfn(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+
+	/*
+	 * Unmap this page from the kernel 1:1 mappings to make sure
+	 * we don't log more errors because of speculative access to
+	 * the page.
+	 * We would like to just call:
+	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_np() not checking whether we passed
+	 * a legal address.
+	 */
+
+#if PGDIR_SHIFT + 9 < 63 /* 9 because cpp doesn't grok ilog2(PTRS_PER_PGD) */
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+#else
+#error "no unused virtual bit available"
+#endif
+
+	if (set_memory_np(decoy_addr, 1))
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);
+
+}
+#endif
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e030a68ead7e..25438b2b6f22 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -126,4 +126,10 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef arch_unmap_kpfn
+extern void arch_unmap_kpfn(unsigned long pfn);
+#else
+static __always_inline void arch_unmap_kpfn(unsigned long pfn) { }
+#endif
+
 #endif
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 342fac9ba89b..9479e190dcbd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1071,6 +1071,8 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 		return 0;
 	}
 
+	arch_unmap_kpfn(pfn);
+
 	/*
 	 * Currently errors on hugetlbfs pages are measured in hugepage units,
 	 * so nr_pages should be 1 << compound_order.  OTOH when errors are on
-- 
2.11.0


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-16 19:02 ` Luck, Tony
@ 2017-06-19 18:01   ` Borislav Petkov
  -1 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-06-19 18:01 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Dave Hansen, Naoya Horiguchi, x86, linux-mm, linux-kernel, Yazen Ghannam

(drop stable from CC)

You could use git's --suppress-cc= option when sending.

On Fri, Jun 16, 2017 at 12:02:00PM -0700, Luck, Tony wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> Speculative processor accesses may reference any memory that has a
> valid page table entry.  While a speculative access won't generate
> a machine check, it will log the error in a machine check bank. That
> could cause escalation of a subsequent error since the overflow bit
> will then be set in the machine check bank status register.

...

> @@ -1056,6 +1057,40 @@ static int do_memory_failure(struct mce *m)
>  	return ret;
>  }
>  
> +#ifdef CONFIG_X86_64
> +
> +void arch_unmap_kpfn(unsigned long pfn)
> +{

I guess you can move the ifdeffery inside the function.

> +	unsigned long decoy_addr;
> +
> +	/*
> +	 * Unmap this page from the kernel 1:1 mappings to make sure
> +	 * we don't log more errors because of speculative access to
> +	 * the page.
> +	 * We would like to just call:
> +	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
> +	 * but doing that would radically increase the odds of a
> +	 * speculative access to the posion page because we'd have
> +	 * the virtual address of the kernel 1:1 mapping sitting
> +	 * around in registers.
> +	 * Instead we get tricky.  We create a non-canonical address
> +	 * that looks just like the one we want, but has bit 63 flipped.
> +	 * This relies on set_memory_np() not checking whether we passed
> +	 * a legal address.
> +	 */
> +
> +#if PGDIR_SHIFT + 9 < 63 /* 9 because cpp doesn't grok ilog2(PTRS_PER_PGD) */

Please no side comments.

Also, explain why the build-time check. (Sign-extension going away for VA
space yadda yadda..., 5 2/3 level paging :-))

Also, I'm assuming this whole "workaround" of sorts should be Intel-only?

> +	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> +#else
> +#error "no unused virtual bit available"
> +#endif
> +
> +	if (set_memory_np(decoy_addr, 1))
> +		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);

WARNING: unnecessary whitespace before a quoted newline
#107: FILE: arch/x86/kernel/cpu/mcheck/mce.c:1089:
+               pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);


-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-16 19:02 ` Luck, Tony
@ 2017-06-21  2:12   ` Naoya Horiguchi
  -1 siblings, 0 replies; 49+ messages in thread
From: Naoya Horiguchi @ 2017-06-21  2:12 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, Dave Hansen, x86, linux-mm, linux-kernel

(drop stable from CC)

On Fri, Jun 16, 2017 at 12:02:00PM -0700, Luck, Tony wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> Speculative processor accesses may reference any memory that has a
> valid page table entry.  While a speculative access won't generate
> a machine check, it will log the error in a machine check bank. That
> could cause escalation of a subsequent error since the overflow bit
> will then be set in the machine check bank status register.
> 
> Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
> address of the page we want to map out otherwise we may trigger the
> very problem we are trying to avoid.  We use a non-canonical address
> that passes through the usual Linux table walking code to get to the
> same "pte".
> 
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: x86@kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> Thanks to Dave Hansen for reviewing several iterations of this.
> 
>  arch/x86/include/asm/page_64.h   |  4 ++++
>  arch/x86/kernel/cpu/mcheck/mce.c | 35 +++++++++++++++++++++++++++++++++++
>  include/linux/mm_inline.h        |  6 ++++++
>  mm/memory-failure.c              |  2 ++
>  4 files changed, 47 insertions(+)
> 
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index b4a0d43248cf..b50df06ad251 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -51,6 +51,10 @@ static inline void clear_page(void *page)
>  
>  void copy_page(void *to, void *from);
>  
> +#ifdef CONFIG_X86_MCE
> +#define arch_unmap_kpfn arch_unmap_kpfn
> +#endif
> +
>  #endif	/* !__ASSEMBLY__ */
>  
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 5cfbaeb6529a..56563db0b2be 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -51,6 +51,7 @@
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>  #include <asm/reboot.h>
> +#include <asm/set_memory.h>
>  
>  #include "mce-internal.h"
>  
> @@ -1056,6 +1057,40 @@ static int do_memory_failure(struct mce *m)
>  	return ret;
>  }
>  
> +#ifdef CONFIG_X86_64
> +
> +void arch_unmap_kpfn(unsigned long pfn)
> +{
> +	unsigned long decoy_addr;
> +
> +	/*
> +	 * Unmap this page from the kernel 1:1 mappings to make sure
> +	 * we don't log more errors because of speculative access to
> +	 * the page.
> +	 * We would like to just call:
> +	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
> +	 * but doing that would radically increase the odds of a
> +	 * speculative access to the poison page because we'd have
> +	 * the virtual address of the kernel 1:1 mapping sitting
> +	 * around in registers.
> +	 * Instead we get tricky.  We create a non-canonical address
> +	 * that looks just like the one we want, but has bit 63 flipped.
> +	 * This relies on set_memory_np() not checking whether we passed
> +	 * a legal address.
> +	 */
> +
> +#if PGDIR_SHIFT + 9 < 63 /* 9 because cpp doesn't grok ilog2(PTRS_PER_PGD) */
> +	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> +#else
> +#error "no unused virtual bit available"
> +#endif
> +
> +	if (set_memory_np(decoy_addr, 1))
> +		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);
> +
> +}
> +#endif
> +
>  /*
>   * The actual machine check handler. This only handles real
>   * exceptions when something got corrupted coming in through int 18.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index e030a68ead7e..25438b2b6f22 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -126,4 +126,10 @@ static __always_inline enum lru_list page_lru(struct page *page)
>  
>  #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
>  
> +#ifdef arch_unmap_kpfn
> +extern void arch_unmap_kpfn(unsigned long pfn);
> +#else
> +static __always_inline void arch_unmap_kpfn(unsigned long pfn) { }
> +#endif
> +
>  #endif
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 342fac9ba89b..9479e190dcbd 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1071,6 +1071,8 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
>  		return 0;
>  	}
>  
> +	arch_unmap_kpfn(pfn);
> +

We had better have a reverse operation of this to cancel the unmapping
when unpoisoning?

Thanks,
Naoya Horiguchi


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-19 18:01   ` Borislav Petkov
@ 2017-06-21 17:47     ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-21 17:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Naoya Horiguchi, x86, linux-mm, linux-kernel, Yazen Ghannam

On Mon, Jun 19, 2017 at 08:01:47PM +0200, Borislav Petkov wrote:
> (drop stable from CC)
> 
> You could use git's --suppress-cc= option when sending.

I would if I could work out how to use it. From reading the manual
page there seem to be a few options to this, but none of them appear
to just drop a specific address (apart from my own). :-(

> > +#ifdef CONFIG_X86_64
> > +
> > +void arch_unmap_kpfn(unsigned long pfn)
> > +{
> 
> I guess you can move the ifdeffery inside the function.

If I do, then the compiler will emit an empty function. It's only
a couple of bytes for the "ret" ... but why?  I may change it
to:

   #if defined(arch_unmap_kpfn) && defined(CONFIG_MEMORY_FAILURE)

to narrow down further when we need this.
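
In mce.c that would then look something like this (sketch only):

   #if defined(arch_unmap_kpfn) && defined(CONFIG_MEMORY_FAILURE)
   void arch_unmap_kpfn(unsigned long pfn)
   {
	/* ... body as in the patch ... */
   }
   #endif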

> > +#if PGDIR_SHIFT + 9 < 63 /* 9 because cpp doesn't grok ilog2(PTRS_PER_PGD) */
> 
> Please no side comments.

Ok.

> Also, explain why the build-time check. (Sign-extension going away for VA
> space yadda yadda..., 5 2/3 level paging :-))

Will add.

> Also, I'm assuming this whole "workaround" of sorts should be Intel-only?

I'd assume that other X86 implementations would face similar issues (unless
they have extremely cautious pre-fetchers and/or no speculation).

I'm also assuming that non-X86 architectures that do recovery may want this
too ... hence hooking the arch_unmap_kpfn() function into the generic
memory_failure() code.

> > +	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> > +#else
> > +#error "no unused virtual bit available"
> > +#endif
> > +
> > +	if (set_memory_np(decoy_addr, 1))
> > +		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);
> 
> WARNING: unnecessary whitespace before a quoted newline
> #107: FILE: arch/x86/kernel/cpu/mcheck/mce.c:1089:
> +               pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);

Oops!  Will fix.

-Tony


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-21  2:12   ` Naoya Horiguchi
@ 2017-06-21 17:54     ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-21 17:54 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Borislav Petkov, Dave Hansen, x86, linux-mm, linux-kernel

On Wed, Jun 21, 2017 at 02:12:27AM +0000, Naoya Horiguchi wrote:

> We had better have a reverse operation of this to cancel the unmapping
> when unpoisoning?

When we have unpoisoning, we can add something.  We don't seem to have
an inverse function for "set_memory_np" to just flip the _PRESENT bit
back on again. But it would be trivial to write a set_memory_pp().

Since we'd be doing this after the poison has been cleared, we wouldn't
need to play games with the address.  We'd just use:

	set_memory_pp((unsigned long)pfn_to_kaddr(pfn), 1);
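
A minimal sketch of that helper, mirroring how set_memory_np() is
built in arch/x86/mm/pageattr.c (untested; change_page_attr_set() is
static there, so it would have to live in that file):

	/* inverse of set_memory_np(): put _PAGE_PRESENT back */
	int set_memory_pp(unsigned long addr, int numpages)
	{
		return change_page_attr_set(&addr, numpages,
					    __pgprot(_PAGE_PRESENT), 0);
	}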

-Tony


* RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-21 17:54     ` Luck, Tony
@ 2017-06-21 19:47       ` Elliott, Robert (Persistent Memory)
  -1 siblings, 0 replies; 49+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-06-21 19:47 UTC (permalink / raw)
  To: Luck, Tony, Naoya Horiguchi
  Cc: Vaden, Tom (HPE Server OS Architecture),
	linux-nvdimm, x86, linux-kernel, linux-mm, Dave Hansen,
	Borislav Petkov


> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Luck, Tony
> Sent: Wednesday, June 21, 2017 12:54 PM
> To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Borislav Petkov <bp@suse.de>; Dave Hansen <dave.hansen@intel.com>;
> x86@kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org

(adding linux-nvdimm list in this reply)

> Subject: Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1
> mappings of poison pages
> 
> On Wed, Jun 21, 2017 at 02:12:27AM +0000, Naoya Horiguchi wrote:
> 
> > We had better have a reverse operation of this to cancel the unmapping
> > when unpoisoning?
> 
> When we have unpoisoning, we can add something.  We don't seem to have
> an inverse function for "set_memory_np" to just flip the _PRESENT bit
> back on again. But it would be trivial to write a set_memory_pp().
> 
> Since we'd be doing this after the poison has been cleared, we wouldn't
> need to play games with the address.  We'd just use:
> 
> 	set_memory_pp((unsigned long)pfn_to_kaddr(pfn), 1);
> 
> -Tony

Persistent memory does have unpoisoning and would require this inverse
operation - see drivers/nvdimm/pmem.c pmem_clear_poison() and core.c
nvdimm_clear_poison().

---
Robert Elliott, HPE Persistent Memory

* RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-21 17:47     ` Luck, Tony
@ 2017-06-21 19:59       ` Elliott, Robert (Persistent Memory)
  -1 siblings, 0 replies; 49+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-06-21 19:59 UTC (permalink / raw)
  To: Luck, Tony, Borislav Petkov
  Cc: Dave Hansen, Naoya Horiguchi, x86, linux-mm, linux-kernel,
	Yazen Ghannam, 'dan.j.williams@intel.com',
	Kani, Toshimitsu, Vaden, Tom (HPE Server OS Architecture)

> +	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
> +#else
> +#error "no unused virtual bit available"
> +#endif
> +
> +	if (set_memory_np(decoy_addr, 1))
> +		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);

Does this patch handle breaking up 512 GiB, 1 GiB or 2 MiB page mappings
if it's just trying to mark a 4 KiB page as bad?

Although the kernel doesn't use MTRRs itself anymore, what if the system
BIOS still uses them for some memory regions, and the bad address falls in
an MTRR region?

---
Robert Elliott, HPE Persistent Memory


* RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-21 19:59       ` Elliott, Robert (Persistent Memory)
@ 2017-06-21 20:19         ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-21 20:19 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Borislav Petkov
  Cc: Hansen, Dave, Naoya Horiguchi, x86, linux-mm, linux-kernel,
	Yazen Ghannam, Williams, Dan J, Kani, Toshimitsu, Vaden,
	Tom (HPE Server OS Architecture)

>> +if (set_memory_np(decoy_addr, 1))
>> +pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n", pfn);
>
> Does this patch handle breaking up 512 GiB, 1 GiB or 2 MiB page mappings
> if it's just trying to mark a 4 KiB page as bad?

Yes.  The 1:1 mappings start out using the largest supported page size.  This
call will break up huge/large pages so that only 4KB is mapped out.
[This will affect performance because of the extra levels of TLB walks]
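
For what it's worth, lookup_address() reports the mapping level, so
the split can be observed (debug use only, since this materializes
the very 1:1 address the patch is otherwise careful not to touch):

	unsigned int level;
	pte_t *pte = lookup_address((unsigned long)pfn_to_kaddr(pfn), &level);

	/* PG_LEVEL_2M or PG_LEVEL_1G before set_memory_np(),
	 * PG_LEVEL_4K for this address afterwards */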

> Although the kernel doesn't use MTRRs itself anymore, what if the system
> BIOS still uses them for some memory regions, and the bad address falls in
> an MTRR region?

This code is called after mm/memory-failure.c:memory_failure() has already
checked that the page is one managed by the kernel.  In general machine checks
from other regions are going to be called out as fatal before we get here.

-Tony



* RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
@ 2017-06-21 20:30         ` Luck, Tony
  0 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-21 20:30 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Naoya Horiguchi
  Cc: Borislav Petkov, Hansen, Dave, x86, linux-mm, linux-kernel,
	linux-nvdimm, Williams, Dan J, Kani, Toshimitsu, Vaden,
	Tom (HPE Server OS Architecture)

> Persistent memory does have unpoisoning and would require this inverse
> operation - see drivers/nvdimm/pmem.c pmem_clear_poison() and core.c
> nvdimm_clear_poison().

Nice.  Well this code will need to cooperate with that ... in particular if the page
is in an area that can be unpoisoned ... then we should do that *instead* of marking
the page not present (which breaks up huge/large pages and so affects performance).

Instead of calling it "arch_unmap_kpfn" it could be called something like arch_handle_poison()
and do something like:

void arch_handle_poison(unsigned long pfn)
{
	if this is a pmem page && pmem_clear_poison(pfn)
		return
	if this is a nvdimm page && nvdimm_clear_poison(pfn)
		return
	/* can't clear, map out from 1:1 region */
	... code from my patch ...
}

I'm just not sure how those first two "if" bits work ... particularly in terms of CONFIG dependencies and system
capabilities.  Perhaps each of pmem and nvdimm could register their unpoison functions and this code could
just call each in turn?
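
Roughly like this, with hypothetical names, just to sketch the
registration idea (locking etc. omitted):

	/* each subsystem registers a "try to clear poison" callback */
	struct poison_clearer {
		int (*try_clear)(unsigned long pfn);	/* 0 on success */
		struct poison_clearer *next;
	};

	static struct poison_clearer *clearers;

	void register_poison_clearer(struct poison_clearer *c)
	{
		c->next = clearers;
		clearers = c;
	}

	void arch_handle_poison(unsigned long pfn)
	{
		struct poison_clearer *c;

		for (c = clearers; c; c = c->next)
			if (!c->try_clear(pfn))
				return;

		/* nobody could clear it: map out from the 1:1 region */
		/* ... code from my patch ... */
	}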

-Tony


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-21 17:47     ` Luck, Tony
@ 2017-06-22  9:39       ` Borislav Petkov
  -1 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-06-22  9:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Dave Hansen, Naoya Horiguchi, x86, linux-mm, linux-kernel, Yazen Ghannam

On Wed, Jun 21, 2017 at 10:47:40AM -0700, Luck, Tony wrote:
> I would if I could work out how to use it. From reading the manual
> page there seem to be a few options to this, but none of them appear
> to just drop a specific address (apart from my own). :-(

$ git send-email --to ... --cc ... --cc ... --suppress-cc=all ...

That should send only to the ones you have in --to and --cc and suppress
the rest.

Do a

$ git send-email -v --dry-run --to ... --cc ... --cc ... --suppress-cc=all ...

to see what it is going to do.

> I'd assume that other X86 implementations would face similar issues (unless
> they have extremely cautious pre-fetchers and/or no speculation).
> 
> I'm also assuming that non-X86 architectures that do recovery may want this
> too ... hence hooking the arch_unmap_kpfn() function into the generic
> memory_failure() code.

Which means that you could move the function to generic
mm/memory-failure.c code after making the decoy_addr computation
generic.

I'd still like to hear some sort of confirmation from other
vendors/arches whether it makes sense for them too, though.

I mean, if they don't do speculative accesses, then it probably doesn't
matter even - the page is inaccessible anyway but still...

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
@ 2017-06-23  5:07           ` Dan Williams
  0 siblings, 0 replies; 49+ messages in thread
From: Dan Williams @ 2017-06-23  5:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Elliott, Robert (Persistent Memory),
	Naoya Horiguchi, Borislav Petkov, Hansen, Dave, x86, linux-mm,
	linux-kernel, linux-nvdimm, Kani, Toshimitsu, Vaden,
	Tom (HPE Server OS Architecture)

On Wed, Jun 21, 2017 at 1:30 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> Persistent memory does have unpoisoning and would require this inverse
>> operation - see drivers/nvdimm/pmem.c pmem_clear_poison() and core.c
>> nvdimm_clear_poison().
>
> Nice.  Well this code will need to cooperate with that ... in particular if the page
> is in an area that can be unpoisoned ... then we should do that *instead* of marking
> the page not present (which breaks up huge/large pages and so affects performance).
>
> Instead of calling it "arch_unmap_kpfn" it could be called something like arch_handle_poison()
> and do something like:
>
> void arch_handle_poison(unsigned long pfn)
> {
>         if this is a pmem page && pmem_clear_poison(pfn)
>                 return
>         if this is a nvdimm page && nvdimm_clear_poison(pfn)
>                 return
>         /* can't clear, map out from 1:1 region */
>         ... code from my patch ...
> }
>
> I'm just not sure how those first two "if" bits work ... particularly in terms of CONFIG dependencies and system
> capabilities.  Perhaps each of pmem and nvdimm could register their unpoison functions and this code could
> just call each in turn?

We don't unpoison pmem without new data to write in its place. In what
context is arch_handle_poison() called? Ideally we only "clear" poison
when we know we are trying to write zero over the poisoned range.


* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-23  5:07           ` Dan Williams
  (?)
@ 2017-06-23 20:59             ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-23 20:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vaden, Tom (HPE Server OS Architecture),
	linux-mm, x86, linux-kernel, Hansen,
	Dave  <dave.hansen@intel.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Borislav Petkov, linux-nvdimm

On Thu, Jun 22, 2017 at 10:07:18PM -0700, Dan Williams wrote:
> On Wed, Jun 21, 2017 at 1:30 PM, Luck, Tony <tony.luck@intel.com> wrote:
> >> Persistent memory does have unpoisoning and would require this inverse
> >> operation - see drivers/nvdimm/pmem.c pmem_clear_poison() and core.c
> >> nvdimm_clear_poison().
> >
> > Nice.  Well this code will need to cooperate with that ... in particular if the page
> > is in an area that can be unpoisoned ... then we should do that *instead* of marking
> > the page not present (which breaks up huge/large pages and so affects performance).
> >
> > Instead of calling it "arch_unmap_kpfn" it could be called something like arch_handle_poison()
> > and do something like:
> >
> > void arch_handle_poison(unsigned long pfn)
> > {
> >         if this is a pmem page && pmem_clear_poison(pfn)
> >                 return
> >         if this is a nvdimm page && nvdimm_clear_poison(pfn)
> >                 return
> >         /* can't clear, map out from 1:1 region */
> >         ... code from my patch ...
> > }
> >
> > I'm just not sure how those first two "if" bits work ... particularly in terms of CONFIG dependencies and system
> > capabilities.  Perhaps each of pmem and nvdimm could register their unpoison functions and this code could
> > just call each in turn?
> 
> We don't unpoison pmem without new data to write in it's place. What
> context is arch_handle_poison() called? Ideally we only "clear" poison
> when we know we are trying to write zero over the poisoned range.

Context is that of the process that did the access (but we've moved
off the machine check stack and are now in normal kernel context).
We are about to unmap this page from all applications that are
using it.  But they may still be running ... so now is a bad time to
clear the poison: they might access the page and not get a signal.

If I move this code to after all the users' PTEs have been cleared
and TLBs flushed, then it would be safe to try to unpoison the page
and not invalidate it from the 1:1 mapping.
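
In outline, that reordering would look something like the sketch below
(pure illustration: unmap_poisoned_page_from_users() and
try_clear_poison() are hypothetical helpers, not kernel functions,
though flush_tlb_all() and arch_unmap_kpfn() are real):

/*
 * Sketch of the ordering discussed above -- not real kernel code.
 * Poison may only be cleared once no user mapping can still reach
 * the page, otherwise a racing access could dodge its signal.
 */
static void handle_poison_late(struct page *p, unsigned long pfn)
{
	/* 1. Drop the page from every user page table. */
	unmap_poisoned_page_from_users(p);	/* hypothetical */

	/* 2. Kill any stale translations. */
	flush_tlb_all();

	/* 3. Clearing the poison can no longer race a user access ... */
	if (try_clear_poison(pfn))		/* hypothetical */
		return;

	/* 4. ... and if it can't be cleared, map out of the 1:1 map. */
	arch_unmap_kpfn(pfn);
}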

But I'm not sure what happens next. For a normal DDR4 page I could
put it back on the free list and allow it to be re-used. But for
PMEM you have some other cleanup that you need to do to mark the
block as lost from your file system.

Is this too early for you to be able to do that?

-Tony

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-21 17:47     ` Luck, Tony
  (?)
@ 2017-06-23 22:19       ` Elliott, Robert (Persistent Memory)
  -1 siblings, 0 replies; 49+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-06-23 22:19 UTC (permalink / raw)
  To: Luck, Tony, Borislav Petkov
  Cc: Dave Hansen, x86, linux-kernel, linux-mm, Yazen Ghannam,
	Naoya Horiguchi, linux-nvdimm

> > > +	if (set_memory_np(decoy_addr, 1))
> > > +		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n",

Another concept to consider is mapping the page as UC rather than
completely unmapping it.

The uncorrectable error scope could be smaller than a page size, like:
* memory ECC width (e.g., 8 bytes)
* cache line size (e.g., 64 bytes)
* block device logical block size (e.g., 512 bytes, for persistent memory)

UC preserves the ability to access adjacent data within the page that
hasn't gone bad, and is particularly useful for persistent memory.
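
In outline, that alternative would swap set_memory_np() for
set_memory_uc() (a sketch only; as the reply below notes,
set_memory_uc() drags in memtype/MTRR bookkeeping, and it is not
obvious the decoy-address trick would survive reserve_memtype()'s
physical-address handling):

	/* Sketch: remap the page strong-uncacheable instead of not-present. */
	if (set_memory_uc(decoy_addr, 1))
		pr_warn("Could not mark pfn=0x%lx UC in 1:1 map\n", pfn);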

---
Robert Elliott, HPE Persistent Memory

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-23 22:19       ` Elliott, Robert (Persistent Memory)
  (?)
@ 2017-06-27 22:04         ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-27 22:04 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Borislav Petkov
  Cc: Hansen, Dave, x86, linux-kernel, linux-mm, Yazen Ghannam,
	Naoya Horiguchi, linux-nvdimm

> > > > +if (set_memory_np(decoy_addr, 1))
> > > > +pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n",
>
> Another concept to consider is mapping the page as UC rather than
> completely unmapping it.

UC would also avoid the speculative prefetch issue.  The Vol 3, Section 11.3 SDM says:

Strong Uncacheable (UC) -System memory locations are not cached. All reads and writes
appear on the system bus and are executed in program order without reordering. No speculative
memory accesses, pagetable walks, or prefetches of speculated branch targets are made.
This type of cache-control is useful for memory-mapped I/O devices. When used with normal
RAM, it greatly reduces processor performance.

But then I went and read the code for set_memory_uc() ... which calls "reserve_memtype()"
which does all kinds of things to avoid issues with MTRRs and other stuff.  Which all looks
really more complex than we need just here.

> The uncorrectable error scope could be smaller than a page size, like:
> * memory ECC width (e.g., 8 bytes)
> * cache line size (e.g., 64 bytes)
> * block device logical block size (e.g., 512 bytes, for persistent memory)
>
> UC preserves the ability to access adjacent data within the page that
> hasn't gone bad, and is particularly useful for persistent memory.

If you want to dig into the non-poisoned pieces of the page later it might be
better to set up a new scratch UC mapping to do that.
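
Such a scratch mapping might be built along these lines (a sketch;
ioremap_uc() and memcpy_fromio() are real interfaces, but whether
pointing them at ordinary RAM like this is sound would need review):

	void __iomem *scratch;

	/* Map the poisoned page UC at a fresh virtual address. */
	scratch = ioremap_uc(pfn << PAGE_SHIFT, PAGE_SIZE);
	if (scratch) {
		/*
		 * ... read the non-poisoned ranges via memcpy_fromio(),
		 * carefully skipping the known-bad offsets ...
		 */
		iounmap(scratch);
	}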

My takeaway from Dan's comments on unpoisoning is that this isn't the context
that he wants to do that.  He'd rather wait until he has somebody overwriting the
page with fresh data.

So I think I'd like to keep the patch as-is.

-Tony

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-27 22:04         ` Luck, Tony
@ 2017-06-27 22:09           ` Dan Williams
  -1 siblings, 0 replies; 49+ messages in thread
From: Dan Williams @ 2017-06-27 22:09 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Elliott, Robert (Persistent Memory),
	Borislav Petkov, Hansen, Dave, Naoya Horiguchi, x86, linux-mm,
	linux-kernel, Yazen Ghannam, Kani, Toshimitsu, linux-nvdimm

On Tue, Jun 27, 2017 at 3:04 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> > > > +if (set_memory_np(decoy_addr, 1))
>> > > > +pr_warn("Could not invalidate pfn=0x%lx from 1:1 map \n",
>>
>> Another concept to consider is mapping the page as UC rather than
>> completely unmapping it.
>
> UC would also avoid the speculative prefetch issue.  The Vol 3, Section 11.3 SDM says:
>
> Strong Uncacheable (UC) -System memory locations are not cached. All reads and writes
> appear on the system bus and are executed in program order without reordering. No speculative
> memory accesses, pagetable walks, or prefetches of speculated branch targets are made.
> This type of cache-control is useful for memory-mapped I/O devices. When used with normal
> RAM, it greatly reduces processor performance.
>
> But then I went and read the code for set_memory_uc() ... which calls "reserve_memtype()"
> which does all kinds of things to avoid issues with MTRRs and other stuff.  Which all looks
> really more complex than we need just here.
>
>> The uncorrectable error scope could be smaller than a page size, like:
>> * memory ECC width (e.g., 8 bytes)
>> * cache line size (e.g., 64 bytes)
>> * block device logical block size (e.g., 512 bytes, for persistent memory)
>>
>> UC preserves the ability to access adjacent data within the page that
>> hasn't gone bad, and is particularly useful for persistent memory.
>
> If you want to dig into the non-poisoned pieces of the page later it might be
> better to set up a new scratch UC mapping to do that.
>
> My takeaway from Dan's comments on unpoisoning is that this isn't the context
> that he wants to do that.  He'd rather wait until he has somebody overwriting the
> page with fresh data.
>
> So I think I'd like to keep the patch as-is.

Yes, the persistent-memory poison interactions should be handled
separately and not hold up this patch for the normal system-memory
case. We might dove-tail support for this into stray write protection
where we unmap all of pmem while nothing in the kernel is actively
accessing it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* git send-email (w/o Cc: stable)
  2017-06-22  9:39       ` Borislav Petkov
@ 2017-06-29 22:11         ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-06-29 22:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Naoya Horiguchi, x86, linux-mm, linux-kernel, Yazen Ghannam

On Thu, Jun 22, 2017 at 11:39:05AM +0200, Borislav Petkov wrote:
> On Wed, Jun 21, 2017 at 10:47:40AM -0700, Luck, Tony wrote:
> > I would if I could work out how to use it. From reading the manual
> > page there seem to be a few options to this, but none of them appear
> > to just drop a specific address (apart from my own). :-(
> 
> $ git send-email --to ... --cc ... --cc ... --suppress-cc=all ...
> 
> That should send only to the ones you have in --to and --cc and suppress
> the rest.
> 
> Do a
> 
> $ git send-email -v --dry-run --to ... --cc ... --cc ... --suppress-cc=all ...
> 
> to see what it is going to do.

So there is a "--cc-cmd" option that can do the same as those "--cc" arguments.
Combine that with --suppress-cc=bodycc and things get a bit more automated.

In my .gitconfig:

[sendemail]
	suppresscc = bodycc
	ccCmd = /home/agluck/bin/sendemail.ccCmd

and the command is some sed(1) to grab the Cc: lines except the
stable@vger.kernel.org one:

sed -n \
	-e '/Cc: stable@vger.kernel.org/d' \
	-e '/^Cc: /s///p' \
	-e '/^---/q' $1

-Tony

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: git send-email (w/o Cc: stable)
  2017-06-29 22:11         ` Luck, Tony
@ 2017-06-30  7:08           ` Borislav Petkov
  -1 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-06-30  7:08 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Dave Hansen, Naoya Horiguchi, x86, linux-mm, linux-kernel,
	Yazen Ghannam, git

On Thu, Jun 29, 2017 at 03:11:37PM -0700, Luck, Tony wrote:
> So there is a "--cc-cmd" option that can do the same as those "-cc" arguments.
> Combine that with --suppress-cc=bodycc and things get a bit more automated.

Yeah, whatever works for you.

I did play with cc-cmd somewhat but can't be bothered to generate the CC
list by hand each time.

I'd prefer if that switch:

	--suppress-cc=<category>

had the obvious <category> of single email address too:

	--suppress-cc=stable@vger.kernel.org

so that we can send patches and unconditionally suppress only that
single recipient from the CC list.

And maybe there is a way...

Let me CC the git ML.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-06-27 22:09           ` Dan Williams
@ 2017-08-16 17:18             ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-08-16 17:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Luck, Borislav Petkov, Dave Hansen, Naoya Horiguchi,
	Elliott, Robert (Persistent Memory),
	x86, linux-mm, linux-kernel

From: Tony Luck <tony.luck@intel.com>

Speculative processor accesses may reference any memory that has a
valid page table entry.  While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will be then set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out otherwise we may trigger the
very problem we are trying to avoid.  We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".

Thanks to Dave Hansen for reviewing several iterations of this.

Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Elliott, Robert (Persistent Memory)" <elliott@hpe.com>
Cc: x86@kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Full previous thread here:
http://marc.info/?l=linux-mm&m=149860136413338&w=2 but the Cliff notes
are: Discussion on this stalled out at the end of June.  Robert Elliott
had raised questions on whether there needed to be a method to re-enable
the 1:1 mapping if the poison was cleared. I replied that would be a good
follow-on patch when we have a way to clear poison. Robert also asked
whether this needs to integrate with the handling of poison in NVDIMMs,
But discussions with Dan Williams ended up concluding that this code is
executed much earlier (right as the fault is detected) than the NVDIMM
code is prepared to take action. Dan thought this patch could move ahead.

 arch/x86/include/asm/page_64.h   |  4 ++++
 arch/x86/kernel/cpu/mcheck/mce.c | 43 ++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_inline.h        |  6 ++++++
 mm/memory-failure.c              |  2 ++
 4 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index b4a0d43248cf..b50df06ad251 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -51,6 +51,10 @@ static inline void clear_page(void *page)
 
 void copy_page(void *to, void *from);
 
+#ifdef CONFIG_X86_MCE
+#define arch_unmap_kpfn arch_unmap_kpfn
+#endif
+
 #endif	/* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6dde0497efc7..3b413065c613 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -51,6 +51,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -1051,6 +1052,48 @@ static int do_memory_failure(struct mce *m)
 	return ret;
 }
 
+#if defined(arch_unmap_kpfn) && defined(CONFIG_MEMORY_FAILURE)
+
+void arch_unmap_kpfn(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+
+	/*
+	 * Unmap this page from the kernel 1:1 mappings to make sure
+	 * we don't log more errors because of speculative access to
+	 * the page.
+	 * We would like to just call:
+	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_np() not checking whether we passed
+	 * a legal address.
+	 */
+
+/*
+ * Build time check to see if we have a spare virtual bit. Don't want
+ * to leave this until run time because most developers don't have a
+ * system that can exercise this code path. This will only become a
+ * problem if/when we move beyond 5-level page tables.
+ *
+ * Hard code "9" here because cpp doesn't grok ilog2(PTRS_PER_PGD)
+ */
+#if PGDIR_SHIFT + 9 < 63
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+#else
+#error "no unused virtual bit available"
+#endif
+
+	if (set_memory_np(decoy_addr, 1))
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+
+}
+#endif
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e030a68ead7e..25438b2b6f22 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -126,4 +126,10 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef arch_unmap_kpfn
+extern void arch_unmap_kpfn(unsigned long pfn);
+#else
+static __always_inline void arch_unmap_kpfn(unsigned long pfn) { }
+#endif
+
 #endif
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cd3b3569af8..88366626c0b7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,6 +1146,8 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 		return 0;
 	}
 
+	arch_unmap_kpfn(pfn);
+
 	orig_head = hpage = compound_head(p);
 	num_poisoned_pages_inc();
 
-- 
2.11.0
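
To make the decoy-address construction concrete, a worked example
(assuming the classic 4-level PAGE_OFFSET of 0xffff880000000000; the
exact base depends on kernel configuration):

	pfn                   = 0x12345
	pfn << PAGE_SHIFT     = 0x0000000012345000
	PAGE_OFFSET ^ BIT(63) = 0xffff880000000000 ^ 0x8000000000000000
	                      = 0x7fff880000000000
	decoy_addr            = 0x7fff880012345000

The software page-table walk indexes the tables with bits 47:12 only,
so the decoy resolves to the same pte as the canonical
0xffff880012345000 without that canonical address ever sitting in a
register.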

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [tip:x86/mm] x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-08-16 17:18             ` Luck, Tony
  (?)
@ 2017-08-17 10:19             ` tip-bot for Tony Luck
  -1 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Tony Luck @ 2017-08-17 10:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, dvlasenk, luto, mingo, hpa, bp, peterz, brgerst, dave.hansen,
	torvalds, elliott, tglx, akpm, linux-kernel, jpoimboe,
	n-horiguchi, tony.luck

Commit-ID:  ce0fa3e56ad20f04d8252353dcd24e924abdafca
Gitweb:     http://git.kernel.org/tip/ce0fa3e56ad20f04d8252353dcd24e924abdafca
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Wed, 16 Aug 2017 10:18:03 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 17 Aug 2017 10:30:49 +0200

x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

Speculative processor accesses may reference any memory that has a
valid page table entry.  While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will be then set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out otherwise we may trigger the
very problem we are trying to avoid.  We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".

Thanks to Dave Hansen for reviewing several iterations of this.

Also see:

  http://marc.info/?l=linux-mm&m=149860136413338&w=2

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/page_64.h   |  4 ++++
 arch/x86/kernel/cpu/mcheck/mce.c | 43 ++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_inline.h        |  6 ++++++
 mm/memory-failure.c              |  2 ++
 4 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index b4a0d43..b50df06 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -51,6 +51,10 @@ static inline void clear_page(void *page)
 
 void copy_page(void *to, void *from);
 
+#ifdef CONFIG_X86_MCE
+#define arch_unmap_kpfn arch_unmap_kpfn
+#endif
+
 #endif	/* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6dde049..3b413065 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -51,6 +51,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/set_memory.h>
 
 #include "mce-internal.h"
 
@@ -1051,6 +1052,48 @@ static int do_memory_failure(struct mce *m)
 	return ret;
 }
 
+#if defined(arch_unmap_kpfn) && defined(CONFIG_MEMORY_FAILURE)
+
+void arch_unmap_kpfn(unsigned long pfn)
+{
+	unsigned long decoy_addr;
+
+	/*
+	 * Unmap this page from the kernel 1:1 mappings to make sure
+	 * we don't log more errors because of speculative access to
+	 * the page.
+	 * We would like to just call:
+	 *	set_memory_np((unsigned long)pfn_to_kaddr(pfn), 1);
+	 * but doing that would radically increase the odds of a
+	 * speculative access to the poison page because we'd have
+	 * the virtual address of the kernel 1:1 mapping sitting
+	 * around in registers.
+	 * Instead we get tricky.  We create a non-canonical address
+	 * that looks just like the one we want, but has bit 63 flipped.
+	 * This relies on set_memory_np() not checking whether we passed
+	 * a legal address.
+	 */
+
+/*
+ * Build time check to see if we have a spare virtual bit. Don't want
+ * to leave this until run time because most developers don't have a
+ * system that can exercise this code path. This will only become a
+ * problem if/when we move beyond 5-level page tables.
+ *
+ * Hard code "9" here because cpp doesn't grok ilog2(PTRS_PER_PGD)
+ */
+#if PGDIR_SHIFT + 9 < 63
+	decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));
+#else
+#error "no unused virtual bit available"
+#endif
+
+	if (set_memory_np(decoy_addr, 1))
+		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+
+}
+#endif
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e030a68..25438b2 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -126,4 +126,10 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef arch_unmap_kpfn
+extern void arch_unmap_kpfn(unsigned long pfn);
+#else
+static __always_inline void arch_unmap_kpfn(unsigned long pfn) { }
+#endif
+
 #endif
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cd3b35..8836662 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,6 +1146,8 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 		return 0;
 	}
 
+	arch_unmap_kpfn(pfn);
+
 	orig_head = hpage = compound_head(p);
 	num_poisoned_pages_inc();
 

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-08-16 17:18             ` Luck, Tony
@ 2017-08-17 22:09               ` Andrew Morton
  -1 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2017-08-17 22:09 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Dave Hansen, Naoya Horiguchi, Elliott,
	Robert (Persistent Memory),
	x86, linux-mm, linux-kernel

On Wed, 16 Aug 2017 10:18:03 -0700 "Luck, Tony" <tony.luck@intel.com> wrote:

> Speculative processor accesses may reference any memory that has a
> valid page table entry.  While a speculative access won't generate
> a machine check, it will log the error in a machine check bank. That
> could cause escalation of a subsequent error since the overflow bit
> will be then set in the machine check bank status register.
> 
> Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
> address of the page we want to map out otherwise we may trigger the
> very problem we are trying to avoid.  We use a non-canonical address
> that passes through the usual Linux table walking code to get to the
> same "pte".
> 
> Thanks to Dave Hansen for reviewing several iterations of this.

It's unclear (to lil ole me) what the end-user-visible effects of this
are.

Could we please have a description of that?  So a) people can
understand your decision to cc:stable and b) people whose kernels are
misbehaving can use your description to decide whether your patch might
fix the issue their users are reporting.

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-08-17 22:09               ` Andrew Morton
@ 2017-08-17 22:29                 ` Elliott, Robert (Persistent Memory)
  -1 siblings, 0 replies; 49+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-08-17 22:29 UTC (permalink / raw)
  To: Andrew Morton, Luck, Tony
  Cc: Borislav Petkov, Dave Hansen, Naoya Horiguchi, x86, linux-mm,
	linux-kernel



> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Thursday, August 17, 2017 5:10 PM
> To: Luck, Tony <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@suse.de>; Dave Hansen <dave.hansen@intel.com>;
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>; Elliott, Robert (Persistent
> Memory) <elliott@hpe.com>; x86@kernel.org; linux-mm@kvack.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1
> mappings of poison pages
> 
> On Wed, 16 Aug 2017 10:18:03 -0700 "Luck, Tony" <tony.luck@intel.com>
> wrote:
> 
> > Speculative processor accesses may reference any memory that has a
> > valid page table entry.  While a speculative access won't generate
> > a machine check, it will log the error in a machine check bank. That
> > could cause escalation of a subsequent error since the overflow bit
> > will be then set in the machine check bank status register.
> >
> > Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
> > address of the page we want to map out otherwise we may trigger the
> > very problem we are trying to avoid.  We use a non-canonical address
> > that passes through the usual Linux table walking code to get to the
> > same "pte".
> >
> > Thanks to Dave Hansen for reviewing several iterations of this.
> 
> It's unclear (to lil ole me) what the end-user-visible effects of this
> are.
> 
> Could we please have a description of that?  So a) people can
> understand your decision to cc:stable and b) people whose kernels are
> misbehaving can use your description to decide whether your patch might
> fix the issue their users are reporting.

In general, the system is subject to halting due to uncorrectable
memory errors at addresses that software is not even accessing.  

The first error doesn't cause the crash, but if a second error happens
before the machine check handler services the first one, it'll find
the Overflow bit set and won't know what errors or how many errors
happened (e.g., it might have been problems in an instruction fetch,
and the instructions the CPU is slated to run are bogus).  Halting is 
the only safe thing to do.

For persistent memory, the BIOS reports known-bad addresses in the
ACPI ARS (address range scrub) table.  They are likely to keep
reappearing every boot since it is persistent memory, so you can't
just reboot and hope they go away.  Software is supposed to avoid
reading those addresses until it fixes them (e.g., writes new data
to those locations).  Even if it follows this rule, the system can
still crash due to speculative reads (e.g., prefetches) touching
those addresses.

Tony's patch marks those addresses in the page tables so the CPU
won't speculatively try to read them.

---
Robert Elliott, HPE Persistent Memory

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  2017-08-17 22:09               ` Andrew Morton
@ 2017-08-17 23:32                 ` Luck, Tony
  -1 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2017-08-17 23:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Borislav Petkov, Hansen, Dave, Naoya Horiguchi, Elliott,
	Robert (Persistent Memory),
	x86, linux-mm, linux-kernel

> It's unclear (to lil ole me) what the end-user-visible effects of this
> are.
>
> Could we please have a description of that?  So a) people can
> understand your decision to cc:stable and b) people whose kernels are
> misbehaving can use your description to decide whether your patch might
> fix the issue their users are reporting.

Ingo already applied this to the tip tree, so it's too late to fix the commit message :-(

A very, very unlucky end user with a system that supports machine check recovery
(Xeon E7, or Xeon-SP Platinum) and that has recovered from one or more uncorrected
memory errors (lucky so far) might find a subsequent uncorrected memory error flagged
as fatal, because the machine check bank that should log the error is already occupied
by a log caused by a speculative access to one of the earlier uncorrected errors (the
unlucky part).

We haven't seen this happen at the Linux OS level, but it is a theoretical possibility.
[Some BIOSes that map physical memory 1:1 have seen this while doing eMCA processing
for the first error: as soon as they load the address of the error from the MCi_ADDR
register, they are vulnerable to some speculative access dereferencing the register
holding that address and setting the Overflow bit in the machine check bank that still
holds the original log.]
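
(For illustration, roughly where the hazardous step sits -- a simplified
sketch, not the actual eMCA or mce.c code.  rdmsrl() and
MSR_IA32_MCx_ADDR() are real kernel helpers; the function and message
are made up:)

#include <asm/msr.h>
#include <asm/page.h>
#include <linux/printk.h>

static void log_bank_addr(int bank)
{
	u64 paddr;
	void *vaddr;

	rdmsrl(MSR_IA32_MCx_ADDR(bank), paddr);

	/*
	 * Hazard: the moment the 1:1-map virtual address of the poisoned
	 * page sits in a register, a speculative load through it can log
	 * a fresh error into this same bank, setting the Overflow bit on
	 * top of the original log.
	 */
	vaddr = __va(paddr & PAGE_MASK);

	pr_info("bank %d: paddr=%llx vaddr=%p\n", bank, paddr, vaddr);
}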

-Tony


end of thread, other threads:[~2017-08-17 23:32 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-16 19:02 [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages Luck, Tony
2017-06-19 18:01 ` Borislav Petkov
2017-06-21 17:47   ` Luck, Tony
2017-06-21 19:59     ` Elliott, Robert (Persistent Memory)
2017-06-21 20:19       ` Luck, Tony
2017-06-22  9:39     ` Borislav Petkov
2017-06-29 22:11       ` git send-email (w/o Cc: stable) Luck, Tony
2017-06-30  7:08         ` Borislav Petkov
2017-06-23 22:19     ` [PATCH] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages Elliott, Robert (Persistent Memory)
2017-06-27 22:04       ` Luck, Tony
2017-06-27 22:09         ` Dan Williams
2017-08-16 17:18           ` [PATCH-resend] " Luck, Tony
2017-08-17 10:19             ` [tip:x86/mm] x86/mm, " tip-bot for Tony Luck
2017-08-17 22:09             ` [PATCH-resend] " Andrew Morton
2017-08-17 22:29               ` Elliott, Robert (Persistent Memory)
2017-08-17 23:32               ` Luck, Tony
2017-06-21  2:12 ` [PATCH] " Naoya Horiguchi
2017-06-21 17:54   ` Luck, Tony
2017-06-21 19:47     ` Elliott, Robert (Persistent Memory)
2017-06-21 20:30       ` Luck, Tony
2017-06-23  5:07         ` Dan Williams
2017-06-23 20:59           ` Luck, Tony
