linux-nvdimm.lists.01.org archive mirror
* [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
@ 2020-04-10 17:49 Dan Williams
  2020-04-18  0:12 ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-04-10 17:49 UTC (permalink / raw)
  To: tglx, mingo
  Cc: x86, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tony Luck, Erwin Tsaur, linux-kernel, linux-nvdimm

The original memcpy_mcsafe() implementation satisfied two primary
concerns. It provided a copy routine that avoided known unrecoverable
scenarios (poison consumption via fast-string instructions / accesses
across cacheline boundaries), and it provided a fallback to fast plain
memcpy if the platform did not indicate recovery capability.

However, the platform indication for recovery capability is not
architectural, so it was always going to require an ever-growing list of
opt-in quirks. Also, ongoing hardware improvements to recoverability
mean that the cross-cacheline-boundary and fast-string poison
consumption scenarios are no longer concerns on newer platforms. The
introduction of memcpy_mcsafe_fast(), and resulting reworks, recovers
performance for these newer CPUs, but without the maintenance burden of
an opt-in whitelist.

With memcpy_mcsafe_fast() the existing opt-in becomes a blacklist for
CPUs that benefit from the more careful, slower implementation. Every
other CPU gets a fast implementation that is recoverable whenever the
platform supports recovery.

With this in place memcpy_mcsafe() never falls back to plain memcpy().
It can be used in any scenario where the caller needs guarantees that
source or destination access faults / exceptions will be handled.
Systems without recovery support continue to default to a fast
implementation while newer systems both default to fast, and support
recovery, without needing an explicit quirk.

Cc: x86@kernel.org
Cc: <stable@vger.kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Reported-by: Erwin Tsaur <erwin.tsaur@intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur@intel.com>
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Note that I marked this for stable and included a Fixes tag because it
is arguably a regression that old kernels stop recovering from poison
consumption on new hardware.
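For reference, the resulting dispatch is easy to model outside the
kernel. The following is a minimal userspace sketch, not kernel code:
the static key is approximated by a plain boolean, and both copy
variants are stubbed with memcpy() since the exception handling itself
cannot be reproduced in portable C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Models static_branch_unlikely(&mcsafe_slow_key): set only by
 * platform quirks; every other CPU takes the fast path. */
static bool mcsafe_slow_key;

static unsigned long memcpy_mcsafe_slow(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt); /* stand-in for the careful, cacheline-aware copy */
	return 0;              /* 0 == all bytes copied */
}

static unsigned long memcpy_mcsafe_fast(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt); /* stand-in for the rep-movsb based copy */
	return 0;
}

/* With the rework, memcpy_mcsafe() never falls back to a plain,
 * unrecoverable memcpy(): both branches handle faults. */
static unsigned long memcpy_mcsafe(void *dst, const void *src, size_t cnt)
{
	if (mcsafe_slow_key)
		return memcpy_mcsafe_slow(dst, src, cnt);
	return memcpy_mcsafe_fast(dst, src, cnt);
}
```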

 arch/x86/include/asm/string_64.h         |   24 ++++++-------
 arch/x86/include/asm/uaccess_64.h        |    7 +---
 arch/x86/kernel/cpu/mce/core.c           |    6 ++-
 arch/x86/kernel/quirks.c                 |    4 +-
 arch/x86/lib/memcpy_64.S                 |   57 +++++++++++++++++++++++++++---
 arch/x86/lib/usercopy_64.c               |    6 +--
 tools/objtool/check.c                    |    3 +-
 tools/perf/bench/mem-memcpy-x86-64-lib.c |    8 +---
 tools/testing/nvdimm/test/nfit.c         |    2 +
 9 files changed, 75 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..07840fa3582a 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -83,21 +83,23 @@ int strcmp(const char *cs, const char *ct);
 #endif
 
 #define __HAVE_ARCH_MEMCPY_MCSAFE 1
-__must_check unsigned long __memcpy_mcsafe(void *dst, const void *src,
+__must_check unsigned long memcpy_mcsafe_slow(void *dst, const void *src,
 		size_t cnt);
-DECLARE_STATIC_KEY_FALSE(mcsafe_key);
+__must_check unsigned long memcpy_mcsafe_fast(void *dst, const void *src,
+		size_t cnt);
+DECLARE_STATIC_KEY_FALSE(mcsafe_slow_key);
 
 /**
- * memcpy_mcsafe - copy memory with indication if a machine check happened
+ * memcpy_mcsafe - copy memory with indication if an exception / fault happened
  *
  * @dst:	destination address
  * @src:	source address
  * @cnt:	number of bytes to copy
  *
- * Low level memory copy function that catches machine checks
- * We only call into the "safe" function on systems that can
- * actually do machine check recovery. Everyone else can just
- * use memcpy().
+ * The slow version is opted into by platform quirks. The fast version
+ * is equivalent to memcpy() regardless of CPU machine-check-recovery
+ * capability, but may still fall back to the slow version if the CPU
+ * lacks fast-string instruction support.
  *
  * Return 0 for success, or number of bytes not copied if there was an
  * exception.
@@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
 memcpy_mcsafe(void *dst, const void *src, size_t cnt)
 {
 #ifdef CONFIG_X86_MCE
-	if (static_branch_unlikely(&mcsafe_key))
-		return __memcpy_mcsafe(dst, src, cnt);
-	else
+	if (static_branch_unlikely(&mcsafe_slow_key))
+		return memcpy_mcsafe_slow(dst, src, cnt);
 #endif
-		memcpy(dst, src, cnt);
-	return 0;
+	return memcpy_mcsafe_fast(dst, src, cnt);
 }
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 5cd1caa8bc65..f8c0d38c3f45 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -52,12 +52,7 @@ copy_to_user_mcsafe(void *to, const void *from, unsigned len)
 	unsigned long ret;
 
 	__uaccess_begin();
-	/*
-	 * Note, __memcpy_mcsafe() is explicitly used since it can
-	 * handle exceptions / faults.  memcpy_mcsafe() may fall back to
-	 * memcpy() which lacks this handling.
-	 */
-	ret = __memcpy_mcsafe(to, from, len);
+	ret = memcpy_mcsafe(to, from, len);
 	__uaccess_end();
 	return ret;
 }
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c4f949611e4..6bf94d39dc7f 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2579,13 +2579,13 @@ static void __init mcheck_debugfs_init(void)
 static void __init mcheck_debugfs_init(void) { }
 #endif
 
-DEFINE_STATIC_KEY_FALSE(mcsafe_key);
-EXPORT_SYMBOL_GPL(mcsafe_key);
+DEFINE_STATIC_KEY_FALSE(mcsafe_slow_key);
+EXPORT_SYMBOL_GPL(mcsafe_slow_key);
 
 static int __init mcheck_late_init(void)
 {
 	if (mca_cfg.recovery)
-		static_branch_inc(&mcsafe_key);
+		static_branch_inc(&mcsafe_slow_key);
 
 	mcheck_debugfs_init();
 	cec_init();
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 896d74cb5081..89c88d9de5c4 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -636,7 +636,7 @@ static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev)
 	pci_read_config_dword(pdev, 0x84, &capid0);
 
 	if (capid0 & 0x10)
-		static_branch_inc(&mcsafe_key);
+		static_branch_inc(&mcsafe_slow_key);
 }
 
 /* Skylake */
@@ -653,7 +653,7 @@ static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev)
 	 * enabled, so memory machine check recovery is also enabled.
 	 */
 	if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0))
-		static_branch_inc(&mcsafe_key);
+		static_branch_inc(&mcsafe_slow_key);
 
 }
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap);
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 56b243b14c3a..b5e4fc1cf99d 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -189,11 +189,18 @@ SYM_FUNC_END(memcpy_orig)
 MCSAFE_TEST_CTL
 
 /*
- * __memcpy_mcsafe - memory copy with machine check exception handling
- * Note that we only catch machine checks when reading the source addresses.
- * Writes to target are posted and don't generate machine checks.
+ * memcpy_mcsafe_slow() - memory copy with exception handling
+ *
+ * In contrast to memcpy_mcsafe_fast() this version is careful to
+ * never perform a read across a cacheline boundary, and not use
+ * fast-string instruction sequences which are known to be unrecoverable
+ * on CPUs identified by mcsafe_slow_key.
+ *
+ * Note that this only catches machine check exceptions when reading the
+ * source addresses.  Writes to target are posted and don't generate
+ * machine checks. However this does handle protection faults on writes.
  */
-SYM_FUNC_START(__memcpy_mcsafe)
+SYM_FUNC_START(memcpy_mcsafe_slow)
 	cmpl $8, %edx
 	/* Less than 8 bytes? Go to byte copy loop */
 	jb .L_no_whole_words
@@ -260,8 +267,8 @@ SYM_FUNC_START(__memcpy_mcsafe)
 	xorl %eax, %eax
 .L_done:
 	ret
-SYM_FUNC_END(__memcpy_mcsafe)
-EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
+SYM_FUNC_END(memcpy_mcsafe_slow)
+EXPORT_SYMBOL_GPL(memcpy_mcsafe_slow)
 
 	.section .fixup, "ax"
 	/*
@@ -296,4 +303,42 @@ EXPORT_SYMBOL_GPL(__memcpy_mcsafe)
 	_ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
 	_ASM_EXTABLE(.L_write_words, .E_write_words)
 	_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
+
+/*
+ * memcpy_mcsafe_fast - memory copy with exception handling
+ *
+ * Fast string copy + exception handling. If the CPU does support
+ * machine check exception recovery, but does not support recovering
+ * from fast-string exceptions then this CPU needs to be added to the
+ * mcsafe_slow_key set of quirks. Otherwise, absent any machine check
+ * recovery support this version should be no slower than standard
+ * memcpy.
+ */
+SYM_FUNC_START(memcpy_mcsafe_fast)
+	ALTERNATIVE "jmp memcpy_mcsafe_slow", "", X86_FEATURE_ERMS
+	movq %rdi, %rax
+	movq %rdx, %rcx
+.L_copy:
+	rep movsb
+	/* Copy successful. Return zero */
+	xorl %eax, %eax
+	ret
+SYM_FUNC_END(memcpy_mcsafe_fast)
+EXPORT_SYMBOL_GPL(memcpy_mcsafe_fast)
+
+	.section .fixup, "ax"
+.E_copy:
+	/*
+	 * On fault %rcx is updated such that the copy instruction could
+	 * optionally be restarted at the fault position, i.e. it
+	 * contains 'bytes remaining'. A non-zero return indicates error
+	 * to memcpy_mcsafe() users, or indicate short transfers to
+	 * user-copy routines.
+	 */
+	movq %rcx, %rax
+	ret
+
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_copy, .E_copy)
 #endif
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..348c9331748e 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -64,11 +64,7 @@ __visible notrace unsigned long
 mcsafe_handle_tail(char *to, char *from, unsigned len)
 {
 	for (; len; --len, to++, from++) {
-		/*
-		 * Call the assembly routine back directly since
-		 * memcpy_mcsafe() may silently fallback to memcpy.
-		 */
-		unsigned long rem = __memcpy_mcsafe(to, from, 1);
+		unsigned long rem = memcpy_mcsafe(to, from, 1);
 
 		if (rem)
 			break;
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 4768d91c6d68..bae1f77aae90 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -485,7 +485,8 @@ static const char *uaccess_safe_builtin[] = {
 	"__ubsan_handle_shift_out_of_bounds",
 	/* misc */
 	"csum_partial_copy_generic",
-	"__memcpy_mcsafe",
+	"memcpy_mcsafe_slow",
+	"memcpy_mcsafe_fast",
 	"mcsafe_handle_tail",
 	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
 	NULL
diff --git a/tools/perf/bench/mem-memcpy-x86-64-lib.c b/tools/perf/bench/mem-memcpy-x86-64-lib.c
index 4130734dde84..23e1747b3a67 100644
--- a/tools/perf/bench/mem-memcpy-x86-64-lib.c
+++ b/tools/perf/bench/mem-memcpy-x86-64-lib.c
@@ -5,17 +5,13 @@
  */
 #include <linux/types.h>
 
-unsigned long __memcpy_mcsafe(void *dst, const void *src, size_t cnt);
+unsigned long memcpy_mcsafe(void *dst, const void *src, size_t cnt);
 unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len);
 
 unsigned long mcsafe_handle_tail(char *to, char *from, unsigned len)
 {
 	for (; len; --len, to++, from++) {
-		/*
-		 * Call the assembly routine back directly since
-		 * memcpy_mcsafe() may silently fallback to memcpy.
-		 */
-		unsigned long rem = __memcpy_mcsafe(to, from, 1);
+		unsigned long rem = memcpy_mcsafe(to, from, 1);
 
 		if (rem)
 			break;
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index bf6422a6af7f..282722d96f8e 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -3136,7 +3136,7 @@ void mcsafe_test(void)
 			}
 
 			mcsafe_test_init(dst, src, 512);
-			rem = __memcpy_mcsafe(dst, src, 512);
+			rem = memcpy_mcsafe_slow(dst, src, 512);
 			valid = mcsafe_test_validate(dst, src, 512, expect);
 			if (rem == expect && valid)
 				continue;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-10 17:49 [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast Dan Williams
@ 2020-04-18  0:12 ` Dan Williams
  2020-04-18 19:42   ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-04-18  0:12 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: X86 ML, stable, Borislav Petkov, H. Peter Anvin, Peter Zijlstra,
	Tony Luck, Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm,
	Linus Torvalds

[ add Linus because he had comments the last time memcpy_mcsafe() was
reworked [1] ]

On Fri, Apr 10, 2020 at 11:06 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> The original memcpy_mcsafe() implementation satisfied two primary
> concerns. It provided a copy routine that avoided known unrecoverable
> scenarios (poison consumption via fast-string instructions / accesses
> across cacheline boundaries), and it provided a fallback to fast plain
> memcpy if the platform did not indicate recovery capability.
>
> However, the platform indication for recovery capability is not
> architectural so it was always going to require an ever growing list of
> opt-in quirks. Also, ongoing hardware improvements to recoverability
> mean that the cross-cacheline-boundary and fast-string poison
> consumption scenarios are no longer concerns on newer platforms. The
> introduction of memcpy_mcsafe_fast(), and resulting reworks, recovers
> performance for these newer CPUs, but without the maintenance burden of
> an opt-in whitelist.
>
> With memcpy_mcsafe_fast() the existing opt-in becomes a blacklist for
> CPUs that benefit from a more careful slower implementation. Every other
> CPU gets a fast, but optionally recoverable implementation.
>
> With this in place memcpy_mcsafe() never falls back to plain memcpy().
> It can be used in any scenario where the caller needs guarantees that
> source or destination access faults / exceptions will be handled.
> Systems without recovery support continue to default to a fast
> implementation while newer systems both default to fast, and support
> recovery, without needing an explicit quirk.
>
> Cc: x86@kernel.org
> Cc: <stable@vger.kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Tony Luck <tony.luck@intel.com>
> Reported-by: Erwin Tsaur <erwin.tsaur@intel.com>
> Tested-by: Erwin Tsaur <erwin.tsaur@intel.com>
> Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Note that I marked this for stable and included a Fixes tag because it
> is arguably a regression that old kernels stop recovering from poison
> consumption on new hardware.

This still applies cleanly to tip/master as of yesterday. Any concerns?

[1]: https://lore.kernel.org/lkml/CA+55aFwJ=DvvcrM725KOYAF_c=uKiQcuHz4taPjbKveOVPycKA@mail.gmail.com/



* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-18  0:12 ` Dan Williams
@ 2020-04-18 19:42   ` Linus Torvalds
  0 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2020-04-18 19:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, X86 ML, stable, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, Tony Luck, Erwin Tsaur,
	Linux Kernel Mailing List, linux-nvdimm

On Fri, Apr 17, 2020 at 5:12 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> > @@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
> >  memcpy_mcsafe(void *dst, const void *src, size_t cnt)
> >  {
> >  #ifdef CONFIG_X86_MCE
> > -       if (static_branch_unlikely(&mcsafe_key))
> > -               return __memcpy_mcsafe(dst, src, cnt);
> > -       else
> > +       if (static_branch_unlikely(&mcsafe_slow_key))
> > +               return memcpy_mcsafe_slow(dst, src, cnt);
> >  #endif
> > -               memcpy(dst, src, cnt);
> > -       return 0;
> > +       return memcpy_mcsafe_fast(dst, src, cnt);
> >  }

It strikes me that I see no advantages to making this an inline function at all.

Even for the good case - where it turns into just a memcpy because MCE
is entirely disabled - it doesn't seem to matter.

The only case that really helps is when the memcpy can be turned into
a single access. Which - and I checked - does exist, with people doing

        r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t));

to read a single 64-bit field which looks aligned to me.

But that code is incredible garbage anyway, since even on a broken
machine, there's no actual reason to use the slow variant for that
whole access that I can tell. The mcsafe copy routines do not do
anything worthwhile for a single access.
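(For illustration, the access pattern in question reduces to something
like the sketch below; the struct and field names are hypothetical
stand-ins for the code being quoted. For a naturally aligned 64-bit
field, the memcpy() collapses to a single load, which is why a careful
byte-wise copy routine adds nothing for such an access.)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct superblock {         /* hypothetical stand-in for sb(wc) */
	uint64_t seq_count; /* naturally aligned 64-bit field */
};

/* For a single aligned 8-byte field, the memcpy() below compiles down
 * to one 64-bit load; there is no multi-byte loop for a "careful"
 * variant to carve up. */
static uint64_t read_seq_count(const struct superblock *sb)
{
	uint64_t v;

	memcpy(&v, &sb->seq_count, sizeof(v));
	return v;
}
```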

So my reaction remains that a lot of this is just completely wrong and
incredibly mis-designed.

Yes, the hardware was buggy garbage. But why should we make things
worse by making the software incomprehensibly bad too?

              Linus


* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:57                   ` Luck, Tony
@ 2020-04-20 21:16                     ` Linus Torvalds
  0 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 21:16 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tsaur, Erwin,
	Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:57 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> >  (a) is a trap, not an exception - so the instruction has been done,
> > and you don't need to try to emulate it or anything to continue.
>
> Maybe for errors on the data side of the pipeline. On the instruction
> side we can usually recover from user space instruction fetches by
> just throwing away the page with the corrupted instructions and reading
> from disk into a new page. Then just point the page table to the new
> page, and hey presto, its all transparently fixed (modulo time lost fixing
> things).

That's true for things like ECC on real RAM, with traditional executables.

It's not so true of something like nvram that you execute out of
directly. There is not necessarily a disk to re-read things from.

But it's also not true of things like JIT's. They are kind of a big
thing. Asking the JIT to do "hey, I faulted at a random point, you
need to re-JIT" is no different from all the other "that's a _really_
painful recovery point, please delay it".

Sure, the JIT environment will probably just have to kill that thread
anyway, but I do think this falls under the same "you're better off
giving the _option_ to just continue and hope for the best" umbrella,
rather than forcing a non-recoverable state.

For regular ECC, I literally would like the machine to just always
continue. I'd like to be informed that there's something bad going on
(because it might be RAM going bad, but it might also be a rowhammer
attack), but the decision to kill things or not should ultimately be
the *users*, not the JIT's, not the kernel.

So the basic rule should be that you should always have the _option_
to just continue. The corrupted state might not be critical - or it
might be the ECC bits themselves, not the data.

There are situations where stopping everything is worse than "let's
continue as best we can, and inform the user with a big red blinking
light".

ECC should not make things less reliable, even if it's another 10+% of
bits that can go wrong.

It should also be noted that even a good ECC pattern _can_ miss
corruption if you're unlucky with the corruption. So the whole
black-and-white model of "ECC means you need to stop everything" is
questionable to begin with, because the signal isn't that absolute in
the first place.

So when somebody brings up a "what if I use corrupted data and make
things worse", they are making an intellectually dishonest argument.
What if you saw corrupted data and simply never caught it, because it
was an unlucky multi-bit failure?

There is no "absolute" thing about ECC. The only thing that is _never_
wrong is to report it and try to continue, and let some higher-level
entity decide what to do.

And that final decision might literally be "I ran this simulation for
2 days, I see that there's an error report, I will buy a new machine.
For now I'll use the data it generated, but I'll re-run to validate it
later".

                 Linus
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:46                 ` Linus Torvalds
@ 2020-04-20 20:57                   ` Luck, Tony
  2020-04-20 21:16                     ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-04-20 20:57 UTC (permalink / raw)
  To: Linus Torvalds, Williams, Dan J
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tsaur, Erwin,
	Linux Kernel Mailing List  <linux-kernel@vger.kernel.org>,
	linux-nvdimm

>  (a) is a trap, not an exception - so the instruction has been done,
> and you don't need to try to emulate it or anything to continue.

Maybe for errors on the data side of the pipeline. On the instruction
side we can usually recover from user space instruction fetches by
just throwing away the page with the corrupted instructions and reading
from disk into a new page. Then just point the page table to the new
page, and hey presto, it's all transparently fixed (modulo time lost fixing
things).
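
That "throw away and re-read" flow can be sketched as a toy model (all
names here are illustrative, not real kernel APIs):

```c
#include <string.h>

enum { TOY_PAGE_SIZE = 16 };

struct toy_page { unsigned char data[TOY_PAGE_SIZE]; };

/* Recover a poisoned user *text* page: abandon the corrupted copy,
 * fill a fresh page from the backing file, and repoint the "page
 * table" entry at the new copy. */
static void recover_text_poison(struct toy_page **pte,
                                struct toy_page *fresh,
                                const unsigned char *backing_file)
{
    memcpy(fresh->data, backing_file, TOY_PAGE_SIZE); /* re-read from disk */
    *pte = fresh;                  /* hey presto: transparently fixed */
}
```

The key property is that the poisoned copy is never read again; the
mapping just moves to a clean page.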

-Tony

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:45                   ` Luck, Tony
@ 2020-04-20 20:56                     ` Linus Torvalds
  0 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tsaur, Erwin,
	Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:45 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> Another X86 vendor seems to be adding something like that. See MCOMMIT
> in https://www.amd.com/system/files/TechDocs/24594.pdf

That sounds potentially very expensive.

Particularly, as you say, something like the kernel (or
virtualization) may want to at least test for it cheaply on entry or
switch or whatever.

I do think you want the mcommit kind of thing for writing, but I think
the intel model of (no longer pcommit) using a writeback instruction
with a range, and then just sfence is better than a "commit
everything" thing.

But that's just for writing things, and that's fundamentally very
different from the read side errors.

So for the read-side you'd want some kind of "lfence and report"
instruction for the "did I see load errors" check. Very cheap, like
lfence, so that there wouldn't really be a cost if somebody _really_
wants to get notified immediately.
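
A software stand-in for that hypothetical "lfence and report" check (no
such instruction exists; the sticky flag below models state that the
imagined hardware would maintain):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sticky "a load consumed poison since the last check" state; in the
 * hypothetical hardware version this would live in the CPU, not here. */
static _Atomic bool poison_seen_since_check;

/* Cheap check: order prior loads, then report-and-clear the sticky
 * error state, so callers can poll it right before an important point. */
static bool load_fence_and_report(void)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_exchange(&poison_seen_since_check, false);
}
```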

And again, it might just be part of any serializing instruction. I
don't care that much, although overloading "serializing instruction"
even more sounds like a bad idea.

               Linus

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:24               ` Dan Williams
@ 2020-04-20 20:46                 ` Linus Torvalds
  2020-04-20 20:57                   ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:25 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> ...but also some kind of barrier semantic, right? Because there are
> systems that want some guarantee about when they can commit, or
> otherwise shoot the machine if they cannot.

The optimal model would likely be a new instruction that user space
could execute to test for errors, possibly without any exception at
all (because the thing that checks for errors is also presumably the
only thing that can decide how to recover - so raising an exception
doesn't necessarily help).

Together with a way for the kernel to save/restore the exception state
on task switch (presumably in the xsave area) so that the error state
of one process doesn't affect another one. Bonus points if it's all
per-security level, so that a pure user-level error report doesn't
poison the kernel state and vice versa.

That is _very_ similar to how FPU exceptions work right now. User
space can literally do an operation that creates an error on one CPU,
get re-scheduled to another one, and take the actual signal and read
the exception state on that other CPU.

(Of course, the "not even take an exception" part is different).

An alternate very simple model that doesn't require any new
instructions and no new architecturally visible state (except of
course the actual error data) would be to just be raising a *maskable*
trap (with the Intel definition of trap vs exception: a trap happens
_after_ the instruction).

The trap could be on the next instruction if people really want to be
that precise, but I don't think it even matters. If it's delayed until
the next serializing instruction, that would probably be just fine
too.

But the important thing is that it

 (a) is a trap, not an exception - so the instruction has been done,
and you don't need to try to emulate it or anything to continue.

 (b) is maskable, so that the trap handler can decide to just mask it
and return (and set a separate flag to then handle it later)

With domain transfers either being barriers, or masking it (so NMI and
external interrupts would presumably mask it for latency reasons)?
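
The mask/latch behavior of points (a) and (b) can be modeled as a tiny
state machine (entirely hypothetical hardware; this only shows the
deferral logic):

```c
#include <stdbool.h>

struct poison_trap_state {
    bool masked;   /* delivery masked, e.g. while in NMI context */
    bool pending;  /* an error report was latched while masked   */
};

/* Hardware wants to deliver a poison report: if masked, latch it and
 * carry on; otherwise the trap handler runs after the instruction. */
static bool deliver_poison_trap(struct poison_trap_state *s)
{
    if (s->masked) {
        s->pending = true;
        return false;      /* no trap now; handle it later */
    }
    return true;           /* trap taken (post-instruction) */
}

/* On leaving the masked region: unmask and tell the caller whether a
 * deferred report needs handling. */
static bool unmask_poison_trap(struct poison_trap_state *s)
{
    bool had_pending = s->pending;
    s->masked = false;
    s->pending = false;
    return had_pending;
}
```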

I dunno. Wild handwaving. But much better than that crazy
unrecoverable machine check model.

                   Linus

* RE: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:27                 ` Linus Torvalds
@ 2020-04-20 20:45                   ` Luck, Tony
  2020-04-20 20:56                     ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-04-20 20:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tsaur,
	Erwin <erwin.tsaur@intel.com>,
	Linux Kernel Mailing List, linux-nvdimm

> By "asynchronous" I don't mean "hours later".
>
> Make it be "interrupts are enabled, before serializing instruction".
>
> Yes, we want bounded error handling latency. But that doesn't mean "synchronous"

Another X86 vendor seems to be adding something like that. See MCOMMIT
in https://www.amd.com/system/files/TechDocs/24594.pdf

But I wonder how an OS will know whether it is running some smart
MCOMMIT-aware application that can figure out what to do with bad
data, or a legacy application that should probably be stopped before
it hurts somebody.

I also wonder how expensive MCOMMIT is (since it is essentially
polling for "did any errors happen").

-Tony

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:23               ` Luck, Tony
@ 2020-04-20 20:27                 ` Linus Torvalds
  2020-04-20 20:45                   ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:27 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Erwin Tsaur,
	Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:23 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> I think they do. If the result of the wrong data has already
> been sent out the network before you process the signal, then you
> will need far smarter application software than has ever been written
> to hunt it down and stop the spread of the bogus result.

Bah. That's a completely red herring argument.

By "asynchronous" I don't mean "hours later".

Make it be "interrupts are enabled, before serializing instruction".

Yes, we want bounded error handling latency. But that doesn't mean "synchronous"

           Linus

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:07             ` Linus Torvalds
  2020-04-20 20:23               ` Luck, Tony
@ 2020-04-20 20:24               ` Dan Williams
  2020-04-20 20:46                 ` Linus Torvalds
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-04-20 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 1:07 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 20, 2020 at 12:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >  I didn't consider asynchronous to be
> > better because that means there is a gap between when the data
> > corruption is detected and when the asynchronous signal is delivered,
> > in which the result might escape the system and some external agent
> > could trust it and start acting on it.
>
> The thing is, absolutely nobody cares whether you start acting on the
> wrong data or not.
>
> Even if you use the wrong data as a pointer, and get other wrong data
> behind it (or take an exception), that doesn't change a thing: the end
> result is still the exact same thing "unpredictably wrong data". You
> didn't really make things worse by continuing, and people who care
> about some very particular case can easily add a barrier to get any
> error before they go past some important point.
>
> It's not like Intel hasn't done that before. It's how legacy FP
> instructions work - and while the legacy FP unit had horrendous
> problems, the delayed exception wasn't one of them (routing it to
> irq13 was, but that's on IBM, not Intel). Async exception handling
> simplifies a lot of problems, and in fact, if you then have explicit
> "check now" points (which the Intel FPU also had, as long as you
> didn't have that nasty irq13 routing), it simplifies software too.

Ah, now it's finally sinking in: if the model had, from the beginning,
required software to issue a barrier when it wants to resolve pending
memory errors, that would have freed up the implementation to leave
error handling to a higher-level / less constrained part of the
software stack.

>
> So there is no _advantage_ to being synchronous to the source of the
> problem.  There is only pain.
>
> There are lots and lots of disadvantages, most of them being
> variations - on many different levesl of - "it hit at a point where I
> cannot recover".
>
> The microcode example is one such example that Intel engineers should
> really take to heart. But the real take-away is that it is only _one_
> such example. And making the microcode able to recover doesn't fix all
> the _other_ places that aren't able to recover at that point, or fix
> the fundamental pain.
>
> For example, for the kernel, NMI's tend to be pretty special things,
> and we simply may not be able to handle errors inside an NMI context.
> It's not all that unlike the intel microcode example, just at another
> level.
>
> But we have lots of other situations. We have random
> compiler-generated code, and we're simply not able to take exceptions
> at random stages and expect to recover. We have very strict rules
> about "this might be unsafe", and a normal memory access of a trusted
> pointer IS NOT IT.
>
> But at a higher level, something like "strnlen()" - or any other
> low-level library routine - isn't really all that different from
> microcode for most programs. We don't want to have a million copies of
> "standard library routine, except for nvram".
>
> The whole - and ONLY - point of something like nvram is that it's
> supposed to act like memory. It's part of the name, and it's very much
> part of the whole argument for why it's a good idea.

Right, and I had a short cry/laugh when realizing that the current
implementation still correctly sends SIGBUS to userspace for consumed
poison, but the kernel regressed because its special synchronous
handling was no longer correctly deployed.

>
> And the advantage of it being memory is that you are supposed to be
> able to do all those random operations that you just do normally on
> any memory region - ask for string lengths, do copies, add up numbers,
> look for patterns. Without having to do an "IO" operation for it.
>
> Taking random synchronous faults is simply not acceptable. It breaks
> that fundamental principle.
>
> Maybe you have a language with a try/catch/throw model of exception
> handling - but one where throwing exceptions is constrained to happen
> for documented operations, not any random memory access. That's very
> common, and kind of like what the kernel exception handling does by
> hand.
>
> So an allocation failure can throw an exception, but once you've
> allocated memory, the language runtime simply depends on knowing that
> it has pointer safety etc, and normal accesses won't throw an
> exception.
>
> In that kind of model, you can easily add a model where "SIGBUS sets a
> flag, we'll handle it at catch time", but that depends on things being
> able to just continue _despite_ the error.
>
> So a source-synchronous error can be really hard to handle. Exception
> handling at random points is simply really really tough in general -
> and there are few more random and more painful points than some "I
> accessed memory".
>
> So no. An ECC error is nothing like a page fault. A page fault you
> have *control* over. You can have a language without the ability to
> generate wild pointers, or you can have a language like C where wild
> pointers are possible, but they are a bug. So you can work on the
> assumption that a page fault is always entirely invisible to the
> program at hand.
>
> A synchronous ECC error doesn't have that "entirely invisible to the
> program at hand" behavior.
>
> And yes, I've asked for ECC for regular memory for regular CPU's from
> Intel for over two decades now. I do NOT want their broken machine
> check model with it. I don't even necessarily want the "correction"
> part of it - I want reporting of "your hardware / the results of your
> computation is no longer reliable". I don't want a dead machine, I
> don't want some worry about "what happens if I get an ECC error in
> NMI", I don't want any of the problems that come from taking an
> absolute "ECC errors are fatal" approach. I just want "an error
> happened", with as much information about where it happened as
> possible.

...but also some kind of barrier semantic, right? Because there are
systems that want some guarantee about when they can commit, or
otherwise shoot the machine if they cannot.

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 20:07             ` Linus Torvalds
@ 2020-04-20 20:23               ` Luck, Tony
  2020-04-20 20:27                 ` Linus Torvalds
  2020-04-20 20:24               ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2020-04-20 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Erwin Tsaur,
	Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 01:07:09PM -0700, Linus Torvalds wrote:
> On Mon, Apr 20, 2020 at 12:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >  I didn't consider asynchronous to be
> > better because that means there is a gap between when the data
> > corruption is detected and when the asynchronous signal is delivered,
> > in which the result might escape the system and some external agent
> > could trust it and start acting on it.
> 
> The thing is, absolutely nobody cares whether you start acting on the
> wrong data or not.

I think they do. If the result of the wrong data has already
been sent out the network before you process the signal, then you
will need far smarter application software than has ever been written
to hunt it down and stop the spread of the bogus result.

Stopping dead on the instruction before it consumes the data
means you can "recover" by killing just one process, or just one
VMM guest.

I'm in total agreement the machine check (especially broadcast)
was a bad choice for how to "stop on a dime". But I can't see
how you could possibly decide what to do if you let thousands
of instructions retire based on a bad data value before you even
know that it happened.

-Tony

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 19:29           ` Dan Williams
@ 2020-04-20 20:07             ` Linus Torvalds
  2020-04-20 20:23               ` Luck, Tony
  2020-04-20 20:24               ` Dan Williams
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 20:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 12:29 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
>  I didn't consider asynchronous to be
> better because that means there is a gap between when the data
> corruption is detected and when the asynchronous signal is delivered,
> in which the result might escape the system and some external agent
> could trust it and start acting on it.

The thing is, absolutely nobody cares whether you start acting on the
wrong data or not.

Even if you use the wrong data as a pointer, and get other wrong data
behind it (or take an exception), that doesn't change a thing: the end
result is still the exact same thing "unpredictably wrong data". You
didn't really make things worse by continuing, and people who care
about some very particular case can easily add a barrier to get any
error before they go past some important point.

It's not like Intel hasn't done that before. It's how legacy FP
instructions work - and while the legacy FP unit had horrendous
problems, the delayed exception wasn't one of them (routing it to
irq13 was, but that's on IBM, not Intel). Async exception handling
simplifies a lot of problems, and in fact, if you then have explicit
"check now" points (which the Intel FPU also had, as long as you
didn't have that nasty irq13 routing), it simplifies software too.

So there is no _advantage_ to being synchronous to the source of the
problem.  There is only pain.

There are lots and lots of disadvantages, most of them being
variations - on many different levels - of "it hit at a point where I
cannot recover".

The microcode example is one such example that Intel engineers should
really take to heart. But the real take-away is that it is only _one_
such example. And making the microcode able to recover doesn't fix all
the _other_ places that aren't able to recover at that point, or fix
the fundamental pain.

For example, for the kernel, NMI's tend to be pretty special things,
and we simply may not be able to handle errors inside an NMI context.
It's not all that unlike the intel microcode example, just at another
level.

But we have lots of other situations. We have random
compiler-generated code, and we're simply not able to take exceptions
at random stages and expect to recover. We have very strict rules
about "this might be unsafe", and a normal memory access of a trusted
pointer IS NOT IT.

But at a higher level, something like "strnlen()" - or any other
low-level library routine - isn't really all that different from
microcode for most programs. We don't want to have a million copies of
"standard library routine, except for nvram".

The whole - and ONLY - point of something like nvram is that it's
supposed to act like memory. It's part of the name, and it's very much
part of the whole argument for why it's a good idea.

And the advantage of it being memory is that you are supposed to be
able to do all those random operations that you just do normally on
any memory region - ask for string lengths, do copies, add up numbers,
look for patterns. Without having to do an "IO" operation for it.

Taking random synchronous faults is simply not acceptable. It breaks
that fundamental principle.

Maybe you have a language with a try/catch/throw model of exception
handling - but one where throwing exceptions is constrained to happen
for documented operations, not any random memory access. That's very
common, and kind of like what the kernel exception handling does by
hand.

So an allocation failure can throw an exception, but once you've
allocated memory, the language runtime simply depends on knowing that
it has pointer safety etc, and normal accesses won't throw an
exception.

In that kind of model, you can easily add a model where "SIGBUS sets a
flag, we'll handle it at catch time", but that depends on things being
able to just continue _despite_ the error.

So a source-synchronous error can be really hard to handle. Exception
handling at random points is simply really really tough in general -
and there are few more random and more painful points than some "I
accessed memory".

So no. An ECC error is nothing like a page fault. A page fault you
have *control* over. You can have a language without the ability to
generate wild pointers, or you can have a language like C where wild
pointers are possible, but they are a bug. So you can work on the
assumption that a page fault is always entirely invisible to the
program at hand.

A synchronous ECC error doesn't have that "entirely invisible to the
program at hand" behavior.

And yes, I've asked for ECC for regular memory for regular CPU's from
Intel for over two decades now. I do NOT want their broken machine
check model with it. I don't even necessarily want the "correction"
part of it - I want reporting of "your hardware / the results of your
computation is no longer reliable". I don't want a dead machine, I
don't want some worry about "what happens if I get an ECC error in
NMI", I don't want any of the problems that come from taking an
absolute "ECC errors are fatal" approach. I just want "an error
happened", with as much information about where it happened as
possible.

                 Linus

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 19:05         ` Linus Torvalds
@ 2020-04-20 19:29           ` Dan Williams
  2020-04-20 20:07             ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-04-20 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 12:13 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Apr 20, 2020 at 11:20 AM Dan Williams <dan.j.williams@intel.com> wrote:
[..]
> I really really detest the whole mcsafe garbage. And I absolutely
> *ABHOR* how nobody inside of Intel has apparently ever questioned the
> brokenness at a really fundamental level.
>
> That "I throw my hands in the air and just give up" thing is a
> disease. It's absolutely not "what else could we do".

So I grew up in the early part of my career validating ARM CPUs where
a data-abort was either precise or imprecise and the precise error
could be handled like a page fault as you know which instruction
faulted and how to restart the thread. So I didn't take x86 CPU
designers' word for it, I honestly thought that "hmm the x86 machine
check thingy looks like it's trying to implement precise vs imprecise
data-aborts, and precise / synchronous is maybe a good thing because
it's akin to a page fault". I didn't consider asynchronous to be
better because that means there is a gap between when the data
corruption is detected and when the asynchronous signal is delivered,
in which the result might escape the system and some external agent
could trust it and start acting on it.

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 18:20       ` Dan Williams
@ 2020-04-20 19:05         ` Linus Torvalds
  2020-04-20 19:29           ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 11:20 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> * I'm at a loss of why you seem to be suggesting that hardware should
> / could avoid all exceptions. What else could hardware do besides
> throw an exception on consumption of a naturally occurring multi-bit
> ECC error? Data is gone, and only software might know how to recover.

This is classic bogus thinking.

If Intel ever makes ECC DRAM available to everybody, there would be a
_shred_ of logic to that thinking, but right now it's some hw designer
in their mom's basement that has told you that hardware has to throw a
synchronous exception because hardware doesn't know any better.

That hardware designer really doesn't have a _clue_ about the big issues.

The fact is, a synchronous machine check exception is about the
_worst_ thing you can ever do when you encounter something like a
memory error.

It literally means that the software cannot possibly do anything sane
to recover, because the software is in some random place. The hardware
designer didn't think about the fact that the low-level access is
hidden from the software by a compiler and/or a lot of other
infrastructure - maybe microcode, maybe the OS, maybe a scripting
language, yadda yadda.

Absolutely NOBODY can recover at the level of one instruction. The
microcode people already proved that. At the level of "memcpy()", you
do not have a recovery option.

A hardware designer that tells you that you have to magically recover
at an instruction boundary fundamentally DOES NOT UNDERSTAND THE
PROBLEM.

IOW, you have completely uncritically just taken that incorrect
statement of "what else could hardware do" without questioning that
hardware designer AT ALL.

And remember, this is likely the same hardware designer that told you
that it's a good idea to then make machine checks go to every single
CPU in the system.

And this is the same hardware designer that then didn't even give you
enough information to recover.

And this is the same hardware designer that made recovery impossible
anyway, because if the error happened in microcode or in some other
situation, the microcode COULDN'T HANDLE IT EITHER!

In other words, you are talking to people WHO ARE KNOWN TO BE INCOMPETENT.

Seriously. Question them. When they tell you that "it's the only
thing we can possibly do", they do so from being incompetent, and we
have the history to PROVE it.

I don't understand why nobody else seems to be pushing back against
the completely insane and known garbage that is the Intel machine
checks. They are wrong.

The fact is, synchronous errors are the absolute worst possible
interface, exactly because they cause problems in various nasty corner
cases.

We _know_ a lot of those corner cases, for chrissake:

 - random standard library routine like "memcpy". How the hell do you
think a memcpy can recover? It can't.

 - Unaligned handling where "one" access isn't actually a single access.

 - microcode. Intel saw this problem themselves, but instead of making
people realize "oh, synchronous exceptions don't work that well" and
think about the problem, they wasted our time for decades, and then
probably spent a lot of effort in trying to make them work.

 - random generic code that isn't able to handle the fault, because IT
SHOULDN'T NEED TO CARE. Low-level filesystems, user mappings, the list
just goes on.

The only thing that can recover tends to be at a *MUCH* higher level
than one instruction access.

So the next time somebody tells you "there's nothing else we can do",
call them out on being incompetent, and call them out on the fact that
history has _shown_ that they are incompetent and wrong. Over and over
again.

I can _trivially_ point to a number of "what else could we do" that
are much better options.

 (a) let the corrupted value through, notify things ASYNCHRONOUSLY
that there were problems, and let people handle the issue later when
they are ready to do so.

Yeah, the data was corrupt - but not allowing the user to read it may
mean that the whole file is now inaccessible, even if it was just a
single ECC block that was wrong. I don't know the block-size people
use for ECC, and the fact is, software shouldn't really even need to
care. I may be able to recover from the situation at a higher level.
The data may be recoverable in other ways - including just an "I want
even corrupted data, because I might have enough context that I can
make sense of it anyway".

 (b) if you have other issues so that you don't have data at all
(maybe it was metadata that was corrupted, don't ask me how that would
happen), just return zeroes, and notify about the problem
ASYNCHRONOUSLY.

And when notifying, give as much useful information as possible: the
virtual and physical address of the data, please, in addition to maybe
lower level bank information. Add a bit for "multiple errors", so that
whoever then later tries to maybe recover, can tell if it has complete
data or not.

The OS can send a SIGBUS with that information to user land that can
then maybe recover. Or it can say "hey, I'm in some recovery mode,
I'll try to limp along with incomplete data". Sometimes "recover"
means "try to limp along, notify the user that their hw is going bad,
but try to use the data anyway".

Again, what Intel engineers actually did with the synchronous
non-recoverable thing was not "the only thing I could possibly have
done".

It was literally the ABSOLUTE WORST POSSIBLE THING they could do, with
potentially a dead machine as a result.

And now - after years of pain - you have the gall to repeat that
idiocy that you got from people who have shown themselves to be
completely wrong in the past?

Why?

Why do you take their known wrong approach as unthinking gospel? Just
because it's been handed down over generations on stone slabs?

I really really detest the whole mcsafe garbage. And I absolutely
*ABHOR* how nobody inside of Intel has apparently ever questioned the
brokenness at a really fundamental level.

That "I throw my hands in the air and just give up" thing is a
disease. It's absolutely not "what else could we do".

                  Linus
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20 17:28     ` Linus Torvalds
@ 2020-04-20 18:20       ` Dan Williams
  2020-04-20 19:05         ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-04-20 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Mon, Apr 20, 2020 at 10:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Apr 19, 2020 at 10:08 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Do we have examples of doing exception handling from C? I thought all
> > the exception handling copy routines were assembly routines?
>
> You need assembler for the actual access, but that's a _single_
> instruction - best done as inline asm.
>
> The best example of something that does *exactly* what you want to do is likely
>
>         unsafe_get_user();
>         unsafe_put_user();
>
> which basically turns into a single instruction with exception
> handling, with the exception handler jumping directly to an error
> label.
>
> Ok, so right now gcc can't do that for inline asm with outputs, so it
> generates fairly nasty code (a secondary register with the error state
> that then causes a conditional branch on it), but that's a compiler
> limitation that will eventually go away (where "eventually" means that
> it already works in LLVM with experimental patches).
>
> You could literally mis-use those helpers as-is (please don't - the
> code generation is correct, but at the very least we'd have to
> re-organize a bit to make it a better interface, ie have an
> alternative name like "unsafe_get_kernel()" for kernel pointer
> accesses).
>
> You'd have to do the alignment guarantees yourself, but there are
> examples of that in this area too (strnlen_user() does exactly that -
> aligned word accesses).

Ok, I'll take a deeper look, but my initial reaction is that this
approach could be a way to replace memcpy_mcsafe_slow(), but it would
be unfortunate for the fast case, which can just use rep; movs.

>
> So the point here is that the current interfaces are garbage, _if_ the
> whole "access a single value" is actually performance-critical.
>
> And if that is *not* the case, then the best thing to do is likely to
> just use a static call. No inlining of single instructions at all,
> just always use a function call, and then pick the function
> appropriately.
>
> Honestly, I can't imagine that the "single access" case is so
> timing-critical that the static call isn't the right model. Your use
> case is _not_ as important or common as doing user accesses.
>
> Finally, the big question is whether the completely broken hardware
> even matters. Are there actual customers that actually use the garbage
> "we can crash the machine" stuff?
>
> Because when it comes to things like nvdimms etc, the _primary_
> performance target would be getting the kernel entirely out of the
> way, and allowing databases etc to just access the damn thing
> directly.
>
> And if you allow user space to access it directly, then you just have
> to admit that it's not a software issue any more - it's the hardware
> that is terminally broken and unusable garbage. It's not even
> interesting to work around things in the kernel, because user space
> can just crash the machine directly.
>
> This is why I absolutely detest that thing so much. The hardware is
> _so_ fundamentally broken that I have always considered the kernel
> workarounds to basically be "staging" level stuff - good enough for
> some random testing of known-broken stuff, but not something that
> anybody sane should ever use.
>
> So my preference would actually be to just call the broken cases to be
> largely ignored, at least from a performance angle. If you can only
> access it through the kernel, the biggest performance limitation is
> that you cannot do any DAX-like thing at all safely, so then the
> performance of some kernel accessors is completely secondary and
> meaningless. When a kernel entry/exit takes a few thousand cycles on
> the broken hardware (due to _other_ bugs), what's the point about
> worrying about trying to inline some single access to the nvdimm?
>
> Did the broken hardware ever spread out into the general public?
> Because if not, then the proper thing to do might be to just make it a
> compile-time option for the (few) customers that signed up for testing
> the initial broken stuff, and make the way _forward_ be a clean model
> without the need to worry about any exceptions at all.

There are 3 classes of hardware; all of them trigger exceptions*, it's
just the recoverability of those exceptions that differs:

1/ Recovery not supported: all exceptions report PCC=1 (processor
context corrupted) and the kernel decides to panic.

2/ Recovery supported: user processes are killed on consumption, and
the kernel panics on consumption outside of an exception-handler
instrumented code path. Unfortunately there is no architectural way to
distinguish class 1 from class 2 outside of a PCI quirk whitelist. The
kernel prior to memcpy_mcsafe() just unconditionally enabled recovery
for user consumption, panicked on kernel consumption, and hoped that
exceptions were sent with PCC=0. The introduction of memcpy_mcsafe() to
handle poison consumption from kernel space without a panic added the
not-pretty caveat that some platforms needed a special snowflake memcpy
to avoid known scenarios** that trigger PCC=1 when they could otherwise
trigger PCC=0. So a PCI quirk whitelist was added for those.

3/ Recovery supported *and* the special snowflake memcpy rules relaxed.
Still no architectural way to discover this state, so let us just
enable memcpy_mcsafe_fast() always and let the former whitelist become
a blacklist of "this CPU requires you to do a dance to avoid some
PCC=1 cases, but that dance impacts performance".

* I'm at a loss as to why you seem to be suggesting that hardware
should / could avoid all exceptions. What else could hardware do besides
throw an exception on consumption of a naturally occurring multi-bit
ECC error? Data is gone, and only software might know how to recover.

** Unaligned access across a cacheline boundary, fast string consumption...

> > The writes can mmu-fault now that memcpy_mcsafe() is also used by
> > _copy_to_iter_mcsafe(). This allows a clean bypass of the block layer
> > in fs/dax.c in addition to the pmem driver access of poisoned memory.
> > Now that the fallback is a sane rep; movs; it can be considered for
> > plain copy_to_iter() for other user copies so you get exception
> > handling on kernel access of poison outside of persistent memory. To
> > Andy's point I think a recoverable copy (for exceptions or faults) is
> > generally useful.
>
> I think that's completely independent.
>
> If we have good reasons for having targets with exception handling,
> then that has absolutely nothing to do with machine checks or buggy
> hardware.
>
> And it sure shouldn't be called "mcsafe", since it has nothing to do
> with that situation any more.

Ok, true.

> > I understand the gripes about the mcsafe_slow() implementation, but
> > how do I implement mcsafe_fast() any better than how it is currently
> > organized given that, setting aside machine check handling,
> > memcpy_mcsafe() is the core of a copy_to_iter*() front-end that can
> > mmu-fault on either source or destination access?
>
> So honestly, once it is NOT about the broken machine check garbage,
> then it should be sold on its own independent reasons.
>
> Do we want to have a "copy_to_iter_safe" that can handle page faults?
> Because we have lots of those kinds of things, we have
>
>  - load_unaligned_zeropad()
>
>    This loads a single word knowing that the _first_ byte is valid,
> but can take an exception and zero-pad if it crosses a page boundary
>
>  - probe_kernel_read()/write()
>
>    This is a kernel memcpy() with the source/destination perhaps being unmapped.
>
>  - various perf and tracing helpers that have special semantics.
>
> but once it's about some generic interface, then it also needs to take
> other architectures into account.

For the fast case it's really just copy_user_enhanced_fast_string()
without forced stac/clac when the source and dest are both kernel
accesses.

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-20  5:08   ` Dan Williams
@ 2020-04-20 17:28     ` Linus Torvalds
  2020-04-20 18:20       ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2020-04-20 17:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Sun, Apr 19, 2020 at 10:08 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Do we have examples of doing exception handling from C? I thought all
> the exception handling copy routines were assembly routines?

You need assembler for the actual access, but that's a _single_
instruction - best done as inline asm.

The best example of something that does *exactly* what you want to do is likely

        unsafe_get_user();
        unsafe_put_user();

which basically turns into a single instruction with exception
handling, with the exception handler jumping directly to an error
label.

Ok, so right now gcc can't do that for inline asm with outputs, so it
generates fairly nasty code (a secondary register with the error state
that then causes a conditional branch on it), but that's a compiler
limitation that will eventually go away (where "eventually" means that
it already works in LLVM with experimental patches).

You could literally mis-use those helpers as-is (please don't - the
code generation is correct, but at the very least we'd have to
re-organize a bit to make it a better interface, ie have an
alternative name like "unsafe_get_kernel()" for kernel pointer
accesses).

You'd have to do the alignment guarantees yourself, but there are
examples of that in this area too (strnlen_user() does exactly that -
aligned word accesses).

So the point here is that the current interfaces are garbage, _if_ the
whole "access a single value" is actually performance-critical.

And if that is *not* the case, then the best thing to do is likely to
just use a static call. No inlining of single instructions at all,
just always use a function call, and then pick the function
appropriately.

Honestly, I can't imagine that the "single access" case is so
timing-critical that the static call isn't the right model. Your use
case is _not_ as important or common as doing user accesses.

Finally, the big question is whether the completely broken hardware
even matters. Are there actual customers that actually use the garbage
"we can crash the machine" stuff?

Because when it comes to things like nvdimms etc, the _primary_
performance target would be getting the kernel entirely out of the
way, and allowing databases etc to just access the damn thing
directly.

And if you allow user space to access it directly, then you just have
to admit that it's not a software issue any more - it's the hardware
that is terminally broken and unusable garbage. It's not even
interesting to work around things in the kernel, because user space
can just crash the machine directly.

This is why I absolutely detest that thing so much. The hardware is
_so_ fundamentally broken that I have always considered the kernel
workarounds to basically be "staging" level stuff - good enough for
some random testing of known-broken stuff, but not something that
anybody sane should ever use.

So my preference would actually be to just call the broken cases to be
largely ignored, at least from a performance angle. If you can only
access it through the kernel, the biggest performance limitation is
that you cannot do any DAX-like thing at all safely, so then the
performance of some kernel accessors is completely secondary and
meaningless. When a kernel entry/exit takes a few thousand cycles on
the broken hardware (due to _other_ bugs), what's the point about
worrying about trying to inline some single access to the nvdimm?

Did the broken hardware ever spread out into the general public?
Because if not, then the proper thing to do might be to just make it a
compile-time option for the (few) customers that signed up for testing
the initial broken stuff, and make the way _forward_ be a clean model
without the need to worry about any exceptions at all.

> The writes can mmu-fault now that memcpy_mcsafe() is also used by
> _copy_to_iter_mcsafe(). This allows a clean bypass of the block layer
> in fs/dax.c in addition to the pmem driver access of poisoned memory.
> Now that the fallback is a sane rep; movs; it can be considered for
> plain copy_to_iter() for other user copies so you get exception
> handling on kernel access of poison outside of persistent memory. To
> Andy's point I think a recoverable copy (for exceptions or faults) is
> generally useful.

I think that's completely independent.

If we have good reasons for having targets with exception handling,
then that has absolutely nothing to do with machine checks or buggy
hardware.

And it sure shouldn't be called "mcsafe", since it has nothing to do
with that situation any more.

> I understand the gripes about the mcsafe_slow() implementation, but
> how do I implement mcsafe_fast() any better than how it is currently
> organized given that, setting aside machine check handling,
> memcpy_mcsafe() is the core of a copy_to_iter*() front-end that can
> mmu-fault on either source or destination access?

So honestly, once it is NOT about the broken machine check garbage,
then it should be sold on its own independent reasons.

Do we want to have a "copy_to_iter_safe" that can handle page faults?
Because we have lots of those kinds of things, we have

 - load_unaligned_zeropad()

   This loads a single word knowing that the _first_ byte is valid,
but can take an exception and zero-pad if it crosses a page boundary

 - probe_kernel_read()/write()

   This is a kernel memcpy() with the source/destination perhaps being unmapped.

 - various perf and tracing helpers that have special semantics.

but once it's about some generic interface, then it also needs to take
other architectures into account.

               Linus

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-18 20:52 ` Linus Torvalds
@ 2020-04-20  5:08   ` Dan Williams
  2020-04-20 17:28     ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2020-04-20  5:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, X86 ML, stable,
	Borislav Petkov, H. Peter Anvin, Peter Zijlstra, Tony Luck,
	Erwin Tsaur, Linux Kernel Mailing List, linux-nvdimm

On Sat, Apr 18, 2020 at 1:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Apr 18, 2020 at 1:30 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Maybe I’m missing something obvious, but what’s the alternative?  The _mcsafe variants don’t just avoid the REP mess — they also tell the kernel that this particular access is recoverable via extable.
>
> .. which they could easily do exactly the same way the user space
> accessors do, just with a much simplified model that doesn't even care
> about multiple sizes, since unaligned accesses weren't valid anyway.
>
> The thing is, all of the MCS code has been nasty. There's no reason
> for it what-so-ever that I can tell. The hardware has been so
> incredibly broken that it's basically unusable, and most of the
> software around it seems to have been about testing.
>
> So I absolutely abhor that thing. Everything about that code has
> screamed "yeah, we completely mis-designed the hardware, we're pushing
> the problems into software, and nobody even uses it or can test it so
> there's like 5 people who care".
>
> And I'm pushing back on it, because I think that the least the code
> can do is to at least be simple.
>
> For example, none of those optimizations should exist. That function
> shouldn't have been inline to begin with. And if it really really
> matters from a performance angle that it was inline (which I doubt),
> it shouldn't have looked like a memory copy, it should have looked
> like "get_user()" (except without all the complications of actually
> having to test addresses or worry about different sizes).
>
>
> And it almost certainly shouldn't have been done in low-level asm
> either. It could have been a single "read aligned word" interface
> using an inline asm, and then everything else could have been done as
> C code around it.

Do we have examples of doing exception handling from C? I thought all
the exception handling copy routines were assembly routines?

>
> But no. The software side is almost as messy as the hardware side is.
> I hate it. And since nobody sane can test it, and the broken hardware
> is _so_ broken that nobody should ever use it, I have continually
> pushed back against this kind of ugly nasty special code.
>
> We know the writes can't fault, since they are buffered. So they
> aren't special at all.

The writes can mmu-fault now that memcpy_mcsafe() is also used by
_copy_to_iter_mcsafe(). This allows a clean bypass of the block layer
in fs/dax.c in addition to the pmem driver access of poisoned memory.
Now that the fallback is a sane rep; movs; it can be considered for
plain copy_to_iter() for other user copies so you get exception
handling on kernel access of poison outside of persistent memory. To
Andy's point I think a recoverable copy (for exceptions or faults) is
generally useful.

> We know the acceptable reads for the broken hardware basically boil
> down to a single simple word-size aligned read, so you need _one_
> special inline asm for that. The rest of the cases can be handled by
> masking and shifting if you really really need to - and done better
> that way than with byte accesses anyway.
>
> Then you have _one_ C file that implements everything using that
> single operation (ok, if people absolutely want to do sizes, I guess
> they can if they can just hide it in that one file), and you have one
> header file that exposes the interfaces to it, and you're done.
>
> And you strive hard as hell to not impact anything else, because you
> know that the hardware is unacceptable until all those special rules
> go away. Which they apparently finally have.

I understand the gripes about the mcsafe_slow() implementation, but
how do I implement mcsafe_fast() any better than how it is currently
organized given that, setting aside machine check handling,
memcpy_mcsafe() is the core of a copy_to_iter*() front-end that can
mmu-fault on either source or destination access?

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
  2020-04-18 20:30 Andy Lutomirski
@ 2020-04-18 20:52 ` Linus Torvalds
  2020-04-20  5:08   ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2020-04-18 20:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, X86 ML, stable, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, Tony Luck, Erwin Tsaur,
	Linux Kernel Mailing List, linux-nvdimm

On Sat, Apr 18, 2020 at 1:30 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> Maybe I’m missing something obvious, but what’s the alternative?  The _mcsafe variants don’t just avoid the REP mess — they also tell the kernel that this particular access is recoverable via extable.

.. which they could easily do exactly the same way the user space
accessors do, just with a much simplified model that doesn't even care
about multiple sizes, since unaligned accesses weren't valid anyway.

The thing is, all of the MCS code has been nasty. There's no reason
for it what-so-ever that I can tell. The hardware has been so
incredibly broken that it's basically unusable, and most of the
software around it seems to have been about testing.

So I absolutely abhor that thing. Everything about that code has
screamed "yeah, we completely mis-designed the hardware, we're pushing
the problems into software, and nobody even uses it or can test it so
there's like 5 people who care".

And I'm pushing back on it, because I think that the least the code
can do is to at least be simple.

For example, none of those optimizations should exist. That function
shouldn't have been inline to begin with. And if it really really
matters from a performance angle that it was inline (which I doubt),
it shouldn't have looked like a memory copy, it should have looked
like "get_user()" (except without all the complications of actually
having to test addresses or worry about different sizes).

And it almost certainly shouldn't have been done in low-level asm
either. It could have been a single "read aligned word" interface
using an inline asm, and then everything else could have been done as
C code around it.

But no. The software side is almost as messy as the hardware side is.
I hate it. And since nobody sane can test it, and the broken hardware
is _so_ broken that nobody should ever use it, I have continually
pushed back against this kind of ugly nasty special code.

We know the writes can't fault, since they are buffered. So they
aren't special at all.

We know the acceptable reads for the broken hardware basically boil
down to a single simple word-size aligned read, so you need _one_
special inline asm for that. The rest of the cases can be handled by
masking and shifting if you really really need to - and done better
that way than with byte accesses anyway.

Then you have _one_ C file that implements everything using that
single operation (ok, if people absolutely want to do sizes, I guess
they can if they can just hide it in that one file), and you have one
header file that exposes the interfaces to it, and you're done.

And you strive hard as hell to not impact anything else, because you
know that the hardware is unacceptable until all those special rules
go away. Which they apparently finally have.

                Linus

* Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
@ 2020-04-18 20:30 Andy Lutomirski
  2020-04-18 20:52 ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2020-04-18 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, X86 ML, stable, Borislav Petkov,
	H. Peter Anvin, Peter Zijlstra, Tony Luck, Erwin Tsaur,
	Linux Kernel Mailing List, linux-nvdimm



--Andy

> On Apr 18, 2020, at 12:42 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>>> On Fri, Apr 17, 2020 at 5:12 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>> 
>>> @@ -106,12 +108,10 @@ static __always_inline __must_check unsigned long
>>> memcpy_mcsafe(void *dst, const void *src, size_t cnt)
>>> {
>>> #ifdef CONFIG_X86_MCE
>>> -       if (static_branch_unlikely(&mcsafe_key))
>>> -               return __memcpy_mcsafe(dst, src, cnt);
>>> -       else
>>> +       if (static_branch_unlikely(&mcsafe_slow_key))
>>> +               return memcpy_mcsafe_slow(dst, src, cnt);
>>> #endif
>>> -               memcpy(dst, src, cnt);
>>> -       return 0;
>>> +       return memcpy_mcsafe_fast(dst, src, cnt);
>>> }
> 
> It strikes me that I see no advantages to making this an inline function at all.
> 
> Even for the good case - where it turns into just a memcpy because MCE
> is entirely disabled - it doesn't seem to matter.
> 
> The only case that really helps is when the memcpy can be turned into
> a single access. Which - and I checked - does exist, with people doing
> 
>        r = memcpy_mcsafe(&sb_seq_count, &sb(wc)->seq_count, sizeof(uint64_t));
> 
> to read a single 64-bit field which looks aligned to me.
> 
> But that code is incredible garbage anyway, since even on a broken
> machine, there's no actual reason to use the slow variant for that
> whole access that I can tell. The mcsafe copy routines do not do
> anything worthwhile for a single access.

Maybe I’m missing something obvious, but what’s the alternative?  The _mcsafe variants don’t just avoid the REP mess — they also tell the kernel that this particular access is recoverable via extable. With a regular memory access, the CPU may not explode, but do_machine_check() will, at very best, OOPS, and even that requires a certain degree of optimism.  A panic is more likely.

end of thread, other threads:[~2020-04-20 21:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-10 17:49 [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast Dan Williams
2020-04-18  0:12 ` Dan Williams
2020-04-18 19:42   ` Linus Torvalds
2020-04-18 20:30 Andy Lutomirski
2020-04-18 20:52 ` Linus Torvalds
2020-04-20  5:08   ` Dan Williams
2020-04-20 17:28     ` Linus Torvalds
2020-04-20 18:20       ` Dan Williams
2020-04-20 19:05         ` Linus Torvalds
2020-04-20 19:29           ` Dan Williams
2020-04-20 20:07             ` Linus Torvalds
2020-04-20 20:23               ` Luck, Tony
2020-04-20 20:27                 ` Linus Torvalds
2020-04-20 20:45                   ` Luck, Tony
2020-04-20 20:56                     ` Linus Torvalds
2020-04-20 20:24               ` Dan Williams
2020-04-20 20:46                 ` Linus Torvalds
2020-04-20 20:57                   ` Luck, Tony
2020-04-20 21:16                     ` Linus Torvalds
