linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v11 0/4] Machine check recovery when kernel accesses poison
@ 2016-02-17 18:20 Tony Luck
  2016-02-17 18:20 ` [PATCH v11 1/4] x86: Expand exception table to allow new handling options Tony Luck
                   ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Tony Luck @ 2016-02-17 18:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

[Resend of v11 with Boris' "Reviewed-by" tags added. For Ingo's workflow]

-Tony

Tony Luck (4):
  x86: Expand exception table to allow new handling options
  x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception
    table entries
  x86, mce: Add __mcsafe_copy()
  x86: Create a new synthetic cpu capability for machine check recovery

 Documentation/x86/exception-tables.txt    |  35 +++++++
 Documentation/x86/x86_64/boot-options.txt |   2 +
 arch/x86/include/asm/asm.h                |  40 ++++----
 arch/x86/include/asm/cpufeature.h         |   1 +
 arch/x86/include/asm/mce.h                |   1 +
 arch/x86/include/asm/string_64.h          |   8 ++
 arch/x86/include/asm/uaccess.h            |  16 ++--
 arch/x86/kernel/cpu/mcheck/mce-severity.c |  22 ++++-
 arch/x86/kernel/cpu/mcheck/mce.c          |  83 +++++++++-------
 arch/x86/kernel/kprobes/core.c            |   2 +-
 arch/x86/kernel/traps.c                   |   6 +-
 arch/x86/kernel/x8664_ksyms_64.c          |   2 +
 arch/x86/lib/memcpy_64.S                  | 151 ++++++++++++++++++++++++++++++
 arch/x86/mm/extable.c                     | 100 ++++++++++++++------
 arch/x86/mm/fault.c                       |   2 +-
 scripts/sortextable.c                     |  32 +++++++
 16 files changed, 410 insertions(+), 93 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v11 1/4] x86: Expand exception table to allow new handling options
  2016-02-17 18:20 [PATCH v11 0/4] Machine check recovery when kernel accesses poison Tony Luck
@ 2016-02-17 18:20 ` Tony Luck
  2016-02-18 10:19   ` [tip:ras/core] x86/mm: Expand the exception table logic " tip-bot for Tony Luck
  2016-02-17 18:20 ` [PATCH v11 4/4] x86: Create a new synthetic cpu capability for machine check recovery Tony Luck
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2016-02-17 18:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Huge amounts of help from  Andy Lutomirski and Borislav Petkov to
produce this. Andy provided the inspiration to add classes to the
exception table with a clever bit-squeezing trick, Boris pointed
out how much cleaner it would all be if we just had a new field.

Linus Torvalds blessed the expansion with:
  I'd rather not be clever in order to save just a tiny amount of space
  in the exception table, which isn't really criticial for anybody.

The third field is another relative function pointer, this one to a
handler that executes the actions.

We start out with three handlers:

1: Legacy - just jumps the to fixup IP
2: Fault - provide the trap number in %ax to the fixup code
3: Cleaned up legacy for the uaccess error hack

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/exception-tables.txt |  35 ++++++++++++
 arch/x86/include/asm/asm.h             |  40 +++++++------
 arch/x86/include/asm/uaccess.h         |  16 +++---
 arch/x86/kernel/kprobes/core.c         |   2 +-
 arch/x86/kernel/traps.c                |   6 +-
 arch/x86/mm/extable.c                  | 100 ++++++++++++++++++++++++---------
 arch/x86/mm/fault.c                    |   2 +-
 scripts/sortextable.c                  |  32 +++++++++++
 8 files changed, 176 insertions(+), 57 deletions(-)

diff --git a/Documentation/x86/exception-tables.txt b/Documentation/x86/exception-tables.txt
index 32901aa36f0a..fed18187a8b8 100644
--- a/Documentation/x86/exception-tables.txt
+++ b/Documentation/x86/exception-tables.txt
@@ -290,3 +290,38 @@ Due to the way that the exception table is built and needs to be ordered,
 only use exceptions for code in the .text section.  Any other section
 will cause the exception table to not be sorted correctly, and the
 exceptions will fail.
+
+Things changed when 64-bit support was added to x86 Linux. Rather than
+double the size of the exception table by expanding the two entries
+from 32-bits to 64 bits, a clever trick was used to store addresses
+as relative offsets from the table itself. The assembly code changed
+from:
+	.long 1b,3b
+to:
+        .long (from) - .
+        .long (to) - .
+
+and the C-code that uses these values converts back to absolute addresses
+like this:
+
+	ex_insn_addr(const struct exception_table_entry *x)
+	{
+		return (unsigned long)&x->insn + x->insn;
+	}
+
+In v4.5 the exception table entry was given a new field "handler".
+This is also 32-bits wide and contains a third relative function
+pointer which points to one of:
+
+1) int ex_handler_default(const struct exception_table_entry *fixup)
+   This is legacy case that just jumps to the fixup code
+2) int ex_handler_fault(const struct exception_table_entry *fixup)
+   This case provides the fault number of the trap that occurred at
+   entry->insn. It is used to distinguish page faults from machine
+   check.
+3) int ex_handler_ext(const struct exception_table_entry *fixup)
+   This case is used for uaccess_err ... we need to set a flag
+   in the task structure. Before the handler functions existed this
+   case was handled by adding a large offset to the fixup to tag
+   it as special.
+More functions can easily be added.
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 189679aba703..f5063b6659eb 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -44,19 +44,22 @@
 
 /* Exception table entry */
 #ifdef __ASSEMBLY__
-# define _ASM_EXTABLE(from,to)					\
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
 	.pushsection "__ex_table","a" ;				\
-	.balign 8 ;						\
+	.balign 4 ;						\
 	.long (from) - . ;					\
 	.long (to) - . ;					\
+	.long (handler) - . ;					\
 	.popsection
 
-# define _ASM_EXTABLE_EX(from,to)				\
-	.pushsection "__ex_table","a" ;				\
-	.balign 8 ;						\
-	.long (from) - . ;					\
-	.long (to) - . + 0x7ffffff0 ;				\
-	.popsection
+# define _ASM_EXTABLE(from, to)					\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+
+# define _ASM_EXTABLE_FAULT(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
+
+# define _ASM_EXTABLE_EX(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
 
 # define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
@@ -89,19 +92,24 @@
 	.endm
 
 #else
-# define _ASM_EXTABLE(from,to)					\
+# define _EXPAND_EXTABLE_HANDLE(x) #x
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
 	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 8\n"						\
+	" .balign 4\n"						\
 	" .long (" #from ") - .\n"				\
 	" .long (" #to ") - .\n"				\
+	" .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n"	\
 	" .popsection\n"
 
-# define _ASM_EXTABLE_EX(from,to)				\
-	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 8\n"						\
-	" .long (" #from ") - .\n"				\
-	" .long (" #to ") - . + 0x7ffffff0\n"			\
-	" .popsection\n"
+# define _ASM_EXTABLE(from, to)					\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+
+# define _ASM_EXTABLE_FAULT(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
+
+# define _ASM_EXTABLE_EX(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
+
 /* For C file, we already have NOKPROBE_SYMBOL macro */
 #endif
 
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a4a30e4b2d34..c0f27d7ea7ff 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -90,12 +90,11 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
 	likely(!__range_not_ok(addr, size, user_addr_max()))
 
 /*
- * The exception table consists of pairs of addresses relative to the
- * exception table enty itself: the first is the address of an
- * instruction that is allowed to fault, and the second is the address
- * at which the program should continue.  No registers are modified,
- * so it is entirely up to the continuation code to figure out what to
- * do.
+ * The exception table consists of triples of addresses relative to the
+ * exception table entry itself. The first address is of an instruction
+ * that is allowed to fault, the second is the target at which the program
+ * should continue. The third is a handler function to deal with the fault
+ * caused by the instruction in the first field.
  *
  * All the routines below use bits of fixup code that are out of line
  * with the main instruction path.  This means when everything is well,
@@ -104,13 +103,14 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
  */
 
 struct exception_table_entry {
-	int insn, fixup;
+	int insn, fixup, handler;
 };
 /* This is not the generic standard exception_table_entry format */
 #define ARCH_HAS_SORT_EXTABLE
 #define ARCH_HAS_SEARCH_EXTABLE
 
-extern int fixup_exception(struct pt_regs *regs);
+extern int fixup_exception(struct pt_regs *regs, int trapnr);
+extern bool ex_has_fault_handler(unsigned long ip);
 extern int early_fixup_exception(unsigned long *ip);
 
 /*
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 1deffe6cc873..0f05deeff5ce 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -988,7 +988,7 @@ int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
-		if (fixup_exception(regs))
+		if (fixup_exception(regs, trapnr))
 			return 1;
 
 		/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a46b1d..211c11c7bba4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -199,7 +199,7 @@ do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
 	}
 
 	if (!user_mode(regs)) {
-		if (!fixup_exception(regs)) {
+		if (!fixup_exception(regs, trapnr)) {
 			tsk->thread.error_code = error_code;
 			tsk->thread.trap_nr = trapnr;
 			die(str, regs, error_code);
@@ -453,7 +453,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 
 	tsk = current;
 	if (!user_mode(regs)) {
-		if (fixup_exception(regs))
+		if (fixup_exception(regs, X86_TRAP_GP))
 			return;
 
 		tsk->thread.error_code = error_code;
@@ -699,7 +699,7 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)
 	conditional_sti(regs);
 
 	if (!user_mode(regs)) {
-		if (!fixup_exception(regs)) {
+		if (!fixup_exception(regs, trapnr)) {
 			task->thread.error_code = error_code;
 			task->thread.trap_nr = trapnr;
 			die(str, regs, error_code);
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 903ec1e9c326..9dd7e4b7fcde 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -3,6 +3,9 @@
 #include <linux/sort.h>
 #include <asm/uaccess.h>
 
+typedef bool (*ex_handler_t)(const struct exception_table_entry *,
+			    struct pt_regs *, int);
+
 static inline unsigned long
 ex_insn_addr(const struct exception_table_entry *x)
 {
@@ -13,11 +16,56 @@ ex_fixup_addr(const struct exception_table_entry *x)
 {
 	return (unsigned long)&x->fixup + x->fixup;
 }
+static inline ex_handler_t
+ex_fixup_handler(const struct exception_table_entry *x)
+{
+	return (ex_handler_t)((unsigned long)&x->handler + x->handler);
+}
 
-int fixup_exception(struct pt_regs *regs)
+bool ex_handler_default(const struct exception_table_entry *fixup,
+		       struct pt_regs *regs, int trapnr)
 {
-	const struct exception_table_entry *fixup;
-	unsigned long new_ip;
+	regs->ip = ex_fixup_addr(fixup);
+	return true;
+}
+EXPORT_SYMBOL(ex_handler_default);
+
+bool ex_handler_fault(const struct exception_table_entry *fixup,
+		     struct pt_regs *regs, int trapnr)
+{
+	regs->ip = ex_fixup_addr(fixup);
+	regs->ax = trapnr;
+	return true;
+}
+EXPORT_SYMBOL_GPL(ex_handler_fault);
+
+bool ex_handler_ext(const struct exception_table_entry *fixup,
+		   struct pt_regs *regs, int trapnr)
+{
+	/* Special hack for uaccess_err */
+	current_thread_info()->uaccess_err = 1;
+	regs->ip = ex_fixup_addr(fixup);
+	return true;
+}
+EXPORT_SYMBOL(ex_handler_ext);
+
+bool ex_has_fault_handler(unsigned long ip)
+{
+	const struct exception_table_entry *e;
+	ex_handler_t handler;
+
+	e = search_exception_tables(ip);
+	if (!e)
+		return false;
+	handler = ex_fixup_handler(e);
+
+	return handler == ex_handler_fault;
+}
+
+int fixup_exception(struct pt_regs *regs, int trapnr)
+{
+	const struct exception_table_entry *e;
+	ex_handler_t handler;
 
 #ifdef CONFIG_PNPBIOS
 	if (unlikely(SEGMENT_IS_PNP_CODE(regs->cs))) {
@@ -33,42 +81,34 @@ int fixup_exception(struct pt_regs *regs)
 	}
 #endif
 
-	fixup = search_exception_tables(regs->ip);
-	if (fixup) {
-		new_ip = ex_fixup_addr(fixup);
-
-		if (fixup->fixup - fixup->insn >= 0x7ffffff0 - 4) {
-			/* Special hack for uaccess_err */
-			current_thread_info()->uaccess_err = 1;
-			new_ip -= 0x7ffffff0;
-		}
-		regs->ip = new_ip;
-		return 1;
-	}
+	e = search_exception_tables(regs->ip);
+	if (!e)
+		return 0;
 
-	return 0;
+	handler = ex_fixup_handler(e);
+	return handler(e, regs, trapnr);
 }
 
 /* Restricted version used during very early boot */
 int __init early_fixup_exception(unsigned long *ip)
 {
-	const struct exception_table_entry *fixup;
+	const struct exception_table_entry *e;
 	unsigned long new_ip;
+	ex_handler_t handler;
 
-	fixup = search_exception_tables(*ip);
-	if (fixup) {
-		new_ip = ex_fixup_addr(fixup);
+	e = search_exception_tables(*ip);
+	if (!e)
+		return 0;
 
-		if (fixup->fixup - fixup->insn >= 0x7ffffff0 - 4) {
-			/* uaccess handling not supported during early boot */
-			return 0;
-		}
+	new_ip  = ex_fixup_addr(e);
+	handler = ex_fixup_handler(e);
 
-		*ip = new_ip;
-		return 1;
-	}
+	/* special handling not supported during early boot */
+	if (handler != ex_handler_default)
+		return 0;
 
-	return 0;
+	*ip = new_ip;
+	return 1;
 }
 
 /*
@@ -133,6 +173,8 @@ void sort_extable(struct exception_table_entry *start,
 		i += 4;
 		p->fixup += i;
 		i += 4;
+		p->handler += i;
+		i += 4;
 	}
 
 	sort(start, finish - start, sizeof(struct exception_table_entry),
@@ -145,6 +187,8 @@ void sort_extable(struct exception_table_entry *start,
 		i += 4;
 		p->fixup -= i;
 		i += 4;
+		p->handler -= i;
+		i += 4;
 	}
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..495946c3f9dd 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -656,7 +656,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	int sig;
 
 	/* Are we prepared to handle this kernel fault? */
-	if (fixup_exception(regs)) {
+	if (fixup_exception(regs, X86_TRAP_PF)) {
 		/*
 		 * Any interrupt that takes a fault gets the fixup. This makes
 		 * the below recursive fault logic only apply to a faults from
diff --git a/scripts/sortextable.c b/scripts/sortextable.c
index c2423d913b46..7b29fb14f870 100644
--- a/scripts/sortextable.c
+++ b/scripts/sortextable.c
@@ -209,6 +209,35 @@ static int compare_relative_table(const void *a, const void *b)
 	return 0;
 }
 
+static void x86_sort_relative_table(char *extab_image, int image_size)
+{
+	int i;
+
+	i = 0;
+	while (i < image_size) {
+		uint32_t *loc = (uint32_t *)(extab_image + i);
+
+		w(r(loc) + i, loc);
+		w(r(loc + 1) + i + 4, loc + 1);
+		w(r(loc + 2) + i + 8, loc + 2);
+
+		i += sizeof(uint32_t) * 3;
+	}
+
+	qsort(extab_image, image_size / 12, 12, compare_relative_table);
+
+	i = 0;
+	while (i < image_size) {
+		uint32_t *loc = (uint32_t *)(extab_image + i);
+
+		w(r(loc) - i, loc);
+		w(r(loc + 1) - (i + 4), loc + 1);
+		w(r(loc + 2) - (i + 8), loc + 2);
+
+		i += sizeof(uint32_t) * 3;
+	}
+}
+
 static void sort_relative_table(char *extab_image, int image_size)
 {
 	int i;
@@ -281,6 +310,9 @@ do_file(char const *const fname)
 		break;
 	case EM_386:
 	case EM_X86_64:
+		custom_sort = x86_sort_relative_table;
+		break;
+
 	case EM_S390:
 		custom_sort = sort_relative_table;
 		break;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v11 2/4] x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
  2016-02-17 18:20 [PATCH v11 0/4] Machine check recovery when kernel accesses poison Tony Luck
                   ` (2 preceding siblings ...)
  2016-02-17 18:20 ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Tony Luck
@ 2016-02-17 18:20 ` Tony Luck
  2016-02-18 10:19   ` [tip:ras/core] x86/mce: " tip-bot for Tony Luck
  3 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2016-02-17 18:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Extend the severity checking code to add a new context IN_KERN_RECOV
which is used to indicate that the machine check was triggered by code
in the kernel tagged with _ASM_EXTABLE_FAULT() so that the ex_handler_fault()
handler will provide the fixup code with the trap number.

Major re-work to the tail code in do_machine_check() to make all this
readable/maintainable. One functional change is that tolerant=3 no longer
stops recovery actions. Revert to only skipping sending SIGBUS to the
current process.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++-
 arch/x86/kernel/cpu/mcheck/mce.c          | 70 ++++++++++++++++---------------
 2 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 9c682c222071..5119766d9889 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/debugfs.h>
 #include <asm/mce.h>
+#include <asm/uaccess.h>
 
 #include "mce-internal.h"
 
@@ -29,7 +30,7 @@
  * panic situations)
  */
 
-enum context { IN_KERNEL = 1, IN_USER = 2 };
+enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };
 enum ser { SER_REQUIRED = 1, NO_SER = 2 };
 enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 };
 
@@ -48,6 +49,7 @@ static struct severity {
 #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
 #define  KERNEL		.context = IN_KERNEL
 #define  USER		.context = IN_USER
+#define  KERNEL_RECOV	.context = IN_KERNEL_RECOV
 #define  SER		.ser = SER_REQUIRED
 #define  NOSER		.ser = NO_SER
 #define  EXCP		.excp = EXCP_CONTEXT
@@ -87,6 +89,10 @@ static struct severity {
 		EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0)
 		),
 	MCESEV(
+		PANIC, "In kernel and no restart IP",
+		EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0)
+		),
+	MCESEV(
 		DEFERRED, "Deferred error",
 		NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED)
 		),
@@ -123,6 +129,11 @@ static struct severity {
 		MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV)
 		),
 	MCESEV(
+		AR, "Action required: data load in error recoverable area of kernel",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+		KERNEL_RECOV
+		),
+	MCESEV(
 		AR, "Action required: data load error in a user process",
 		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
 		USER
@@ -170,6 +181,9 @@ static struct severity {
 		)	/* always matches. keep at end */
 };
 
+#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \
+				(MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+
 /*
  * If mcgstatus indicated that ip/cs on the stack were
  * no good, then "m->cs" will be zero and we will have
@@ -183,7 +197,11 @@ static struct severity {
  */
 static int error_context(struct mce *m)
 {
-	return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
+	if ((m->cs & 3) == 3)
+		return IN_USER;
+	if (mc_recoverable(m->mcgstatus) && ex_has_fault_handler(m->ip))
+		return IN_KERNEL_RECOV;
+	return IN_KERNEL;
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a006f4cd792b..905f3070f412 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -961,6 +961,20 @@ static void mce_clear_state(unsigned long *toclear)
 	}
 }
 
+static int do_memory_failure(struct mce *m)
+{
+	int flags = MF_ACTION_REQUIRED;
+	int ret;
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
+	if (!(m->mcgstatus & MCG_STATUS_RIPV))
+		flags |= MF_MUST_KILL;
+	ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags);
+	if (ret)
+		pr_err("Memory error not recovered");
+	return ret;
+}
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
@@ -998,8 +1012,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	char *msg = "Unknown";
-	u64 recover_paddr = ~0ull;
-	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
 	/* If this CPU is offline, just bail out. */
@@ -1136,22 +1148,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	}
 
 	/*
-	 * At insane "tolerant" levels we take no action. Otherwise
-	 * we only die if we have no other choice. For less serious
-	 * issues we try to recover, or limit damage to the current
-	 * process.
+	 * If tolerant is at an insane level we drop requests to kill
+	 * processes and continue even when there is no way out.
 	 */
-	if (cfg->tolerant < 3) {
-		if (no_way_out)
-			mce_panic("Fatal machine check on current CPU", &m, msg);
-		if (worst == MCE_AR_SEVERITY) {
-			recover_paddr = m.addr;
-			if (!(m.mcgstatus & MCG_STATUS_RIPV))
-				flags |= MF_MUST_KILL;
-		} else if (kill_it) {
-			force_sig(SIGBUS, current);
-		}
-	}
+	if (cfg->tolerant == 3)
+		kill_it = 0;
+	else if (no_way_out)
+		mce_panic("Fatal machine check on current CPU", &m, msg);
 
 	if (worst > 0)
 		mce_report_event(regs);
@@ -1159,25 +1162,24 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 out:
 	sync_core();
 
-	if (recover_paddr == ~0ull)
-		goto done;
+	if (worst != MCE_AR_SEVERITY && !kill_it)
+		goto out_ist;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx",
-		 recover_paddr);
-	/*
-	 * We must call memory_failure() here even if the current process is
-	 * doomed. We still need to mark the page as poisoned and alert any
-	 * other users of the page.
-	 */
-	ist_begin_non_atomic(regs);
-	local_irq_enable();
-	if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
-		pr_err("Memory error not recovered");
-		force_sig(SIGBUS, current);
+	/* Fault was in user mode and we need to take some action */
+	if ((m.cs & 3) == 3) {
+		ist_begin_non_atomic(regs);
+		local_irq_enable();
+
+		if (kill_it || do_memory_failure(&m))
+			force_sig(SIGBUS, current);
+		local_irq_disable();
+		ist_end_non_atomic();
+	} else {
+		if (!fixup_exception(regs, X86_TRAP_MC))
+			mce_panic("Failed kernel mode recovery", &m, NULL);
 	}
-	local_irq_disable();
-	ist_end_non_atomic();
-done:
+
+out_ist:
 	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-17 18:20 [PATCH v11 0/4] Machine check recovery when kernel accesses poison Tony Luck
  2016-02-17 18:20 ` [PATCH v11 1/4] x86: Expand exception table to allow new handling options Tony Luck
  2016-02-17 18:20 ` [PATCH v11 4/4] x86: Create a new synthetic cpu capability for machine check recovery Tony Luck
@ 2016-02-17 18:20 ` Tony Luck
  2016-02-18  8:21   ` Ingo Molnar
  2016-02-18 18:12   ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Linus Torvalds
  2016-02-17 18:20 ` [PATCH v11 2/4] x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries Tony Luck
  3 siblings, 2 replies; 52+ messages in thread
From: Tony Luck @ 2016-02-17 18:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Make use of the EXTABLE_FAULT exception table entries. This routine
returns a structure to indicate the result of the copy:

struct mcsafe_ret {
        u64 trapnr;
        u64 remain;
};

If the copy is successful, then both 'trapnr' and 'remain' are zero.

If we faulted during the copy, then 'trapnr' will say which type
of trap (X86_TRAP_PF or X86_TRAP_MC) and 'remain' says how many
bytes were not copied.

Note that this is probably the first of several copy functions.
We can make new ones for non-temporal cache handling etc.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/string_64.h |   8 +++
 arch/x86/kernel/x8664_ksyms_64.c |   2 +
 arch/x86/lib/memcpy_64.S         | 151 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 161 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ff8b9a17dc4b..5b24039463a4 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -78,6 +78,14 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+struct mcsafe_ret {
+	u64 trapnr;
+	u64 remain;
+};
+
+struct mcsafe_ret __mcsafe_copy(void *dst, const void __user *src, size_t cnt);
+extern void __mcsafe_copy_end(void);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index a0695be19864..fff245462a8c 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
 EXPORT_SYMBOL(_copy_from_user);
 EXPORT_SYMBOL(_copy_to_user);
 
+EXPORT_SYMBOL_GPL(__mcsafe_copy);
+
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 16698bba87de..7f967a9ed0e4 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -177,3 +177,154 @@ ENTRY(memcpy_orig)
 .Lend:
 	retq
 ENDPROC(memcpy_orig)
+
+#ifndef CONFIG_UML
+/*
+ * __mcsafe_copy - memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ */
+ENTRY(__mcsafe_copy)
+	cmpl $8,%edx
+	jb 20f		/* less then 8 bytes, go to byte copy loop */
+
+	/* check for bad alignment of source */
+	testl $7,%esi
+	/* already aligned */
+	jz 102f
+
+	/* copy one byte at a time until source is 8-byte aligned */
+	movl %esi,%ecx
+	andl $7,%ecx
+	subl $8,%ecx
+	negl %ecx
+	subl %ecx,%edx
+0:	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz 0b
+
+102:
+	/* Figure out how many whole cache lines (64-bytes) to copy */
+	movl %edx,%ecx
+	andl $63,%edx
+	shrl $6,%ecx
+	jz 17f
+
+	/* Loop copying whole cache lines */
+1:	movq (%rsi),%r8
+2:	movq 1*8(%rsi),%r9
+3:	movq 2*8(%rsi),%r10
+4:	movq 3*8(%rsi),%r11
+	movq %r8,(%rdi)
+	movq %r9,1*8(%rdi)
+	movq %r10,2*8(%rdi)
+	movq %r11,3*8(%rdi)
+9:	movq 4*8(%rsi),%r8
+10:	movq 5*8(%rsi),%r9
+11:	movq 6*8(%rsi),%r10
+12:	movq 7*8(%rsi),%r11
+	movq %r8,4*8(%rdi)
+	movq %r9,5*8(%rdi)
+	movq %r10,6*8(%rdi)
+	movq %r11,7*8(%rdi)
+	leaq 64(%rsi),%rsi
+	leaq 64(%rdi),%rdi
+	decl %ecx
+	jnz 1b
+
+	/* Are there any trailing 8-byte words? */
+17:	movl %edx,%ecx
+	andl $7,%edx
+	shrl $3,%ecx
+	jz 20f
+
+	/* Copy trailing words */
+18:	movq (%rsi),%r8
+	mov %r8,(%rdi)
+	leaq 8(%rsi),%rsi
+	leaq 8(%rdi),%rdi
+	decl %ecx
+	jnz 18b
+
+	/* Any trailing bytes? */
+20:	andl %edx,%edx
+	jz 23f
+
+	/* copy trailing bytes */
+	movl %edx,%ecx
+21:	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz 21b
+
+	/* Copy successful. Return .remain = 0, .trapnr = 0 */
+23:	xorq %rax, %rax
+	xorq %rdx, %rdx
+	ret
+
+	.section .fixup,"ax"
+	/*
+	 * machine check handler loaded %rax with trap number
+	 * We just need to make sure %edx has the number of
+	 * bytes remaining
+	 */
+30:
+	add %ecx,%edx
+	ret
+31:
+	shl $6,%ecx
+	add %ecx,%edx
+	ret
+32:
+	shl $6,%ecx
+	lea -8(%ecx,%edx),%edx
+	ret
+33:
+	shl $6,%ecx
+	lea -16(%ecx,%edx),%edx
+	ret
+34:
+	shl $6,%ecx
+	lea -24(%ecx,%edx),%edx
+	ret
+35:
+	shl $6,%ecx
+	lea -32(%ecx,%edx),%edx
+	ret
+36:
+	shl $6,%ecx
+	lea -40(%ecx,%edx),%edx
+	ret
+37:
+	shl $6,%ecx
+	lea -48(%ecx,%edx),%edx
+	ret
+38:
+	shl $6,%ecx
+	lea -56(%ecx,%edx),%edx
+	ret
+39:
+	lea (%rdx,%rcx,8),%rdx
+	ret
+40:
+	mov %ecx,%edx
+	ret
+	.previous
+
+	_ASM_EXTABLE_FAULT(0b,30b)
+	_ASM_EXTABLE_FAULT(1b,31b)
+	_ASM_EXTABLE_FAULT(2b,32b)
+	_ASM_EXTABLE_FAULT(3b,33b)
+	_ASM_EXTABLE_FAULT(4b,34b)
+	_ASM_EXTABLE_FAULT(9b,35b)
+	_ASM_EXTABLE_FAULT(10b,36b)
+	_ASM_EXTABLE_FAULT(11b,37b)
+	_ASM_EXTABLE_FAULT(12b,38b)
+	_ASM_EXTABLE_FAULT(18b,39b)
+	_ASM_EXTABLE_FAULT(21b,40b)
+#endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v11 4/4] x86: Create a new synthetic cpu capability for machine check recovery
  2016-02-17 18:20 [PATCH v11 0/4] Machine check recovery when kernel accesses poison Tony Luck
  2016-02-17 18:20 ` [PATCH v11 1/4] x86: Expand exception table to allow new handling options Tony Luck
@ 2016-02-17 18:20 ` Tony Luck
  2016-02-18 10:19   ` [tip:x86/asm] x86/cpufeature: " tip-bot for Tony Luck
  2016-02-17 18:20 ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Tony Luck
  2016-02-17 18:20 ` [PATCH v11 2/4] x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries Tony Luck
  3 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2016-02-17 18:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

The Intel Software Developer Manual describes bit 24 in the MCG_CAP
MSR:
   MCG_SER_P (software error recovery support present) flag,
   bit 24 — Indicates (when set) that the processor supports
   software error recovery
But only some models with this capability bit set will actually
generate recoverable machine checks.

Check the model name and set a synthetic capability bit. Provide
a command line option to set this bit anyway in case the kernel
doesn't recognise the model name.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/x86_64/boot-options.txt |  2 ++
 arch/x86/include/asm/cpufeature.h         |  1 +
 arch/x86/include/asm/mce.h                |  1 +
 arch/x86/kernel/cpu/mcheck/mce.c          | 13 +++++++++++++
 4 files changed, 17 insertions(+)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 68ed3114c363..0965a71f9942 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -60,6 +60,8 @@ Machine check
 		threshold to 1. Enabling this may make memory predictive failure
 		analysis less effective if the bios sets thresholds for memory
 		errors since we will not see details for all errors.
+   mce=recovery
+		Force-enable recoverable machine check code paths
 
    nomce (for compatibility with i386): same as mce=off
 
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ad8c9464297..06c6c2d2fea0 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -106,6 +106,7 @@
 #define X86_FEATURE_APERFMPERF	( 3*32+28) /* APERFMPERF */
 #define X86_FEATURE_EAGER_FPU	( 3*32+29) /* "eagerfpu" Non lazy FPU restore */
 #define X86_FEATURE_NONSTOP_TSC_S3 ( 3*32+30) /* TSC doesn't stop in S3 state */
+#define X86_FEATURE_MCE_RECOVERY ( 3*32+31) /* cpu has recoverable machine checks */
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	( 4*32+ 0) /* "pni" SSE-3 */
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 2ea4527e462f..18d2ba9c8e44 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -113,6 +113,7 @@ struct mca_config {
 	bool ignore_ce;
 	bool disabled;
 	bool ser;
+	bool recovery;
 	bool bios_cmci_threshold;
 	u8 banks;
 	s8 bootlog;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 905f3070f412..15ff6f07bd92 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1578,6 +1578,17 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+		/*
+		 * MCG_CAP.MCG_SER_P is necessary but not sufficient to know
+		 * whether this processor will actually generate recoverable
+		 * machine checks. Check to see if this is an E7 model Xeon.
+		 * We can't do a model number check because E5 and E7 use the
+		 * same model number. E5 doesn't support recovery, E7 does.
+		 */
+		if (mca_cfg.recovery || (mca_cfg.ser &&
+			!strncmp(c->x86_model_id,
+				 "Intel(R) Xeon(R) CPU E7-", 24)))
+			set_cpu_cap(c, X86_FEATURE_MCE_RECOVERY);
 	}
 	if (cfg->monarch_timeout < 0)
 		cfg->monarch_timeout = 0;
@@ -2030,6 +2041,8 @@ static int __init mcheck_enable(char *str)
 		cfg->bootlog = (str[0] == 'b');
 	else if (!strcmp(str, "bios_cmci_threshold"))
 		cfg->bios_cmci_threshold = true;
+	else if (!strcmp(str, "recovery"))
+		cfg->recovery = true;
 	else if (isdigit(str[0])) {
 		if (get_option(&str, &cfg->tolerant) == 2)
 			get_option(&str, &(cfg->monarch_timeout));
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-17 18:20 ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Tony Luck
@ 2016-02-18  8:21   ` Ingo Molnar
  2016-02-18  9:59     ` Peter Zijlstra
                       ` (2 more replies)
  2016-02-18 18:12   ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Linus Torvalds
  1 sibling, 3 replies; 52+ messages in thread
From: Ingo Molnar @ 2016-02-18  8:21 UTC (permalink / raw)
  To: Tony Luck
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Peter Zijlstra


* Tony Luck <tony.luck@intel.com> wrote:

> Make use of the EXTABLE_FAULT exception table entries. This routine
> returns a structure to indicate the result of the copy:

So the series looks good to me, but I have some (mostly readability) comments that 
went beyond what I usually fix up manually:

> struct mcsafe_ret {
>         u64 trapnr;
>         u64 remain;
> };

> +struct mcsafe_ret {
> +	u64 trapnr;
> +	u64 remain;
> +};

Yeah, so please change this to something like:

  struct mcsafe_ret {
          u64 trap_nr;
          u64 bytes_left;
  };

this makes it crystal clear what the fields are about and what their unit is. 
Readability is king and modern consoles are wide enough, no need to abbreviate 
excessively.

> +struct mcsafe_ret __mcsafe_copy(void *dst, const void __user *src, size_t cnt);
> +extern void __mcsafe_copy_end(void);

So this is a bad name I think. What kind of 'copy' is this? It's defined in 
asm/string_64.h - so people might thing it's a string copy. If it's a memcpy 
variant then name it so.

Also, I'd suggest we postfix the new mcsafe functions with '_mcsafe', not prefix 
them. Special properties of memcpy routines are usually postfixes - such as 
_nocache(), _toio(), etc.

> --- a/arch/x86/kernel/x8664_ksyms_64.c
> +++ b/arch/x86/kernel/x8664_ksyms_64.c
> @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
>  EXPORT_SYMBOL(_copy_from_user);
>  EXPORT_SYMBOL(_copy_to_user);
>  
> +EXPORT_SYMBOL_GPL(__mcsafe_copy);
> +
>  EXPORT_SYMBOL(copy_page);
>  EXPORT_SYMBOL(clear_page);
>  
> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> index 16698bba87de..7f967a9ed0e4 100644
> --- a/arch/x86/lib/memcpy_64.S
> +++ b/arch/x86/lib/memcpy_64.S
> @@ -177,3 +177,154 @@ ENTRY(memcpy_orig)
>  .Lend:
>  	retq
>  ENDPROC(memcpy_orig)
> +
> +#ifndef CONFIG_UML

Why is this UML quirk needed? No other memcpy functions have it. Theoretically UML 
could introduce the notion of #MC interruption.

> +/*
> + * __mcsafe_copy - memory copy with machine check exception handling
> + * Note that we only catch machine checks when reading the source addresses.
> + * Writes to target are posted and don't generate machine checks.
> + */
> +ENTRY(__mcsafe_copy)
> +	cmpl $8,%edx
> +	jb 20f		/* less then 8 bytes, go to byte copy loop */
> +
> +	/* check for bad alignment of source */
> +	testl $7,%esi
> +	/* already aligned */
> +	jz 102f
> +
> +	/* copy one byte at a time until source is 8-byte aligned */
> +	movl %esi,%ecx
> +	andl $7,%ecx
> +	subl $8,%ecx
> +	negl %ecx
> +	subl %ecx,%edx
> +0:	movb (%rsi),%al
> +	movb %al,(%rdi)
> +	incq %rsi
> +	incq %rdi
> +	decl %ecx
> +	jnz 0b
> +
> +102:
> +	/* Figure out how many whole cache lines (64-bytes) to copy */
> +	movl %edx,%ecx
> +	andl $63,%edx
> +	shrl $6,%ecx
> +	jz 17f

Please don't use numeric labels in new assembly code, use descriptively named 
local labels:

  .L_do_stuff:

numeric labels are generally unfriendly against future changes. They are the GOTO 
numeric labels of BASIC.

> +
> +	/* Loop copying whole cache lines */
> +1:	movq (%rsi),%r8
> +2:	movq 1*8(%rsi),%r9
> +3:	movq 2*8(%rsi),%r10
> +4:	movq 3*8(%rsi),%r11
> +	movq %r8,(%rdi)
> +	movq %r9,1*8(%rdi)
> +	movq %r10,2*8(%rdi)
> +	movq %r11,3*8(%rdi)
> +9:	movq 4*8(%rsi),%r8
> +10:	movq 5*8(%rsi),%r9
> +11:	movq 6*8(%rsi),%r10
> +12:	movq 7*8(%rsi),%r11
> +	movq %r8,4*8(%rdi)
> +	movq %r9,5*8(%rdi)
> +	movq %r10,6*8(%rdi)
> +	movq %r11,7*8(%rdi)
> +	leaq 64(%rsi),%rsi
> +	leaq 64(%rdi),%rdi
> +	decl %ecx
> +	jnz 1b
> +
> +	/* Are there any trailing 8-byte words? */
> +17:	movl %edx,%ecx
> +	andl $7,%edx
> +	shrl $3,%ecx
> +	jz 20f
> +
> +	/* Copy trailing words */
> +18:	movq (%rsi),%r8
> +	mov %r8,(%rdi)
> +	leaq 8(%rsi),%rsi
> +	leaq 8(%rdi),%rdi
> +	decl %ecx
> +	jnz 18b
> +
> +	/* Any trailing bytes? */
> +20:	andl %edx,%edx
> +	jz 23f
> +
> +	/* copy trailing bytes */
> +	movl %edx,%ecx
> +21:	movb (%rsi),%al
> +	movb %al,(%rdi)
> +	incq %rsi
> +	incq %rdi
> +	decl %ecx
> +	jnz 21b
> +
> +	/* Copy successful. Return .remain = 0, .trapnr = 0 */
> +23:	xorq %rax, %rax
> +	xorq %rdx, %rdx
> +	ret
> +
> +	.section .fixup,"ax"
> +	/*
> +	 * machine check handler loaded %rax with trap number
> +	 * We just need to make sure %edx has the number of
> +	 * bytes remaining
> +	 */

Please use consistent capitalization in comments and punctuate sentences where 
there's more than one of them.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18  8:21   ` Ingo Molnar
@ 2016-02-18  9:59     ` Peter Zijlstra
  2016-02-18 10:19       ` Ingo Molnar
  2016-02-19  7:58       ` Ingo Molnar
  2016-02-18 10:29     ` Borislav Petkov
  2016-02-18 21:14     ` [PATCH v12] x86, mce: Add memcpy_trap() Luck, Tony
  2 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2016-02-18  9:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton

On Thu, Feb 18, 2016 at 09:21:07AM +0100, Ingo Molnar wrote:
> 
> * Tony Luck <tony.luck@intel.com> wrote:
> 
> > Make use of the EXTABLE_FAULT exception table entries. This routine
> > returns a structure to indicate the result of the copy:
> 
> So the series looks good to me, but I have some (mostly readability) comments that 
> went beyond what I usually fix up manually:
> 
> > struct mcsafe_ret {
> >         u64 trapnr;
> >         u64 remain;
> > };
> 
> > +struct mcsafe_ret {
> > +	u64 trapnr;
> > +	u64 remain;
> > +};
> 
> Yeah, so please change this to something like:
> 
>   struct mcsafe_ret {
>           u64 trap_nr;
>           u64 bytes_left;
>   };
> 
> this makes it crystal clear what the fields are about and what their unit is. 
> Readability is king and modern consoles are wide enough, no need to abbreviate 
> excessively.

I prefer to use my modern console width to display multiple columns of
text, instead of wasting it to display mostly whitespace. Therefore I
still very much prefer ~80 char wide code.

> > +struct mcsafe_ret __mcsafe_copy(void *dst, const void __user *src, size_t cnt);
> > +extern void __mcsafe_copy_end(void);
> 
> So this is a bad name I think. What kind of 'copy' is this? It's defined in 
> asm/string_64.h - so people might thing it's a string copy. If it's a memcpy 
> variant then name it so.
> 
> Also, I'd suggest we postfix the new mcsafe functions with '_mcsafe', not prefix 
> them. Special properties of memcpy routines are usually postfixes - such as 
> _nocache(), _toio(), etc.

I think the whole notion of mcsafe here is 'wrong'. This copy variant
simply reports the kind of trap that happened (#PF or #MC) and could
arguably be extended to include more types if the hardware were to
generate more.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [tip:ras/core] x86/mm: Expand the exception table logic to allow new handling options
  2016-02-17 18:20 ` [PATCH v11 1/4] x86: Expand exception table to allow new handling options Tony Luck
@ 2016-02-18 10:19   ` tip-bot for Tony Luck
  0 siblings, 0 replies; 52+ messages in thread
From: tip-bot for Tony Luck @ 2016-02-18 10:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, tony.luck, mingo, bp, linux-kernel, torvalds, hpa

Commit-ID:  548acf19234dbda5a52d5a8e7e205af46e9da840
Gitweb:     http://git.kernel.org/tip/548acf19234dbda5a52d5a8e7e205af46e9da840
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Wed, 17 Feb 2016 10:20:12 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:21:46 +0100

x86/mm: Expand the exception table logic to allow new handling options

Huge amounts of help from  Andy Lutomirski and Borislav Petkov to
produce this. Andy provided the inspiration to add classes to the
exception table with a clever bit-squeezing trick, Boris pointed
out how much cleaner it would all be if we just had a new field.

Linus Torvalds blessed the expansion with:

  ' I'd rather not be clever in order to save just a tiny amount of space
    in the exception table, which isn't really criticial for anybody. '

The third field is another relative function pointer, this one to a
handler that executes the actions.

We start out with three handlers:

 1: Legacy - just jumps the to fixup IP
 2: Fault - provide the trap number in %ax to the fixup code
 3: Cleaned up legacy for the uaccess error hack

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/x86/exception-tables.txt |  35 ++++++++++++
 arch/x86/include/asm/asm.h             |  40 +++++++------
 arch/x86/include/asm/uaccess.h         |  16 +++---
 arch/x86/kernel/kprobes/core.c         |   2 +-
 arch/x86/kernel/traps.c                |   6 +-
 arch/x86/mm/extable.c                  | 100 ++++++++++++++++++++++++---------
 arch/x86/mm/fault.c                    |   2 +-
 scripts/sortextable.c                  |  32 +++++++++++
 8 files changed, 176 insertions(+), 57 deletions(-)

diff --git a/Documentation/x86/exception-tables.txt b/Documentation/x86/exception-tables.txt
index 32901aa..e396bcd 100644
--- a/Documentation/x86/exception-tables.txt
+++ b/Documentation/x86/exception-tables.txt
@@ -290,3 +290,38 @@ Due to the way that the exception table is built and needs to be ordered,
 only use exceptions for code in the .text section.  Any other section
 will cause the exception table to not be sorted correctly, and the
 exceptions will fail.
+
+Things changed when 64-bit support was added to x86 Linux. Rather than
+double the size of the exception table by expanding the two entries
+from 32-bits to 64 bits, a clever trick was used to store addresses
+as relative offsets from the table itself. The assembly code changed
+from:
+	.long 1b,3b
+to:
+        .long (from) - .
+        .long (to) - .
+
+and the C-code that uses these values converts back to absolute addresses
+like this:
+
+	ex_insn_addr(const struct exception_table_entry *x)
+	{
+		return (unsigned long)&x->insn + x->insn;
+	}
+
+In v4.6 the exception table entry was expanded with a new field "handler".
+This is also 32-bits wide and contains a third relative function
+pointer which points to one of:
+
+1) int ex_handler_default(const struct exception_table_entry *fixup)
+   This is legacy case that just jumps to the fixup code
+2) int ex_handler_fault(const struct exception_table_entry *fixup)
+   This case provides the fault number of the trap that occurred at
+   entry->insn. It is used to distinguish page faults from machine
+   check.
+3) int ex_handler_ext(const struct exception_table_entry *fixup)
+   This case is used for uaccess_err ... we need to set a flag
+   in the task structure. Before the handler functions existed this
+   case was handled by adding a large offset to the fixup to tag
+   it as special.
+More functions can easily be added.
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 189679a..f5063b6 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -44,19 +44,22 @@
 
 /* Exception table entry */
 #ifdef __ASSEMBLY__
-# define _ASM_EXTABLE(from,to)					\
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
 	.pushsection "__ex_table","a" ;				\
-	.balign 8 ;						\
+	.balign 4 ;						\
 	.long (from) - . ;					\
 	.long (to) - . ;					\
+	.long (handler) - . ;					\
 	.popsection
 
-# define _ASM_EXTABLE_EX(from,to)				\
-	.pushsection "__ex_table","a" ;				\
-	.balign 8 ;						\
-	.long (from) - . ;					\
-	.long (to) - . + 0x7ffffff0 ;				\
-	.popsection
+# define _ASM_EXTABLE(from, to)					\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+
+# define _ASM_EXTABLE_FAULT(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
+
+# define _ASM_EXTABLE_EX(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
 
 # define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
@@ -89,19 +92,24 @@
 	.endm
 
 #else
-# define _ASM_EXTABLE(from,to)					\
+# define _EXPAND_EXTABLE_HANDLE(x) #x
+# define _ASM_EXTABLE_HANDLE(from, to, handler)			\
 	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 8\n"						\
+	" .balign 4\n"						\
 	" .long (" #from ") - .\n"				\
 	" .long (" #to ") - .\n"				\
+	" .long (" _EXPAND_EXTABLE_HANDLE(handler) ") - .\n"	\
 	" .popsection\n"
 
-# define _ASM_EXTABLE_EX(from,to)				\
-	" .pushsection \"__ex_table\",\"a\"\n"			\
-	" .balign 8\n"						\
-	" .long (" #from ") - .\n"				\
-	" .long (" #to ") - . + 0x7ffffff0\n"			\
-	" .popsection\n"
+# define _ASM_EXTABLE(from, to)					\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)
+
+# define _ASM_EXTABLE_FAULT(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)
+
+# define _ASM_EXTABLE_EX(from, to)				\
+	_ASM_EXTABLE_HANDLE(from, to, ex_handler_ext)
+
 /* For C file, we already have NOKPROBE_SYMBOL macro */
 #endif
 
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a4a30e4..c0f27d7 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -90,12 +90,11 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
 	likely(!__range_not_ok(addr, size, user_addr_max()))
 
 /*
- * The exception table consists of pairs of addresses relative to the
- * exception table enty itself: the first is the address of an
- * instruction that is allowed to fault, and the second is the address
- * at which the program should continue.  No registers are modified,
- * so it is entirely up to the continuation code to figure out what to
- * do.
+ * The exception table consists of triples of addresses relative to the
+ * exception table entry itself. The first address is of an instruction
+ * that is allowed to fault, the second is the target at which the program
+ * should continue. The third is a handler function to deal with the fault
+ * caused by the instruction in the first field.
  *
  * All the routines below use bits of fixup code that are out of line
  * with the main instruction path.  This means when everything is well,
@@ -104,13 +103,14 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
  */
 
 struct exception_table_entry {
-	int insn, fixup;
+	int insn, fixup, handler;
 };
 /* This is not the generic standard exception_table_entry format */
 #define ARCH_HAS_SORT_EXTABLE
 #define ARCH_HAS_SEARCH_EXTABLE
 
-extern int fixup_exception(struct pt_regs *regs);
+extern int fixup_exception(struct pt_regs *regs, int trapnr);
+extern bool ex_has_fault_handler(unsigned long ip);
 extern int early_fixup_exception(unsigned long *ip);
 
 /*
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 1deffe6..0f05dee 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -988,7 +988,7 @@ int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
 		 * In case the user-specified fault handler returned
 		 * zero, try to fix up.
 		 */
-		if (fixup_exception(regs))
+		if (fixup_exception(regs, trapnr))
 			return 1;
 
 		/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ade185a..211c11c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -199,7 +199,7 @@ do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
 	}
 
 	if (!user_mode(regs)) {
-		if (!fixup_exception(regs)) {
+		if (!fixup_exception(regs, trapnr)) {
 			tsk->thread.error_code = error_code;
 			tsk->thread.trap_nr = trapnr;
 			die(str, regs, error_code);
@@ -453,7 +453,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 
 	tsk = current;
 	if (!user_mode(regs)) {
-		if (fixup_exception(regs))
+		if (fixup_exception(regs, X86_TRAP_GP))
 			return;
 
 		tsk->thread.error_code = error_code;
@@ -699,7 +699,7 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr)
 	conditional_sti(regs);
 
 	if (!user_mode(regs)) {
-		if (!fixup_exception(regs)) {
+		if (!fixup_exception(regs, trapnr)) {
 			task->thread.error_code = error_code;
 			task->thread.trap_nr = trapnr;
 			die(str, regs, error_code);
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 903ec1e..9dd7e4b 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -3,6 +3,9 @@
 #include <linux/sort.h>
 #include <asm/uaccess.h>
 
+typedef bool (*ex_handler_t)(const struct exception_table_entry *,
+			    struct pt_regs *, int);
+
 static inline unsigned long
 ex_insn_addr(const struct exception_table_entry *x)
 {
@@ -13,11 +16,56 @@ ex_fixup_addr(const struct exception_table_entry *x)
 {
 	return (unsigned long)&x->fixup + x->fixup;
 }
+static inline ex_handler_t
+ex_fixup_handler(const struct exception_table_entry *x)
+{
+	return (ex_handler_t)((unsigned long)&x->handler + x->handler);
+}
 
-int fixup_exception(struct pt_regs *regs)
+bool ex_handler_default(const struct exception_table_entry *fixup,
+		       struct pt_regs *regs, int trapnr)
 {
-	const struct exception_table_entry *fixup;
-	unsigned long new_ip;
+	regs->ip = ex_fixup_addr(fixup);
+	return true;
+}
+EXPORT_SYMBOL(ex_handler_default);
+
+bool ex_handler_fault(const struct exception_table_entry *fixup,
+		     struct pt_regs *regs, int trapnr)
+{
+	regs->ip = ex_fixup_addr(fixup);
+	regs->ax = trapnr;
+	return true;
+}
+EXPORT_SYMBOL_GPL(ex_handler_fault);
+
+bool ex_handler_ext(const struct exception_table_entry *fixup,
+		   struct pt_regs *regs, int trapnr)
+{
+	/* Special hack for uaccess_err */
+	current_thread_info()->uaccess_err = 1;
+	regs->ip = ex_fixup_addr(fixup);
+	return true;
+}
+EXPORT_SYMBOL(ex_handler_ext);
+
+bool ex_has_fault_handler(unsigned long ip)
+{
+	const struct exception_table_entry *e;
+	ex_handler_t handler;
+
+	e = search_exception_tables(ip);
+	if (!e)
+		return false;
+	handler = ex_fixup_handler(e);
+
+	return handler == ex_handler_fault;
+}
+
+int fixup_exception(struct pt_regs *regs, int trapnr)
+{
+	const struct exception_table_entry *e;
+	ex_handler_t handler;
 
 #ifdef CONFIG_PNPBIOS
 	if (unlikely(SEGMENT_IS_PNP_CODE(regs->cs))) {
@@ -33,42 +81,34 @@ int fixup_exception(struct pt_regs *regs)
 	}
 #endif
 
-	fixup = search_exception_tables(regs->ip);
-	if (fixup) {
-		new_ip = ex_fixup_addr(fixup);
-
-		if (fixup->fixup - fixup->insn >= 0x7ffffff0 - 4) {
-			/* Special hack for uaccess_err */
-			current_thread_info()->uaccess_err = 1;
-			new_ip -= 0x7ffffff0;
-		}
-		regs->ip = new_ip;
-		return 1;
-	}
+	e = search_exception_tables(regs->ip);
+	if (!e)
+		return 0;
 
-	return 0;
+	handler = ex_fixup_handler(e);
+	return handler(e, regs, trapnr);
 }
 
 /* Restricted version used during very early boot */
 int __init early_fixup_exception(unsigned long *ip)
 {
-	const struct exception_table_entry *fixup;
+	const struct exception_table_entry *e;
 	unsigned long new_ip;
+	ex_handler_t handler;
 
-	fixup = search_exception_tables(*ip);
-	if (fixup) {
-		new_ip = ex_fixup_addr(fixup);
+	e = search_exception_tables(*ip);
+	if (!e)
+		return 0;
 
-		if (fixup->fixup - fixup->insn >= 0x7ffffff0 - 4) {
-			/* uaccess handling not supported during early boot */
-			return 0;
-		}
+	new_ip  = ex_fixup_addr(e);
+	handler = ex_fixup_handler(e);
 
-		*ip = new_ip;
-		return 1;
-	}
+	/* special handling not supported during early boot */
+	if (handler != ex_handler_default)
+		return 0;
 
-	return 0;
+	*ip = new_ip;
+	return 1;
 }
 
 /*
@@ -133,6 +173,8 @@ void sort_extable(struct exception_table_entry *start,
 		i += 4;
 		p->fixup += i;
 		i += 4;
+		p->handler += i;
+		i += 4;
 	}
 
 	sort(start, finish - start, sizeof(struct exception_table_entry),
@@ -145,6 +187,8 @@ void sort_extable(struct exception_table_entry *start,
 		i += 4;
 		p->fixup -= i;
 		i += 4;
+		p->handler -= i;
+		i += 4;
 	}
 }
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..495946c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -656,7 +656,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	int sig;
 
 	/* Are we prepared to handle this kernel fault? */
-	if (fixup_exception(regs)) {
+	if (fixup_exception(regs, X86_TRAP_PF)) {
 		/*
 		 * Any interrupt that takes a fault gets the fixup. This makes
 		 * the below recursive fault logic only apply to a faults from
diff --git a/scripts/sortextable.c b/scripts/sortextable.c
index c2423d9..7b29fb1 100644
--- a/scripts/sortextable.c
+++ b/scripts/sortextable.c
@@ -209,6 +209,35 @@ static int compare_relative_table(const void *a, const void *b)
 	return 0;
 }
 
+static void x86_sort_relative_table(char *extab_image, int image_size)
+{
+	int i;
+
+	i = 0;
+	while (i < image_size) {
+		uint32_t *loc = (uint32_t *)(extab_image + i);
+
+		w(r(loc) + i, loc);
+		w(r(loc + 1) + i + 4, loc + 1);
+		w(r(loc + 2) + i + 8, loc + 2);
+
+		i += sizeof(uint32_t) * 3;
+	}
+
+	qsort(extab_image, image_size / 12, 12, compare_relative_table);
+
+	i = 0;
+	while (i < image_size) {
+		uint32_t *loc = (uint32_t *)(extab_image + i);
+
+		w(r(loc) - i, loc);
+		w(r(loc + 1) - (i + 4), loc + 1);
+		w(r(loc + 2) - (i + 8), loc + 2);
+
+		i += sizeof(uint32_t) * 3;
+	}
+}
+
 static void sort_relative_table(char *extab_image, int image_size)
 {
 	int i;
@@ -281,6 +310,9 @@ do_file(char const *const fname)
 		break;
 	case EM_386:
 	case EM_X86_64:
+		custom_sort = x86_sort_relative_table;
+		break;
+
 	case EM_S390:
 		custom_sort = sort_relative_table;
 		break;

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [tip:ras/core] x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
  2016-02-17 18:20 ` [PATCH v11 2/4] x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries Tony Luck
@ 2016-02-18 10:19   ` tip-bot for Tony Luck
  0 siblings, 0 replies; 52+ messages in thread
From: tip-bot for Tony Luck @ 2016-02-18 10:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, torvalds, bp, peterz, mingo, tglx, tony.luck, linux-kernel

Commit-ID:  b2f9d678e28ca71ce650eac82f26dd287b47e89a
Gitweb:     http://git.kernel.org/tip/b2f9d678e28ca71ce650eac82f26dd287b47e89a
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Wed, 17 Feb 2016 10:20:13 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:22:42 +0100

x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries

Extend the severity checking code to add a new context IN_KERN_RECOV
which is used to indicate that the machine check was triggered by code
in the kernel tagged with _ASM_EXTABLE_FAULT() so that the ex_handler_fault()
handler will provide the fixup code with the trap number.

Major re-work to the tail code in do_machine_check() to make all this
readable/maintainable. One functional change is that tolerant=3 no longer
stops recovery actions. Revert to only skipping sending SIGBUS to the
current process.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/89d243d05a7943bb187d1074bb30d9c4f482d5f5.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++-
 arch/x86/kernel/cpu/mcheck/mce.c          | 70 ++++++++++++++++---------------
 2 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 9c682c2..5119766 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/debugfs.h>
 #include <asm/mce.h>
+#include <asm/uaccess.h>
 
 #include "mce-internal.h"
 
@@ -29,7 +30,7 @@
  * panic situations)
  */
 
-enum context { IN_KERNEL = 1, IN_USER = 2 };
+enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };
 enum ser { SER_REQUIRED = 1, NO_SER = 2 };
 enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 };
 
@@ -48,6 +49,7 @@ static struct severity {
 #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
 #define  KERNEL		.context = IN_KERNEL
 #define  USER		.context = IN_USER
+#define  KERNEL_RECOV	.context = IN_KERNEL_RECOV
 #define  SER		.ser = SER_REQUIRED
 #define  NOSER		.ser = NO_SER
 #define  EXCP		.excp = EXCP_CONTEXT
@@ -87,6 +89,10 @@ static struct severity {
 		EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0)
 		),
 	MCESEV(
+		PANIC, "In kernel and no restart IP",
+		EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0)
+		),
+	MCESEV(
 		DEFERRED, "Deferred error",
 		NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED)
 		),
@@ -123,6 +129,11 @@ static struct severity {
 		MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV)
 		),
 	MCESEV(
+		AR, "Action required: data load in error recoverable area of kernel",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+		KERNEL_RECOV
+		),
+	MCESEV(
 		AR, "Action required: data load error in a user process",
 		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
 		USER
@@ -170,6 +181,9 @@ static struct severity {
 		)	/* always matches. keep at end */
 };
 
+#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \
+				(MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+
 /*
  * If mcgstatus indicated that ip/cs on the stack were
  * no good, then "m->cs" will be zero and we will have
@@ -183,7 +197,11 @@ static struct severity {
  */
 static int error_context(struct mce *m)
 {
-	return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
+	if ((m->cs & 3) == 3)
+		return IN_USER;
+	if (mc_recoverable(m->mcgstatus) && ex_has_fault_handler(m->ip))
+		return IN_KERNEL_RECOV;
+	return IN_KERNEL;
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index b718080..524f2a8 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -961,6 +961,20 @@ static void mce_clear_state(unsigned long *toclear)
 	}
 }
 
+static int do_memory_failure(struct mce *m)
+{
+	int flags = MF_ACTION_REQUIRED;
+	int ret;
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
+	if (!(m->mcgstatus & MCG_STATUS_RIPV))
+		flags |= MF_MUST_KILL;
+	ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags);
+	if (ret)
+		pr_err("Memory error not recovered");
+	return ret;
+}
+
 /*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
@@ -998,8 +1012,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	char *msg = "Unknown";
-	u64 recover_paddr = ~0ull;
-	int flags = MF_ACTION_REQUIRED;
 	int lmce = 0;
 
 	/* If this CPU is offline, just bail out. */
@@ -1136,22 +1148,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	}
 
 	/*
-	 * At insane "tolerant" levels we take no action. Otherwise
-	 * we only die if we have no other choice. For less serious
-	 * issues we try to recover, or limit damage to the current
-	 * process.
+	 * If tolerant is at an insane level we drop requests to kill
+	 * processes and continue even when there is no way out.
 	 */
-	if (cfg->tolerant < 3) {
-		if (no_way_out)
-			mce_panic("Fatal machine check on current CPU", &m, msg);
-		if (worst == MCE_AR_SEVERITY) {
-			recover_paddr = m.addr;
-			if (!(m.mcgstatus & MCG_STATUS_RIPV))
-				flags |= MF_MUST_KILL;
-		} else if (kill_it) {
-			force_sig(SIGBUS, current);
-		}
-	}
+	if (cfg->tolerant == 3)
+		kill_it = 0;
+	else if (no_way_out)
+		mce_panic("Fatal machine check on current CPU", &m, msg);
 
 	if (worst > 0)
 		mce_report_event(regs);
@@ -1159,25 +1162,24 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 out:
 	sync_core();
 
-	if (recover_paddr == ~0ull)
-		goto done;
+	if (worst != MCE_AR_SEVERITY && !kill_it)
+		goto out_ist;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx",
-		 recover_paddr);
-	/*
-	 * We must call memory_failure() here even if the current process is
-	 * doomed. We still need to mark the page as poisoned and alert any
-	 * other users of the page.
-	 */
-	ist_begin_non_atomic(regs);
-	local_irq_enable();
-	if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
-		pr_err("Memory error not recovered");
-		force_sig(SIGBUS, current);
+	/* Fault was in user mode and we need to take some action */
+	if ((m.cs & 3) == 3) {
+		ist_begin_non_atomic(regs);
+		local_irq_enable();
+
+		if (kill_it || do_memory_failure(&m))
+			force_sig(SIGBUS, current);
+		local_irq_disable();
+		ist_end_non_atomic();
+	} else {
+		if (!fixup_exception(regs, X86_TRAP_MC))
+			mce_panic("Failed kernel mode recovery", &m, NULL);
 	}
-	local_irq_disable();
-	ist_end_non_atomic();
-done:
+
+out_ist:
 	ist_exit(regs);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18  9:59     ` Peter Zijlstra
@ 2016-02-18 10:19       ` Ingo Molnar
  2016-02-18 10:29         ` Borislav Petkov
  2016-02-18 10:35         ` Peter Zijlstra
  2016-02-19  7:58       ` Ingo Molnar
  1 sibling, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2016-02-18 10:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Feb 18, 2016 at 09:21:07AM +0100, Ingo Molnar wrote:
> > 
> > * Tony Luck <tony.luck@intel.com> wrote:
> > 
> > > Make use of the EXTABLE_FAULT exception table entries. This routine
> > > returns a structure to indicate the result of the copy:
> > 
> > So the series looks good to me, but I have some (mostly readability) comments that 
> > went beyond what I usually fix up manually:
> > 
> > > struct mcsafe_ret {
> > >         u64 trapnr;
> > >         u64 remain;
> > > };
> > 
> > > +struct mcsafe_ret {
> > > +	u64 trapnr;
> > > +	u64 remain;
> > > +};
> > 
> > Yeah, so please change this to something like:
> > 
> >   struct mcsafe_ret {
> >           u64 trap_nr;
> >           u64 bytes_left;
> >   };
> > 
> > this makes it crystal clear what the fields are about and what their unit is. 
> > Readability is king and modern consoles are wide enough, no need to abbreviate 
> > excessively.
> 
> I prefer to use my modern console width to display multiple columns of
> text, instead of wasting it to display mostly whitespace. Therefore I
> still very much prefer ~80 char wide code.

This naming won't hurt the col80 limit.

> > Also, I'd suggest we postfix the new mcsafe functions with '_mcsafe', not 
> > prefix them. Special properties of memcpy routines are usually postfixes - 
> > such as _nocache(), _toio(), etc.
> 
> I think the whole notion of mcsafe here is 'wrong'. This copy variant simply 
> reports the kind of trap that happened (#PF or #MC) and could arguably be 
> extended to include more types if the hardware were to generate more.

What would a better name be? memcpy_ret() or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [tip:x86/asm] x86/cpufeature: Create a new synthetic cpu capability for machine check recovery
  2016-02-17 18:20 ` [PATCH v11 4/4] x86: Create a new synthetic cpu capability for machine check recovery Tony Luck
@ 2016-02-18 10:19   ` tip-bot for Tony Luck
  0 siblings, 0 replies; 52+ messages in thread
From: tip-bot for Tony Luck @ 2016-02-18 10:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, tony.luck, mingo, torvalds, bp, linux-kernel, peterz, tglx

Commit-ID:  0f68c088c0adb3c3bbeb487c4ebcde91fd5d34be
Gitweb:     http://git.kernel.org/tip/0f68c088c0adb3c3bbeb487c4ebcde91fd5d34be
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Wed, 17 Feb 2016 10:20:13 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:28:47 +0100

x86/cpufeature: Create a new synthetic cpu capability for machine check recovery

The Intel Software Developer Manual describes bit 24 in the MCG_CAP
MSR:

   MCG_SER_P (software error recovery support present) flag,
   bit 24 — Indicates (when set) that the processor supports
   software error recovery

But only some models with this capability bit set will actually
generate recoverable machine checks.

Check the model name and set a synthetic capability bit. Provide
a command line option to set this bit anyway in case the kernel
doesn't recognise the model name.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2e5bfb23c89800a036fb8a45fa97a74bb16bc362.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/x86/x86_64/boot-options.txt |  2 ++
 arch/x86/include/asm/cpufeatures.h        |  1 +
 arch/x86/include/asm/mce.h                |  1 +
 arch/x86/kernel/cpu/mcheck/mce.c          | 13 +++++++++++++
 4 files changed, 17 insertions(+)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 68ed311..0965a71 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -60,6 +60,8 @@ Machine check
 		threshold to 1. Enabling this may make memory predictive failure
 		analysis less effective if the bios sets thresholds for memory
 		errors since we will not see details for all errors.
+   mce=recovery
+		Force-enable recoverable machine check code paths
 
    nomce (for compatibility with i386): same as mce=off
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 0ceb6ad..6663fae 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -106,6 +106,7 @@
 #define X86_FEATURE_APERFMPERF	( 3*32+28) /* APERFMPERF */
 #define X86_FEATURE_EAGER_FPU	( 3*32+29) /* "eagerfpu" Non lazy FPU restore */
 #define X86_FEATURE_NONSTOP_TSC_S3 ( 3*32+30) /* TSC doesn't stop in S3 state */
+#define X86_FEATURE_MCE_RECOVERY ( 3*32+31) /* cpu has recoverable machine checks */
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	( 4*32+ 0) /* "pni" SSE-3 */
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 2ea4527..18d2ba9 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -113,6 +113,7 @@ struct mca_config {
 	bool ignore_ce;
 	bool disabled;
 	bool ser;
+	bool recovery;
 	bool bios_cmci_threshold;
 	u8 banks;
 	s8 bootlog;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a006f4c..b5b187c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1576,6 +1576,17 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+		/*
+		 * MCG_CAP.MCG_SER_P is necessary but not sufficient to know
+		 * whether this processor will actually generate recoverable
+		 * machine checks. Check to see if this is an E7 model Xeon.
+		 * We can't do a model number check because E5 and E7 use the
+		 * same model number. E5 doesn't support recovery, E7 does.
+		 */
+		if (mca_cfg.recovery || (mca_cfg.ser &&
+			!strncmp(c->x86_model_id,
+				 "Intel(R) Xeon(R) CPU E7-", 24)))
+			set_cpu_cap(c, X86_FEATURE_MCE_RECOVERY);
 	}
 	if (cfg->monarch_timeout < 0)
 		cfg->monarch_timeout = 0;
@@ -2028,6 +2039,8 @@ static int __init mcheck_enable(char *str)
 		cfg->bootlog = (str[0] == 'b');
 	else if (!strcmp(str, "bios_cmci_threshold"))
 		cfg->bios_cmci_threshold = true;
+	else if (!strcmp(str, "recovery"))
+		cfg->recovery = true;
 	else if (isdigit(str[0])) {
 		if (get_option(&str, &cfg->tolerant) == 2)
 			get_option(&str, &(cfg->monarch_timeout));

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 10:19       ` Ingo Molnar
@ 2016-02-18 10:29         ` Borislav Petkov
  2016-02-18 10:35         ` Peter Zijlstra
  1 sibling, 0 replies; 52+ messages in thread
From: Borislav Petkov @ 2016-02-18 10:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Tony Luck, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Linus Torvalds, Andrew Morton

On Thu, Feb 18, 2016 at 11:19:42AM +0100, Ingo Molnar wrote:
> > I think the whole notion of mcsafe here is 'wrong'. This copy variant simply 
> > reports the kind of trap that happened (#PF or #MC) and could arguably be 
> > extended to include more types if the hardware were to generate more.

It is safe in the sense that when you get an MCE while shuffling data
here, in the kernel, you don't die but you recover. Thus the exception
handling games.

So _safe() really sounds fitting here.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18  8:21   ` Ingo Molnar
  2016-02-18  9:59     ` Peter Zijlstra
@ 2016-02-18 10:29     ` Borislav Petkov
  2016-02-18 10:34       ` Ingo Molnar
  2016-02-18 21:14     ` [PATCH v12] x86, mce: Add memcpy_trap() Luck, Tony
  2 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2016-02-18 10:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Linus Torvalds, Andrew Morton, Peter Zijlstra

On Thu, Feb 18, 2016 at 09:21:07AM +0100, Ingo Molnar wrote:
> > +#ifndef CONFIG_UML
> 
> Why is this UML quirk needed? No other memcpy functions have it. Theoretically UML 
> could introduce the notion of #MC interruption.

https://lkml.kernel.org/r/56B7AEEE.5070504@nod.at

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 10:29     ` Borislav Petkov
@ 2016-02-18 10:34       ` Ingo Molnar
  2016-02-18 10:36         ` Borislav Petkov
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-02-18 10:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Linus Torvalds, Andrew Morton, Peter Zijlstra


* Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Feb 18, 2016 at 09:21:07AM +0100, Ingo Molnar wrote:
> > > +#ifndef CONFIG_UML
> > 
> > Why is this UML quirk needed? No other memcpy functions have it. Theoretically UML 
> > could introduce the notion of #MC interruption.
> 
> https://lkml.kernel.org/r/56B7AEEE.5070504@nod.at

Does the build fail - or is it just an unused function? If the latter then I'd 
rather leave the #ifdef out.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 10:19       ` Ingo Molnar
  2016-02-18 10:29         ` Borislav Petkov
@ 2016-02-18 10:35         ` Peter Zijlstra
  2016-02-18 14:59           ` Luck, Tony
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2016-02-18 10:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton

On Thu, Feb 18, 2016 at 11:19:42AM +0100, Ingo Molnar wrote:
> > I think the whole notion of mcsafe here is 'wrong'. This copy variant simply 
> > reports the kind of trap that happened (#PF or #MC) and could arguably be 
> > extended to include more types if the hardware were to generate more.
> 
> What would a better name be? memcpy_ret() or so?

Yeah, uhmm.. naming. More options from the lack of inspiration department:

  memcpy_trap()
  memcpy_ex()

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 10:34       ` Ingo Molnar
@ 2016-02-18 10:36         ` Borislav Petkov
  2016-02-18 18:48           ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Borislav Petkov @ 2016-02-18 10:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Linus Torvalds, Andrew Morton, Peter Zijlstra

On Thu, Feb 18, 2016 at 11:34:20AM +0100, Ingo Molnar wrote:
> Does the build fail - or is it just an unused function? If the latter
> then I'd rather leave the #ifdef out.

Yep, it does fail:

https://lkml.kernel.org/r/20160207165524.GF5862@pd.tnic

due to ex_handler_fault not being visible to UML.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 10:35         ` Peter Zijlstra
@ 2016-02-18 14:59           ` Luck, Tony
  0 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2016-02-18 14:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Borislav Petkov,
	Linus Torvalds, Andrew Morton

> > > I think the whole notion of mcsafe here is 'wrong'. This copy variant simply 
> > > reports the kind of trap that happened (#PF or #MC) and could arguably be 
> > > extended to include more types if the hardware were to generate more.
> > 
> > What would a better name be? memcpy_ret() or so?
>
> Yeah, uhmm.. naming. More options from the lack of inspiration department:
>
>   memcpy_trap()
>   memcpy_ex()

I like "memcpy_trap" - thanks Peter.  I'll redo this part with all the other fixes
suggested by Ingo.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-17 18:20 ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Tony Luck
  2016-02-18  8:21   ` Ingo Molnar
@ 2016-02-18 18:12   ` Linus Torvalds
  2016-02-18 18:51     ` Ingo Molnar
  2016-02-18 18:52     ` Luck, Tony
  1 sibling, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2016-02-18 18:12 UTC (permalink / raw)
  To: Tony Luck; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Wed, Feb 17, 2016 at 10:20 AM, Tony Luck <tony.luck@intel.com> wrote:
>
> If we faulted during the copy, then 'trapnr' will say which type
> of trap (X86_TRAP_PF or X86_TRAP_MC) and 'remain' says how many
> bytes were not copied.

So apart from the naming, a couple of questions:

 - I'd like to see the actual *use* case explained, not just what it does.

 - why does this use the complex - and slower, on modern machines -
unrolled manual memory copy, when you might as well just use a single

     rep ; movsb

    which not only makes it smaller, but makes the exception fixup trivial.

 - why not make the "bytes remaining" the same as for a user-space
copy (ie return it as the return value)?

 - at that point, it ends up looking a *lot* like uaccess_try/catch,
which gets the error code from current_thread_info()->uaccess_err

Hmm?

          Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 10:36         ` Borislav Petkov
@ 2016-02-18 18:48           ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2016-02-18 18:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Linus Torvalds, Andrew Morton, Peter Zijlstra


* Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Feb 18, 2016 at 11:34:20AM +0100, Ingo Molnar wrote:
> > Does the build fail - or is it just an unused function? If the latter
> > then I'd rather leave the #ifdef out.
> 
> Yep, it does fail:
> 
> https://lkml.kernel.org/r/20160207165524.GF5862@pd.tnic
> 
> due to ex_handler_fault not being visible to UML.

Ok, fair enough.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 18:12   ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Linus Torvalds
@ 2016-02-18 18:51     ` Ingo Molnar
  2016-02-18 18:52     ` Luck, Tony
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2016-02-18 18:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tony Luck, Linux Kernel Mailing List, Peter Zijlstra,
	Thomas Gleixner, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Feb 17, 2016 at 10:20 AM, Tony Luck <tony.luck@intel.com> wrote:
> >
> > If we faulted during the copy, then 'trapnr' will say which type
> > of trap (X86_TRAP_PF or X86_TRAP_MC) and 'remain' says how many
> > bytes were not copied.
> 
> So apart from the naming, a couple of questions:
> 
>  - I'd like to see the actual *use* case explained, not just what it does.
> 
>  - why does this use the complex - and slower, on modern machines -
> unrolled manual memory copy, when you might as well just use a single
> 
>      rep ; movsb
> 
>     which not only makes it smaller, but makes the exception fixup trivial.
>
>  - why not make the "bytes remaining" the same as for a user-space
> copy (ie return it as the return value)?
> 
>  - at that point, it ends up looking a *lot* like uaccess_try/catch,
> which gets the error code from current_thread_info()->uaccess_err
> 
> Hmm?

memcpy_try()/memcpy_catch() definitely has a nice ring to it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 18:12   ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Linus Torvalds
  2016-02-18 18:51     ` Ingo Molnar
@ 2016-02-18 18:52     ` Luck, Tony
  2016-02-18 20:14       ` Ingo Molnar
  1 sibling, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-02-18 18:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Thu, Feb 18, 2016 at 10:12:42AM -0800, Linus Torvalds wrote:
> On Wed, Feb 17, 2016 at 10:20 AM, Tony Luck <tony.luck@intel.com> wrote:
> >
> > If we faulted during the copy, then 'trapnr' will say which type
> > of trap (X86_TRAP_PF or X86_TRAP_MC) and 'remain' says how many
> > bytes were not copied.
> 
> So apart from the naming, a couple of questions:
> 
>  - I'd like to see the actual *use* case explained, not just what it does.

First user is libnvdimm. Dan Williams already has code to use this
so that kernel code accessing persistent memory can return -EIO to
a user instead of crashing the system if the cpu runs into an
uncorrected error during the copy.

I would also lkie use this for a machine check aware
copy_from_user() which would avoid crashing the kernel 
when the uncorrected error is in a user page (we can SIGBUS
the user just like we do if the user touched the poison themself).

copy_to_user() is also interesting if the source address is the
page cache. I think we can also avoid crashing the kernel in this
case too - but I haven't thought that all the way through.

>  - why does this use the complex - and slower, on modern machines -
> unrolled manual memory copy, when you might as well just use a single
> 
>      rep ; movsb
> 
>     which not only makes it smaller, but makes the exception fixup trivial.

Because current generation cpus don't give a recoverable machine
check if we consume with a "rep ; movsb" :-(
When we have that we can pick the best copy function based
on the capabilities of the cpu we are running on.

>  - why not make the "bytes remaining" the same as for a user-space
> copy (ie return it as the return value)?
> 
>  - at that point, it ends up looking a *lot* like uaccess_try/catch,
> which gets the error code from current_thread_info()->uaccess_err

For my copy_from_user/copy_to_user cases we need to know both the
number of remaining bytes and also *why* we stopped copying. We
might have #PF, in which case we return -EFAULT to the user, if
we have #MC then the recovery path is different (need to offline
the page, SIGBUS the user, ...)

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-02-25 22:11                         ` Andy Lutomirski
@ 2016-02-18 19:47                           ` Tony Luck
  2016-03-02 20:47                             ` Luck, Tony
                                               ` (2 more replies)
  2016-02-26  0:58                           ` [PATCH v13] x86, mce: Add memcpy_trap() Linus Torvalds
  1 sibling, 3 replies; 52+ messages in thread
From: Tony Luck @ 2016-02-18 19:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Thomas Gleixner, Tony Luck,
	Borislav Petkov, Andrew Morton, Tony Luck, Peter Zijlstra,
	Ingo Molnar, H. Peter Anvin, Andy Lutomirski

Make use of the EXTABLE_FAULT exception table entries to write
a kernel copy routine that doesn't crash the system if it
encounters a machine check. Prime use case for this is to copy
from large arrays of non-volatile memory used as storage.

We have to use an unrolled copy loop for now because current
hardware implementations treat a machine check in "rep mov"
as fatal. When that is fixed we can simplify.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Is this what we want now?  Return type is a "bool". True means
that we copied OK, false means that it didn't (this is all that
Dan says that he needs).  Dropped all the complex code to figure
out how many bytes we didn't copy as Linus says this isn't the
right place to do this (and besides we should just make "rep mov"
work). Changed the name from "_trap" since we no longer return
the trap number.  I'm not wedded to "memcpy_mcsafe" though, so
feel free to suggest alternates.

Nothing to do with copy_from_user() in here ... just completing
the patch series that we need for non-volatile storage.

 arch/x86/include/asm/string_64.h |  13 +++++
 arch/x86/kernel/x8664_ksyms_64.c |   2 +
 arch/x86/lib/memcpy_64.S         | 117 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ff8b9a17dc4b..ca6ba3607705 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -78,6 +78,19 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+/**
+ * memcpy_mcsafe - copy memory with indication if a machine check happened
+ *
+ * @dst:	destination address
+ * @src:	source address
+ * @cnt:	number of bytes to copy
+ *
+ * Low level memory copy function that catches machine checks
+ *
+ * Return true for success, false for fail
+ */
+bool memcpy_mcsafe(void *dst, const void *src, size_t cnt);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index a0695be19864..cd05942bc918 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
 EXPORT_SYMBOL(_copy_from_user);
 EXPORT_SYMBOL(_copy_to_user);
 
+EXPORT_SYMBOL_GPL(memcpy_mcsafe);
+
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 16698bba87de..7d37641ada5b 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -177,3 +177,120 @@ ENTRY(memcpy_orig)
 .Lend:
 	retq
 ENDPROC(memcpy_orig)
+
+#ifndef CONFIG_UML
+/*
+ * memcpy_mcsafe - memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ */
+ENTRY(memcpy_mcsafe)
+	cmpl $8, %edx
+	/* Less than 8 bytes? Go to byte copy loop */
+	jb .L_no_whole_words
+
+	/* Check for bad alignment of source */
+	testl $7, %esi
+	/* Already aligned */
+	jz .L_8byte_aligned
+
+	/* Copy one byte at a time until source is 8-byte aligned */
+	movl %esi, %ecx
+	andl $7, %ecx
+	subl $8, %ecx
+	negl %ecx
+	subl %ecx, %edx
+.L_copy_leading_bytes:
+	movb (%rsi), %al
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_leading_bytes
+
+.L_8byte_aligned:
+	/* Figure out how many whole cache lines (64-bytes) to copy */
+	movl %edx, %ecx
+	andl $63, %edx
+	shrl $6, %ecx
+	jz .L_no_whole_cache_lines
+
+	/* Loop copying whole cache lines */
+.L_cache_w0: movq (%rsi), %r8
+.L_cache_w1: movq 1*8(%rsi), %r9
+.L_cache_w2: movq 2*8(%rsi), %r10
+.L_cache_w3: movq 3*8(%rsi), %r11
+	movq %r8, (%rdi)
+	movq %r9, 1*8(%rdi)
+	movq %r10, 2*8(%rdi)
+	movq %r11, 3*8(%rdi)
+.L_cache_w4: movq 4*8(%rsi), %r8
+.L_cache_w5: movq 5*8(%rsi), %r9
+.L_cache_w6: movq 6*8(%rsi), %r10
+.L_cache_w7: movq 7*8(%rsi), %r11
+	movq %r8, 4*8(%rdi)
+	movq %r9, 5*8(%rdi)
+	movq %r10, 6*8(%rdi)
+	movq %r11, 7*8(%rdi)
+	leaq 64(%rsi), %rsi
+	leaq 64(%rdi), %rdi
+	decl %ecx
+	jnz .L_cache_w0
+
+	/* Are there any trailing 8-byte words? */
+.L_no_whole_cache_lines:
+	movl %edx, %ecx
+	andl $7, %edx
+	shrl $3, %ecx
+	jz .L_no_whole_words
+
+	/* Copy trailing words */
+.L_copy_trailing_words:
+	movq (%rsi), %r8
+	mov %r8, (%rdi)
+	leaq 8(%rsi), %rsi
+	leaq 8(%rdi), %rdi
+	decl %ecx
+	jnz .L_copy_trailing_words
+
+	/* Any trailing bytes? */
+.L_no_whole_words:
+	andl %edx, %edx
+	jz .L_done_memcpy_trap
+
+	/* Copy trailing bytes */
+	movl %edx, %ecx
+.L_copy_trailing_bytes:
+	movb (%rsi), %al
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_trailing_bytes
+
+	/* Copy successful. Return true */
+.L_done_memcpy_trap:
+	xorq %rax, %rax
+	ret
+ENDPROC(memcpy_mcsafe)
+
+	.section .fixup, "ax"
+	/* Return false for any failure */
+.L_memcpy_mcsafe_fail:
+	mov	$1, %rax
+	ret
+
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_copy_leading_bytes, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w0, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w1, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w3, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w3, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w4, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w5, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w6, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w7, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_words, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_bytes, .L_memcpy_mcsafe_fail)
+#endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 18:52     ` Luck, Tony
@ 2016-02-18 20:14       ` Ingo Molnar
  2016-02-18 21:33         ` Dan Williams
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-02-18 20:14 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Linus Torvalds, Linux Kernel Mailing List, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra


* Luck, Tony <tony.luck@intel.com> wrote:

> On Thu, Feb 18, 2016 at 10:12:42AM -0800, Linus Torvalds wrote:
> > On Wed, Feb 17, 2016 at 10:20 AM, Tony Luck <tony.luck@intel.com> wrote:
> > >
> > > If we faulted during the copy, then 'trapnr' will say which type
> > > of trap (X86_TRAP_PF or X86_TRAP_MC) and 'remain' says how many
> > > bytes were not copied.
> > 
> > So apart from the naming, a couple of questions:
> > 
> >  - I'd like to see the actual *use* case explained, not just what it does.
> 
> First user is libnvdimm. Dan Williams already has code to use this so that 
> kernel code accessing persistent memory can return -EIO to a user instead of 
> crashing the system if the cpu runs into an uncorrected error during the copy.

Are these the memcpy_*_pmem() calls in drivers/nvdimm/pmem.c? Is there any actual 
patch to look at?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v12] x86, mce: Add memcpy_trap()
  2016-02-18  8:21   ` Ingo Molnar
  2016-02-18  9:59     ` Peter Zijlstra
  2016-02-18 10:29     ` Borislav Petkov
@ 2016-02-18 21:14     ` Luck, Tony
  2016-02-19  9:10       ` Ingo Molnar
  2 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-02-18 21:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Peter Zijlstra

Make use of the EXTABLE_FAULT exception table entries. This routine
returns a structure to indicate the result of the copy:

struct mcsafe_ret {
        u64 trap_nr;
        u64 bytes_left;
};

If the copy is successful, then both 'trap_nr' and 'bytes_left' are zero.

If we faulted during the copy, then 'trap_nr' will say which type
of trap (X86_TRAP_PF or X86_TRAP_MC) and 'bytes_left' says how many
bytes were not copied.

Note that this is probably the first of several copy functions.
We can make new ones for non-temporal cache handling etc.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V12 (part3 only - parts 1,2,4 are in tip tree)

Ingo:	More meaningful names for fields of return structure
PeterZ:	Better name for copy function: memcpy_trap()
Ingo:	Don't use numeric labels in new asm code
Ingo:	Consistent capitalization in comments.
Ingo:	Periods between sentences in comments.

Not addressed: Linus' comment that perhaps we could try/catch
syntactic sugar instead of returning multiple values in a structure.

 arch/x86/include/asm/string_64.h |  26 +++++++
 arch/x86/kernel/x8664_ksyms_64.c |   2 +
 arch/x86/lib/memcpy_64.S         | 158 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 186 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ff8b9a17dc4b..65e5793b7590 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -78,6 +78,32 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+/**
+ * struct memcpy_trap_ret - return value from memcpy_trap()
+ *
+ * @trap_nr	x86 trap number if the copy failed
+ * @bytes_left	zero for successful copy else number of bytes not copied
+ */
+struct memcpy_trap_ret {
+	u64 trap_nr;
+	u64 bytes_left;
+};
+
+/**
+ * memcpy_trap - copy memory with indication if a trap interrupted the copy
+ *
+ * @dst:	destination address
+ * @src:	source address
+ * @cnt:	number of bytes to copy
+ *
+ * Low level memory copy function that catches traps and indicates whether
+ * the copy succeeded and if not, why it failed.
+ *
+ * Return is struct memcpy_trap_ret which provides both the number of bytes
+ * not copied and the reason for the failure.
+ */
+struct memcpy_trap_ret memcpy_trap(void *dst, const void __user *src, size_t cnt);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index a0695be19864..40866e2cbcc4 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
 EXPORT_SYMBOL(_copy_from_user);
 EXPORT_SYMBOL(_copy_to_user);
 
+EXPORT_SYMBOL_GPL(memcpy_trap);
+
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 16698bba87de..aecdfc41c114 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -177,3 +177,161 @@ ENTRY(memcpy_orig)
 .Lend:
 	retq
 ENDPROC(memcpy_orig)
+
+#ifndef CONFIG_UML
+/*
+ * memcpy_trap - memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ */
+ENTRY(memcpy_trap)
+	cmpl $8,%edx
+	/* Less than 8 bytes? Go to byte copy loop */
+	jb .L_no_whole_words
+
+	/* Check for bad alignment of source */
+	testl $7,%esi
+	/* Already aligned */
+	jz .L_8byte_aligned
+
+	/* Copy one byte at a time until source is 8-byte aligned */
+	movl %esi,%ecx
+	andl $7,%ecx
+	subl $8,%ecx
+	negl %ecx
+	subl %ecx,%edx
+.L_copy_leading_bytes:
+	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_leading_bytes
+
+.L_8byte_aligned:
+	/* Figure out how many whole cache lines (64-bytes) to copy */
+	movl %edx,%ecx
+	andl $63,%edx
+	shrl $6,%ecx
+	jz .L_no_whole_cache_lines
+
+	/* Loop copying whole cache lines */
+.L_cache_w0: movq (%rsi),%r8
+.L_cache_w1: movq 1*8(%rsi),%r9
+.L_cache_w2: movq 2*8(%rsi),%r10
+.L_cache_w3: movq 3*8(%rsi),%r11
+	movq %r8,(%rdi)
+	movq %r9,1*8(%rdi)
+	movq %r10,2*8(%rdi)
+	movq %r11,3*8(%rdi)
+.L_cache_w4: movq 4*8(%rsi),%r8
+.L_cache_w5: movq 5*8(%rsi),%r9
+.L_cache_w6: movq 6*8(%rsi),%r10
+.L_cache_w7: movq 7*8(%rsi),%r11
+	movq %r8,4*8(%rdi)
+	movq %r9,5*8(%rdi)
+	movq %r10,6*8(%rdi)
+	movq %r11,7*8(%rdi)
+	leaq 64(%rsi),%rsi
+	leaq 64(%rdi),%rdi
+	decl %ecx
+	jnz .L_cache_w0
+
+	/* Are there any trailing 8-byte words? */
+.L_no_whole_cache_lines:
+	movl %edx,%ecx
+	andl $7,%edx
+	shrl $3,%ecx
+	jz .L_no_whole_words
+
+	/* Copy trailing words */
+.L_copy_trailing_words:
+	movq (%rsi),%r8
+	mov %r8,(%rdi)
+	leaq 8(%rsi),%rsi
+	leaq 8(%rdi),%rdi
+	decl %ecx
+	jnz .L_copy_trailing_words
+
+	/* Any trailing bytes? */
+.L_no_whole_words:
+	andl %edx,%edx
+	jz .L_done_memcpy_trap
+
+	/* Copy trailing bytes */
+	movl %edx,%ecx
+.L_copy_trailing_bytes:
+	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_trailing_bytes
+
+	/* Copy successful. Return .remain = 0, .trapnr = 0 */
+.L_done_memcpy_trap:
+	xorq %rax, %rax
+	xorq %rdx, %rdx
+	ret
+
+	.section .fixup,"ax"
+	/*
+	 * The machine check handler loaded %rax with trap number.
+	 * We just need to make sure %edx has the number of
+	 * bytes remaining.
+	 */
+.L_fix_leading_bytes:
+	add %ecx,%edx
+	ret
+.L_fix_cache_w0:
+	shl $6,%ecx
+	add %ecx,%edx
+	ret
+.L_fix_cache_w1:
+	shl $6,%ecx
+	lea -8(%ecx,%edx),%edx
+	ret
+.L_fix_cache_w2:
+	shl $6,%ecx
+	lea -16(%ecx,%edx),%edx
+	ret
+.L_fix_cache_w3:
+	shl $6,%ecx
+	lea -24(%ecx,%edx),%edx
+	ret
+.L_fix_cache_w4:
+	shl $6,%ecx
+	lea -32(%ecx,%edx),%edx
+	ret
+.L_fix_cache_w5:
+	shl $6,%ecx
+	lea -40(%ecx,%edx),%edx
+	ret
+.L_fix_cache_w6:
+	shl $6,%ecx
+	lea -48(%ecx,%edx),%edx
+	ret
+.L_fix_cache_w7:
+	shl $6,%ecx
+	lea -56(%ecx,%edx),%edx
+	ret
+.L_fix_trailing_words:
+	lea (%rdx,%rcx,8),%rdx
+	ret
+.L_fix_trailing_bytes:
+	mov %ecx,%edx
+	ret
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_copy_leading_bytes,.L_fix_leading_bytes)
+	_ASM_EXTABLE_FAULT(.L_cache_w0,.L_fix_cache_w0)
+	_ASM_EXTABLE_FAULT(.L_cache_w1,.L_fix_cache_w1)
+	_ASM_EXTABLE_FAULT(.L_cache_w3,.L_fix_cache_w2)
+	_ASM_EXTABLE_FAULT(.L_cache_w3,.L_fix_cache_w3)
+	_ASM_EXTABLE_FAULT(.L_cache_w4,.L_fix_cache_w4)
+	_ASM_EXTABLE_FAULT(.L_cache_w5,.L_fix_cache_w5)
+	_ASM_EXTABLE_FAULT(.L_cache_w6,.L_fix_cache_w6)
+	_ASM_EXTABLE_FAULT(.L_cache_w7,.L_fix_cache_w7)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_words,.L_fix_trailing_words)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_bytes,.L_fix_trailing_bytes)
+#endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18 20:14       ` Ingo Molnar
@ 2016-02-18 21:33         ` Dan Williams
  0 siblings, 0 replies; 52+ messages in thread
From: Dan Williams @ 2016-02-18 21:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Luck, Tony, Linus Torvalds, Linux Kernel Mailing List,
	Thomas Gleixner, H. Peter Anvin, Peter Zijlstra

On Thu, Feb 18, 2016 at 12:14 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Luck, Tony <tony.luck@intel.com> wrote:
>
>> On Thu, Feb 18, 2016 at 10:12:42AM -0800, Linus Torvalds wrote:
>> > On Wed, Feb 17, 2016 at 10:20 AM, Tony Luck <tony.luck@intel.com> wrote:
>> > >
>> > > If we faulted during the copy, then 'trapnr' will say which type
>> > > of trap (X86_TRAP_PF or X86_TRAP_MC) and 'remain' says how many
>> > > bytes were not copied.
>> >
>> > So apart from the naming, a couple of questions:
>> >
>> >  - I'd like to see the actual *use* case explained, not just what it does.
>>
>> First user is libnvdimm. Dan Williams already has code to use this so that
>> kernel code accessing persistent memory can return -EIO to a user instead of
>> crashing the system if the cpu runs into an uncorrected error during the copy.
>
> Are these the memcpy_*_pmem() calls in drivers/nvdimm/pmem.c? Is there any actual
> patch to look at?
>

Here's the integration patch I had from the version of mcsafe_copy()
at the beginning of January.  Pardon the whitespace damage... the
original thread is here: [1].  Note that the "badblocks" intergration
portion of that set went upstream in v4.5-rc1.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003864.html

---

Subject: x86, pmem: use __mcsafe_copy() for memcpy_from_pmem()

In support of large capacity persistent memory use __mcsafe_copy() for
pmem I/O.  This allows the pmem driver to support an error model similar
to disks when machine check recovery is available.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Borislav Petkov <bp@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |   16 ++++++++++++++++
 drivers/nvdimm/Kconfig      |    1 +
 drivers/nvdimm/pmem.c       |   10 ++++++----
 include/linux/pmem.h        |   17 +++++++++++++----
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..4ef301e78a2b 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -17,6 +17,7 @@
 #include <asm/cacheflush.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
+#include <asm/string.h>

 #ifdef CONFIG_ARCH_HAS_PMEM_API
 /**
@@ -47,6 +48,21 @@ static inline void arch_memcpy_to_pmem(void __pmem
*dst, const void *src,
                BUG();
 }

+static inline int arch_memcpy_from_pmem(void *dst, const void __pmem *src,
+               size_t n)
+{
+       if (IS_ENABLED(CONFIG_MCE_KERNEL_RECOVERY)) {
+               struct mcsafe_ret ret;
+
+               ret = __mcsafe_copy(dst, (void __force *) src, n);
+               if (ret.remain)
+                       return -EIO;
+               return 0;
+       }
+       memcpy(dst, (void __force *) src, n);
+       return 0;
+}
+
 /**
  * arch_wmb_pmem - synchronize writes to persistent memory
  *
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 53c11621d5b1..fe5885d01fd8 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -22,6 +22,7 @@ config BLK_DEV_PMEM
        depends on HAS_IOMEM
        select ND_BTT if BTT
        select ND_PFN if NVDIMM_PFN
+       select MCE_KERNEL_RECOVERY if X86_MCE && X86_64
        help
          Memory ranges for PMEM are described by either an NFIT
          (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8744235b5be2..d8e14e962327 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -62,6 +62,7 @@ static bool is_bad_pmem(struct badblocks *bb,
sector_t sector, unsigned int len)
 static int pmem_do_bvec(struct block_device *bdev, struct page *page,
                unsigned int len, unsigned int off, int rw, sector_t sector)
 {
+       int rc = 0;
        void *mem = kmap_atomic(page);
        struct gendisk *disk = bdev->bd_disk;
        struct pmem_device *pmem = disk->private_data;
@@ -71,7 +72,7 @@ static int pmem_do_bvec(struct block_device *bdev,
struct page *page,
        if (rw == READ) {
                if (unlikely(is_bad_pmem(disk->bb, sector, len)))
                        return -EIO;
-               memcpy_from_pmem(mem + off, pmem_addr, len);
+               rc = memcpy_from_pmem(mem + off, pmem_addr, len);
                flush_dcache_page(page);
        } else {
                flush_dcache_page(page);
@@ -79,7 +80,7 @@ static int pmem_do_bvec(struct block_device *bdev,
struct page *page,
        }

        kunmap_atomic(mem);
-       return 0;
+       return rc;
 }

 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
@@ -237,6 +238,7 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
                resource_size_t offset, void *buf, size_t size, int rw)
 {
        struct pmem_device *pmem = dev_get_drvdata(ndns->claim);
+       int rc = 0;

        if (unlikely(offset + size > pmem->size)) {
                dev_WARN_ONCE(&ndns->dev, 1, "request out of range\n");
@@ -244,13 +246,13 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
        }

        if (rw == READ)
-               memcpy_from_pmem(buf, pmem->virt_addr + offset, size);
+               rc = memcpy_from_pmem(buf, pmem->virt_addr + offset, size);
        else {
                memcpy_to_pmem(pmem->virt_addr + offset, buf, size);
                wmb_pmem();
        }

-       return 0;
+       return rc;
 }

 static int nd_pfn_init(struct nd_pfn *nd_pfn)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index acfea8ce4a07..0e57a5beab21 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -42,6 +42,13 @@ static inline void arch_memcpy_to_pmem(void __pmem
*dst, const void *src,
        BUG();
 }

+static inline int arch_memcpy_from_pmem(void *dst,
+               const void __pmem *src, size_t n)
+{
+       BUG();
+       return 0;
+}
+
 static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
                struct iov_iter *i)
 {
@@ -57,12 +64,14 @@ static inline void arch_clear_pmem(void __pmem
*addr, size_t size)

 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
- * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * implementations for arch_memcpy_to_pmem(), arch_memcpy_from_pmem(),
+ * arch_wmb_pmem(), arch_copy_from_iter_pmem(), arch_clear_pmem() and
+ * arch_has_wmb_pmem().
  */
-static inline void memcpy_from_pmem(void *dst, void __pmem const
*src, size_t size)
+static inline int memcpy_from_pmem(void *dst, void __pmem const *src,
+               size_t size)
 {
-       memcpy(dst, (void __force const *) src, size);
+       return arch_memcpy_from_pmem(dst, src, size);
 }

 static inline bool arch_has_pmem_api(void)

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-18  9:59     ` Peter Zijlstra
  2016-02-18 10:19       ` Ingo Molnar
@ 2016-02-19  7:58       ` Ingo Molnar
  2016-02-19  8:43         ` Peter Zijlstra
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-02-19  7:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> > Yeah, so please change this to something like:
> > 
> >   struct mcsafe_ret {
> >           u64 trap_nr;
> >           u64 bytes_left;
> >   };
> > 
> > this makes it crystal clear what the fields are about and what their unit is. 
> > Readability is king and modern consoles are wide enough, no need to abbreviate 
> > excessively.
> 
> I prefer to use my modern console width to display multiple columns of text, 
> instead of wasting it to display mostly whitespace. Therefore I still very much 
> prefer ~80 char wide code.

Btw., the main reason I hate the col80 limit is that I see such patches 
frequently:

 void pcibios_add_bus(struct pci_bus *bus)
 {
+#ifdef CONFIG_DMI
+       const struct dmi_device *dmi;
+       struct dmi_dev_onboard *dslot;
+       char sname[128];
+
+       dmi = NULL;
+       while ((dmi = dmi_find_device(DMI_DEV_TYPE_DEV_SLOT,
+                                     NULL, dmi)) != NULL) {
+               dslot = dmi->device_data;
+               if (dslot->segment == pci_domain_nr(bus) &&
+                   dslot->bus == bus->number) {
+                       dev_info(&bus->dev, "Found SMBIOS Slot %s\n",
+                                dslot->dev.name);
+                       snprintf(sname, sizeof(sname), "%s-%d",
+                                dslot->dev.name,
+                                dslot->instance);
+                       pci_create_slot(bus, dslot->devfn,
+                                       sname, NULL);
+               }
+       }
+#endif
        acpi_pci_add_bus(bus);

Which gobbledygook has 6 (!) col80 artifacts - and it's a pretty straightforward 
piece of code with just 2 levels of indentation.

It is IMHO much more readable in the following form:

 void pcibios_add_bus(struct pci_bus *bus)
 {
 #ifdef CONFIG_DMI
        const struct dmi_device *dmi;
        struct dmi_dev_onboard *dslot;
        char sname[128];

        dmi = NULL;
        while ((dmi = dmi_find_device(DMI_DEV_TYPE_DEV_SLOT, NULL, dmi)) != NULL) {
                dslot = dmi->device_data;
                if (dslot->segment == pci_domain_nr(bus) && dslot->bus == bus->number) {
                        dev_info(&bus->dev, "Found SMBIOS Slot %s\n", dslot->dev.name);
                        snprintf(sname, sizeof(sname), "%s-%d", dslot->dev.name, dslot->instance);
                        pci_create_slot(bus, dslot->devfn, sname, NULL);
                }
        }
 #endif
        acpi_pci_add_bus(bus);

BYMMV.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-19  7:58       ` Ingo Molnar
@ 2016-02-19  8:43         ` Peter Zijlstra
  2016-02-19  9:51           ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2016-02-19  8:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton

On Fri, Feb 19, 2016 at 08:58:43AM +0100, Ingo Molnar wrote:
> > I prefer to use my modern console width to display multiple columns of text, 
> > instead of wasting it to display mostly whitespace. Therefore I still very much 
> > prefer ~80 char wide code.
> 
> Btw., the main reason I hate the col80 limit is that I see such patches 
> frequently:
> 
>  void pcibios_add_bus(struct pci_bus *bus)
>  {
> +#ifdef CONFIG_DMI
> +       const struct dmi_device *dmi;
> +       struct dmi_dev_onboard *dslot;
> +       char sname[128];
> +
> +       dmi = NULL;
> +       while ((dmi = dmi_find_device(DMI_DEV_TYPE_DEV_SLOT,
> +                                     NULL, dmi)) != NULL) {
> +               dslot = dmi->device_data;
> +               if (dslot->segment == pci_domain_nr(bus) &&
> +                   dslot->bus == bus->number) {
> +                       dev_info(&bus->dev, "Found SMBIOS Slot %s\n",
> +                                dslot->dev.name);
> +                       snprintf(sname, sizeof(sname), "%s-%d",
> +                                dslot->dev.name,
> +                                dslot->instance);
> +                       pci_create_slot(bus, dslot->devfn,
> +                                       sname, NULL);
> +               }
> +       }
> +#endif
>         acpi_pci_add_bus(bus);
> 
> Which gobbledygook has 6 (!) col80 artifacts - and it's a pretty straightforward 
> piece of code with just 2 levels of indentation.
> 
> It is IMHO much more readable in the following form:
> 
>  void pcibios_add_bus(struct pci_bus *bus)
>  {
>  #ifdef CONFIG_DMI
>         const struct dmi_device *dmi;
>         struct dmi_dev_onboard *dslot;
>         char sname[128];
> 
>         dmi = NULL;
>         while ((dmi = dmi_find_device(DMI_DEV_TYPE_DEV_SLOT, NULL, dmi)) != NULL) {
>                 dslot = dmi->device_data;
>                 if (dslot->segment == pci_domain_nr(bus) && dslot->bus == bus->number) {
>                         dev_info(&bus->dev, "Found SMBIOS Slot %s\n", dslot->dev.name);
>                         snprintf(sname, sizeof(sname), "%s-%d", dslot->dev.name, dslot->instance);
>                         pci_create_slot(bus, dslot->devfn, sname, NULL);
>                 }
>         }
>  #endif
>         acpi_pci_add_bus(bus);
> 
> BYMMV.

So I mostly agree, although your example does wrap on my normal display
width. The thing is though, we have to have a limit, otherwise people
will completely let loose and we'll end up with lines >200 chars wide
(I've worked on code like that in the past, and its a right pain).

And 80 has so far mostly worked just fine. Its just that people seem
unable to take guidelines as just than, and instead produce the most
horrible code just to adhere to checkpatch or whatnot.

And I'd much rather have an extra column of code than waste a lot of
screen-estate to display mostly whitespace.

Also, this 'artificial' limit on indentation level does in general
encourage people to write saner code.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v12] x86, mce: Add memcpy_trap()
  2016-02-18 21:14     ` [PATCH v12] x86, mce: Add memcpy_trap() Luck, Tony
@ 2016-02-19  9:10       ` Ingo Molnar
  2016-02-19 17:53         ` [PATCH v13] " Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-02-19  9:10 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Peter Zijlstra


* Luck, Tony <tony.luck@intel.com> wrote:

> Make use of the EXTABLE_FAULT exception table entries. This routine
> returns a structure to indicate the result of the copy:
> 
> struct mcsafe_ret {
>         u64 trap_nr;
>         u64 bytes_left;
> };
> 
> If the copy is successful, then both 'trap_nr' and 'bytes_left' are zero.
> 
> If we faulted during the copy, then 'trap_nr' will say which type
> of trap (X86_TRAP_PF or X86_TRAP_MC) and 'bytes_left' says how many
> bytes were not copied.
> 
> Note that this is probably the first of several copy functions.
> We can make new ones for non-temporal cache handling etc.
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> V12 (part3 only - parts 1,2,4 are in tip tree)
> 
> Ingo:	More meaningful names for fields of return structure
> PeterZ:	Better name for copy function: memcpy_trap()
> Ingo:	Don't use numeric labels in new asm code
> Ingo:	Consistent capitalization in comments.
> Ingo:	Periods between sentences in comments.
> 
> Not addressed: Linus' comment that perhaps we could try/catch
> syntactic sugar instead of returning multiple values in a structure.
> 
>  arch/x86/include/asm/string_64.h |  26 +++++++
>  arch/x86/kernel/x8664_ksyms_64.c |   2 +
>  arch/x86/lib/memcpy_64.S         | 158 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 186 insertions(+)
> 
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index ff8b9a17dc4b..65e5793b7590 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -78,6 +78,32 @@ int strcmp(const char *cs, const char *ct);
>  #define memset(s, c, n) __memset(s, c, n)
>  #endif
>  
> +/**
> + * struct memcpy_trap_ret - return value from memcpy_trap()
> + *
> + * @trap_nr	x86 trap number if the copy failed
> + * @bytes_left	zero for successful copy else number of bytes not copied
> + */
> +struct memcpy_trap_ret {
> +	u64 trap_nr;
> +	u64 bytes_left;
> +};
> +
> +/**
> + * memcpy_trap - copy memory with indication if a trap interrupted the copy
> + *
> + * @dst:	destination address
> + * @src:	source address
> + * @cnt:	number of bytes to copy
> + *
> + * Low level memory copy function that catches traps and indicates whether
> + * the copy succeeded and if not, why it failed.
> + *
> + * Return is struct memcpy_trap_ret which provides both the number of bytes
> + * not copied and the reason for the failure.
> + */
> +struct memcpy_trap_ret memcpy_trap(void *dst, const void __user *src, size_t cnt);
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* _ASM_X86_STRING_64_H */
> diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
> index a0695be19864..40866e2cbcc4 100644
> --- a/arch/x86/kernel/x8664_ksyms_64.c
> +++ b/arch/x86/kernel/x8664_ksyms_64.c
> @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
>  EXPORT_SYMBOL(_copy_from_user);
>  EXPORT_SYMBOL(_copy_to_user);
>  
> +EXPORT_SYMBOL_GPL(memcpy_trap);
> +
>  EXPORT_SYMBOL(copy_page);
>  EXPORT_SYMBOL(clear_page);
>  
> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> index 16698bba87de..aecdfc41c114 100644
> --- a/arch/x86/lib/memcpy_64.S
> +++ b/arch/x86/lib/memcpy_64.S
> @@ -177,3 +177,161 @@ ENTRY(memcpy_orig)
>  .Lend:
>  	retq
>  ENDPROC(memcpy_orig)
> +
> +#ifndef CONFIG_UML
> +/*
> + * memcpy_trap - memory copy with machine check exception handling
> + * Note that we only catch machine checks when reading the source addresses.
> + * Writes to target are posted and don't generate machine checks.
> + */
> +ENTRY(memcpy_trap)
> +	cmpl $8,%edx
> +	/* Less than 8 bytes? Go to byte copy loop */
> +	jb .L_no_whole_words
> +
> +	/* Check for bad alignment of source */
> +	testl $7,%esi
> +	/* Already aligned */
> +	jz .L_8byte_aligned
> +
> +	/* Copy one byte at a time until source is 8-byte aligned */
> +	movl %esi,%ecx
> +	andl $7,%ecx
> +	subl $8,%ecx
> +	negl %ecx
> +	subl %ecx,%edx
> +.L_copy_leading_bytes:
> +	movb (%rsi),%al
> +	movb %al,(%rdi)
> +	incq %rsi
> +	incq %rdi
> +	decl %ecx
> +	jnz .L_copy_leading_bytes
> +
> +.L_8byte_aligned:
> +	/* Figure out how many whole cache lines (64-bytes) to copy */
> +	movl %edx,%ecx
> +	andl $63,%edx
> +	shrl $6,%ecx
> +	jz .L_no_whole_cache_lines
> +
> +	/* Loop copying whole cache lines */
> +.L_cache_w0: movq (%rsi),%r8
> +.L_cache_w1: movq 1*8(%rsi),%r9
> +.L_cache_w2: movq 2*8(%rsi),%r10
> +.L_cache_w3: movq 3*8(%rsi),%r11
> +	movq %r8,(%rdi)
> +	movq %r9,1*8(%rdi)
> +	movq %r10,2*8(%rdi)
> +	movq %r11,3*8(%rdi)
> +.L_cache_w4: movq 4*8(%rsi),%r8
> +.L_cache_w5: movq 5*8(%rsi),%r9
> +.L_cache_w6: movq 6*8(%rsi),%r10
> +.L_cache_w7: movq 7*8(%rsi),%r11
> +	movq %r8,4*8(%rdi)
> +	movq %r9,5*8(%rdi)
> +	movq %r10,6*8(%rdi)
> +	movq %r11,7*8(%rdi)
> +	leaq 64(%rsi),%rsi
> +	leaq 64(%rdi),%rdi
> +	decl %ecx
> +	jnz .L_cache_w0
> +
> +	/* Are there any trailing 8-byte words? */
> +.L_no_whole_cache_lines:
> +	movl %edx,%ecx
> +	andl $7,%edx
> +	shrl $3,%ecx
> +	jz .L_no_whole_words
> +
> +	/* Copy trailing words */
> +.L_copy_trailing_words:
> +	movq (%rsi),%r8
> +	mov %r8,(%rdi)
> +	leaq 8(%rsi),%rsi
> +	leaq 8(%rdi),%rdi
> +	decl %ecx
> +	jnz .L_copy_trailing_words
> +
> +	/* Any trailing bytes? */
> +.L_no_whole_words:
> +	andl %edx,%edx
> +	jz .L_done_memcpy_trap
> +
> +	/* Copy trailing bytes */
> +	movl %edx,%ecx
> +.L_copy_trailing_bytes:
> +	movb (%rsi),%al
> +	movb %al,(%rdi)
> +	incq %rsi
> +	incq %rdi
> +	decl %ecx
> +	jnz .L_copy_trailing_bytes
> +
> +	/* Copy successful. Return .remain = 0, .trapnr = 0 */
> +.L_done_memcpy_trap:
> +	xorq %rax, %rax
> +	xorq %rdx, %rdx
> +	ret
> +
> +	.section .fixup,"ax"
> +	/*
> +	 * The machine check handler loaded %rax with trap number.
> +	 * We just need to make sure %edx has the number of
> +	 * bytes remaining.
> +	 */
> +.L_fix_leading_bytes:
> +	add %ecx,%edx
> +	ret
> +.L_fix_cache_w0:
> +	shl $6,%ecx
> +	add %ecx,%edx
> +	ret
> +.L_fix_cache_w1:
> +	shl $6,%ecx
> +	lea -8(%ecx,%edx),%edx
> +	ret
> +.L_fix_cache_w2:
> +	shl $6,%ecx
> +	lea -16(%ecx,%edx),%edx
> +	ret
> +.L_fix_cache_w3:
> +	shl $6,%ecx
> +	lea -24(%ecx,%edx),%edx
> +	ret
> +.L_fix_cache_w4:
> +	shl $6,%ecx
> +	lea -32(%ecx,%edx),%edx
> +	ret
> +.L_fix_cache_w5:
> +	shl $6,%ecx
> +	lea -40(%ecx,%edx),%edx
> +	ret
> +.L_fix_cache_w6:
> +	shl $6,%ecx
> +	lea -48(%ecx,%edx),%edx
> +	ret
> +.L_fix_cache_w7:
> +	shl $6,%ecx
> +	lea -56(%ecx,%edx),%edx
> +	ret
> +.L_fix_trailing_words:
> +	lea (%rdx,%rcx,8),%rdx
> +	ret
> +.L_fix_trailing_bytes:
> +	mov %ecx,%edx
> +	ret
> +	.previous
> +
> +	_ASM_EXTABLE_FAULT(.L_copy_leading_bytes,.L_fix_leading_bytes)
> +	_ASM_EXTABLE_FAULT(.L_cache_w0,.L_fix_cache_w0)
> +	_ASM_EXTABLE_FAULT(.L_cache_w1,.L_fix_cache_w1)
> +	_ASM_EXTABLE_FAULT(.L_cache_w3,.L_fix_cache_w2)
> +	_ASM_EXTABLE_FAULT(.L_cache_w3,.L_fix_cache_w3)
> +	_ASM_EXTABLE_FAULT(.L_cache_w4,.L_fix_cache_w4)
> +	_ASM_EXTABLE_FAULT(.L_cache_w5,.L_fix_cache_w5)
> +	_ASM_EXTABLE_FAULT(.L_cache_w6,.L_fix_cache_w6)
> +	_ASM_EXTABLE_FAULT(.L_cache_w7,.L_fix_cache_w7)
> +	_ASM_EXTABLE_FAULT(.L_copy_trailing_words,.L_fix_trailing_words)
> +	_ASM_EXTABLE_FAULT(.L_copy_trailing_bytes,.L_fix_trailing_bytes)
> +#endif

Ok, I absolutely love this assembly code, it's already a lot easier to read than 
95% of the x86 assembly code we have today!

There's two minor things I've noticed:

1) please put a space between instruction operands, i.e.:

-	shl $6,%ecx
+	shl $6, %ecx

2)

There's a way to make the exception fixup stubs more readable, by aligning them 
vertically, via something like:

.L_fix_cache_w0:	shl $6, %ecx; add %ecx, %edx;		ret
.L_fix_cache_w1:	shl $6, %ecx; lea  -8(%ecx,%edx), %edx;	ret
.L_fix_cache_w2:	shl $6, %ecx; lea -16(%ecx,%edx), %edx;	ret
.L_fix_cache_w3:	shl $6, %ecx; lea -24(%ecx,%edx), %edx; ret
.L_fix_cache_w4:	shl $6, %ecx; lea -32(%ecx,%edx), %edx; ret
.L_fix_cache_w5:	shl $6, %ecx; lea -40(%ecx,%edx), %edx; ret
.L_fix_cache_w6:	shl $6, %ecx; lea -48(%ecx,%edx), %edx; ret
.L_fix_cache_w7:	shl $6, %ecx; lea -56(%ecx,%edx), %edx; ret

this also makes it a lot easier to check the correctness of the fixup stubs. Also 
this layout makes it clear that the first fixup stub uses 'ADD', while the others 
use LEA, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v11 3/4] x86, mce: Add __mcsafe_copy()
  2016-02-19  8:43         ` Peter Zijlstra
@ 2016-02-19  9:51           ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2016-02-19  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Luck, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Feb 19, 2016 at 08:58:43AM +0100, Ingo Molnar wrote:
> > > I prefer to use my modern console width to display multiple columns of text, 
> > > instead of wasting it to display mostly whitespace. Therefore I still very much 
> > > prefer ~80 char wide code.
> > 
> > Btw., the main reason I hate the col80 limit is that I see such patches 
> > frequently:
> > 
> >  void pcibios_add_bus(struct pci_bus *bus)
> >  {
> > +#ifdef CONFIG_DMI
> > +       const struct dmi_device *dmi;
> > +       struct dmi_dev_onboard *dslot;
> > +       char sname[128];
> > +
> > +       dmi = NULL;
> > +       while ((dmi = dmi_find_device(DMI_DEV_TYPE_DEV_SLOT,
> > +                                     NULL, dmi)) != NULL) {
> > +               dslot = dmi->device_data;
> > +               if (dslot->segment == pci_domain_nr(bus) &&
> > +                   dslot->bus == bus->number) {
> > +                       dev_info(&bus->dev, "Found SMBIOS Slot %s\n",
> > +                                dslot->dev.name);
> > +                       snprintf(sname, sizeof(sname), "%s-%d",
> > +                                dslot->dev.name,
> > +                                dslot->instance);
> > +                       pci_create_slot(bus, dslot->devfn,
> > +                                       sname, NULL);
> > +               }
> > +       }
> > +#endif
> >         acpi_pci_add_bus(bus);
> > 
> > Which gobbledygook has 6 (!) col80 artifacts - and it's a pretty straightforward 
> > piece of code with just 2 levels of indentation.
> > 
> > It is IMHO much more readable in the following form:
> > 
> >  void pcibios_add_bus(struct pci_bus *bus)
> >  {
> >  #ifdef CONFIG_DMI
> >         const struct dmi_device *dmi;
> >         struct dmi_dev_onboard *dslot;
> >         char sname[128];
> > 
> >         dmi = NULL;
> >         while ((dmi = dmi_find_device(DMI_DEV_TYPE_DEV_SLOT, NULL, dmi)) != NULL) {
> >                 dslot = dmi->device_data;
> >                 if (dslot->segment == pci_domain_nr(bus) && dslot->bus == bus->number) {
> >                         dev_info(&bus->dev, "Found SMBIOS Slot %s\n", dslot->dev.name);
> >                         snprintf(sname, sizeof(sname), "%s-%d", dslot->dev.name, dslot->instance);
> >                         pci_create_slot(bus, dslot->devfn, sname, NULL);
> >                 }
> >         }
> >  #endif
> >         acpi_pci_add_bus(bus);
> > 
> > BYMMV.
> 
> So I mostly agree, although your example does wrap on my normal display
> width. The thing is though, we have to have a limit, otherwise people
> will completely let loose and we'll end up with lines >200 chars wide
> (I've worked on code like that in the past, and its a right pain).

What I'm arguing for is to be, on average, _stricter_ than col80, but not use the 
absolute width as a metric.

Obviusly we have to have limits (to have a consistent coding style) - but I think 
it should be the level of indentation/nesting that should be the limit and the 
number of arguments to a function, while the absolute character count should be 
relaxed in certain cases (and should be made more strict in others!), such as 
printks and other 'leaf' functionality that has no primary side effects.

I'd be fine with only allowing up to 2-3 levels of nesting in typical code 
situations, and not having silly long names. I'd also maximize function arguments 
at about ~4 parameters for the typical case - anything longer should probably 
organize the parameters into helper structures.

But yeah, I can see the pragmatic power of a 'simple' guideline, such as col80.

> And 80 has so far mostly worked just fine. Its just that people seem
> unable to take guidelines as just than, and instead produce the most
> horrible code just to adhere to checkpatch or whatnot.

I think it should be made _stricter_ in many cases.

I.e. col80 is too easy to work around and is routinely worked around in 80% of the 
patches I get.

> And I'd much rather have an extra column of code than waste a lot of
> screen-estate to display mostly whitespace.

So I think that with my proposed rule we'd mostly have much shorter code than 80 
colums, with a few cases (such as printk() lines) where _you_ could easily accept 
automatic, editor generated line wraps instead of forcing the ugliness of manual 
line breaks on everyone else...

The fact is that in almost every patch I receieve these days I see col80 artifacts 
that could be avoided with better code structure - or that would look better if 
they were not manually line-broken.

I.e. I don't so much argue in favor of making lines longer than 80 cols, I argue 
against 'col80 line breaks' that are an easy workaround around the col80 rule.

> Also, this 'artificial' limit on indentation level does in general encourage 
> people to write saner code.

But that's not what we have - col80 is in fact too permissive when it comes to 
actual code complexity!

Let me give a random example - took me 20 seconds to find in kernel/*.c: 
kernel/cgroup.c's cgroup_subtree_control_write():

 - the function is way too big with 230+ lines - it should be split into 2-4
   helper functions.

 - the deepest C indentation it has is too much: 4 levels

 - the function has 11+ 'col80 artifacts' that I counted

 - the function has similar looking code patterns repeated over it

 - the control flow is messy at places - goto patterns mixed with open coded 
   unlock sequences.

 - some logic is completely undocumented - for example can you tell at a glance 
   what the first for_each_subsys() loop does? If it was in a helper function it 
   would be self-documenting to a large degree.

and it's a piece of code that is completely col80 compliant while it has obvious 
problems.

Most of the problems in this function would go away with two relatively simple 
rules, which are in fact stricter than col80 limits:

 - not going over 3 levels deep of nesting.

 - not allowing 'manual line breaks' for non-trivial functionality. This rule
   would flag ugly pieces of code like:

                cgroup_for_each_live_child(child, cgrp) {
                        if (css_enable & (1 << ssid))
                                ret = create_css(child, ss,
                                        cgrp->subtree_control & (1 << ssid));
                        else
                                ret = css_populate_dir(cgroup_css(child, ss),
                                                       NULL);
                        if (ret)
                                goto err_undo_css;
                }

these two rules would IMHO automatically limit the complexity of many functions 
that are too complex today.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-19  9:10       ` Ingo Molnar
@ 2016-02-19 17:53         ` Luck, Tony
  2016-02-24 17:38           ` Tony Luck
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-02-19 17:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Borislav Petkov,
	Linus Torvalds, Andrew Morton, Peter Zijlstra

Make use of the EXTABLE_FAULT exception table entries. This routine
returns a structure to indicate the result of the copy:

struct mcsafe_ret {
        u64 trap_nr;
        u64 bytes_left;
};

If the copy is successful, then both 'trap_nr' and 'bytes_left' are zero.

If we faulted during the copy, then 'trap_nr' will say which type
of trap (X86_TRAP_PF or X86_TRAP_MC) and 'bytes_left' says how many
bytes were not copied.

Note that this is probably the first of several copy functions.
We can make new ones for non-temporal cache handling etc.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
V12-V13
Ingo: Separate instruction arguments with a ", "
	Note that I didn't add spaces after "," within an argument.
	E.g. "lea (%rdx,%rcx,8), %rdx"
	Did you want them there too? I don't think they help as much there.
Ingo: More readable layout for fixup stubs

 arch/x86/include/asm/string_64.h |  26 ++++++++
 arch/x86/kernel/x8664_ksyms_64.c |   2 +
 arch/x86/lib/memcpy_64.S         | 128 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 156 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ff8b9a17dc4b..65e5793b7590 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -78,6 +78,32 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+/**
+ * struct memcpy_trap_ret - return value from memcpy_trap()
+ *
+ * @trap_nr	x86 trap number if the copy failed
+ * @bytes_left	zero for successful copy else number of bytes not copied
+ */
+struct memcpy_trap_ret {
+	u64 trap_nr;
+	u64 bytes_left;
+};
+
+/**
+ * memcpy_trap - copy memory with indication if a trap interrupted the copy
+ *
+ * @dst:	destination address
+ * @src:	source address
+ * @cnt:	number of bytes to copy
+ *
+ * Low level memory copy function that catches traps and indicates whether
+ * the copy succeeded and if not, why it failed.
+ *
+ * Return is struct memcpy_trap_ret which provides both the number of bytes
+ * not copied and the reason for the failure.
+ */
+struct memcpy_trap_ret memcpy_trap(void *dst, const void __user *src, size_t cnt);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index a0695be19864..40866e2cbcc4 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
 EXPORT_SYMBOL(_copy_from_user);
 EXPORT_SYMBOL(_copy_to_user);
 
+EXPORT_SYMBOL_GPL(memcpy_trap);
+
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 16698bba87de..b8dccda0575d 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -177,3 +177,131 @@ ENTRY(memcpy_orig)
 .Lend:
 	retq
 ENDPROC(memcpy_orig)
+
+#ifndef CONFIG_UML
+/*
+ * memcpy_trap - memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ */
+ENTRY(memcpy_trap)
+	cmpl $8, %edx
+	/* Less than 8 bytes? Go to byte copy loop */
+	jb .L_no_whole_words
+
+	/* Check for bad alignment of source */
+	testl $7, %esi
+	/* Already aligned */
+	jz .L_8byte_aligned
+
+	/* Copy one byte at a time until source is 8-byte aligned */
+	movl %esi, %ecx
+	andl $7, %ecx
+	subl $8, %ecx
+	negl %ecx
+	subl %ecx, %edx
+.L_copy_leading_bytes:
+	movb (%rsi), %al
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_leading_bytes
+
+.L_8byte_aligned:
+	/* Figure out how many whole cache lines (64-bytes) to copy */
+	movl %edx, %ecx
+	andl $63, %edx
+	shrl $6, %ecx
+	jz .L_no_whole_cache_lines
+
+	/* Loop copying whole cache lines */
+.L_cache_w0: movq (%rsi), %r8
+.L_cache_w1: movq 1*8(%rsi), %r9
+.L_cache_w2: movq 2*8(%rsi), %r10
+.L_cache_w3: movq 3*8(%rsi), %r11
+	movq %r8, (%rdi)
+	movq %r9, 1*8(%rdi)
+	movq %r10, 2*8(%rdi)
+	movq %r11, 3*8(%rdi)
+.L_cache_w4: movq 4*8(%rsi), %r8
+.L_cache_w5: movq 5*8(%rsi), %r9
+.L_cache_w6: movq 6*8(%rsi), %r10
+.L_cache_w7: movq 7*8(%rsi), %r11
+	movq %r8, 4*8(%rdi)
+	movq %r9, 5*8(%rdi)
+	movq %r10, 6*8(%rdi)
+	movq %r11, 7*8(%rdi)
+	leaq 64(%rsi), %rsi
+	leaq 64(%rdi), %rdi
+	decl %ecx
+	jnz .L_cache_w0
+
+	/* Are there any trailing 8-byte words? */
+.L_no_whole_cache_lines:
+	movl %edx, %ecx
+	andl $7, %edx
+	shrl $3, %ecx
+	jz .L_no_whole_words
+
+	/* Copy trailing words */
+.L_copy_trailing_words:
+	movq (%rsi), %r8
+	mov %r8, (%rdi)
+	leaq 8(%rsi), %rsi
+	leaq 8(%rdi), %rdi
+	decl %ecx
+	jnz .L_copy_trailing_words
+
+	/* Any trailing bytes? */
+.L_no_whole_words:
+	andl %edx, %edx
+	jz .L_done_memcpy_trap
+
+	/* Copy trailing bytes */
+	movl %edx, %ecx
+.L_copy_trailing_bytes:
+	movb (%rsi), %al
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_trailing_bytes
+
+	/* Copy successful. Return .remain = 0, .trapnr = 0 */
+.L_done_memcpy_trap:
+	xorq %rax, %rax
+	xorq %rdx, %rdx
+	ret
+
+	.section .fixup, "ax"
+	/*
+	 * The machine check handler loaded %rax with trap number.
+	 * We just need to make sure %edx has the number of
+	 * bytes remaining.
+	 */
+.L_fix_leading_bytes:	add %ecx, %edx; ret
+.L_fix_cache_w0:	shl $6, %ecx; add %ecx, %edx; ret
+.L_fix_cache_w1:	shl $6, %ecx; lea -8(%ecx,%edx), %edx; ret
+.L_fix_cache_w2:	shl $6, %ecx; lea -16(%ecx,%edx), %edx; ret
+.L_fix_cache_w3:	shl $6, %ecx; lea -24(%ecx,%edx), %edx; ret
+.L_fix_cache_w4:	shl $6, %ecx; lea -32(%ecx,%edx), %edx; ret
+.L_fix_cache_w5:	shl $6, %ecx; lea -40(%ecx,%edx), %edx; ret
+.L_fix_cache_w6:	shl $6, %ecx; lea -48(%ecx,%edx), %edx; ret
+.L_fix_cache_w7:	shl $6, %ecx; lea -56(%ecx,%edx), %edx; ret
+.L_fix_trailing_words:	lea (%rdx,%rcx,8), %rdx; ret
+.L_fix_trailing_bytes:	mov %ecx, %edx; ret
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_copy_leading_bytes, .L_fix_leading_bytes)
+	_ASM_EXTABLE_FAULT(.L_cache_w0, .L_fix_cache_w0)
+	_ASM_EXTABLE_FAULT(.L_cache_w1, .L_fix_cache_w1)
+	_ASM_EXTABLE_FAULT(.L_cache_w3, .L_fix_cache_w2)
+	_ASM_EXTABLE_FAULT(.L_cache_w3, .L_fix_cache_w3)
+	_ASM_EXTABLE_FAULT(.L_cache_w4, .L_fix_cache_w4)
+	_ASM_EXTABLE_FAULT(.L_cache_w5, .L_fix_cache_w5)
+	_ASM_EXTABLE_FAULT(.L_cache_w6, .L_fix_cache_w6)
+	_ASM_EXTABLE_FAULT(.L_cache_w7, .L_fix_cache_w7)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_words, .L_fix_trailing_words)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_bytes, .L_fix_trailing_bytes)
+#endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-19 17:53         ` [PATCH v13] " Luck, Tony
@ 2016-02-24 17:38           ` Tony Luck
  2016-02-24 18:35             ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2016-02-24 17:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Linus Torvalds, Andrew Morton, Peter Zijlstra

On Fri, Feb 19, 2016 at 9:53 AM, Luck, Tony <tony.luck@intel.com> wrote:
> Make use of the EXTABLE_FAULT exception table entries. This routine
> returns a structure to indicate the result of the copy:
>
> struct mcsafe_ret {
>         u64 trap_nr;
>         u64 bytes_left;
> };
>
> If the copy is successful, then both 'trap_nr' and 'bytes_left' are zero.
>
> If we faulted during the copy, then 'trap_nr' will say which type
> of trap (X86_TRAP_PF or X86_TRAP_MC) and 'bytes_left' says how many
> bytes were not copied.
>
> Note that this is probably the first of several copy functions.
> We can make new ones for non-temporal cache handling etc.
>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> V12-V13
> Ingo: Separate instruction arguments with a ", "
>         Note that I didn't add spaces after "," within an argument.
>         E.g. "lea (%rdx,%rcx,8), %rdx"
>         Did you want them there too? I don't think they help as much there.
> Ingo: More readable layout for fixup stubs
>
>  arch/x86/include/asm/string_64.h |  26 ++++++++
>  arch/x86/kernel/x8664_ksyms_64.c |   2 +
>  arch/x86/lib/memcpy_64.S         | 128 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 156 insertions(+)

Where do we stand with this?  The followup discussion dropped LKML at some point
in the thread ... so here is the summary to bring the archive up to date:

1) Dan Williams doesn't really care about getting the bytes_left
value. A simple succeed/fail code would work for him.

2) But if we want to use this for copy_from_user() as part of the
write(2) call stack (and I *do* want to do that), then there are some
POSIX corner cases that say that if the middle of a buffer supplied by
the user is invalid we should write bytes up to that point to the file
and return a short, but accurate, byte count rather than -EFAULT

3) Linus was concerned that we would not be able to get a precise
bytes_left value when using the "rep mov" x86ism because it might be
copying in a weird order (even backwards) for speed reasons. But the
Intel architects pointed to the SDM volume 2 "REP" description which
makes it clear that whatever shenanigans might be happening behind the
scenes, if the "rep" is interrupted by a trap or fault the
architectural view will be that all bytes up to the point of the fault
will have been copied, no bytes beyond that point will have been
copied (in flight writes will be dropped). rdi/rsi/ecx registers will
all have been updated to the point of the fault (so we somehow fixed
the reason for the machine check we'l be able to *continue* the copy
from the point where it faulted).

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-24 17:38           ` Tony Luck
@ 2016-02-24 18:35             ` Linus Torvalds
  2016-02-24 19:27               ` Tony Luck
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2016-02-24 18:35 UTC (permalink / raw)
  To: Tony Luck
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Morton, H. Peter Anvin

On Feb 24, 2016 09:38, "Tony Luck" <tony.luck@gmail.com> wrote:
>
> 2) But if we want to use this for copy_from_user() as part of the
> write(2) call stack (and I *do* want to do that)

I don't think that is even remotely an option.

If doing a rep movs can cause the chip to crash, we cannot possibly
allow user space to map these pages at all. User space could just do
"rep movs" on its own, intentionally or by mistake.

Put another way: if we cannot use the regular copy_from_user(), then
the whole concept is already dead, dead, dead.

I would suggest you instead just make regular "copy_from_user()" work
with MCE's (which may well mean "working with hardware people to make
sure it isn't a machine-killing experience").

Think of it this way: if a regular copy_from_user() doesn't work on
the memory, there's no way in hell we can allow user space to map it
anyway.

And once a regular copy_from_user() does work on it, there is no
longer any possible advantage to "memcpy_trap()".

In other words, in no situation does it make sense to make
memcpy_trap() work on user addresses. It really is that simple.

                 Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-24 18:35             ` Linus Torvalds
@ 2016-02-24 19:27               ` Tony Luck
  2016-02-24 19:37                 ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2016-02-24 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Morton, H. Peter Anvin

On Wed, Feb 24, 2016 at 10:35 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Feb 24, 2016 09:38, "Tony Luck" <tony.luck@gmail.com> wrote:
>>
>> 2) But if we want to use this for copy_from_user() as part of the
>> write(2) call stack (and I *do* want to do that)
>
> Think of it this way: if a regular copy_from_user() doesn't work on
> the memory, there's no way in hell we can allow user space to map it
> anyway.

I see I have caused confusion by talking about Dan's NVDIMM case and
copy_from_user() in the same breath,

This isn't just about NVDIMMs. It is about uncorrected errors in any
type of memory.

The copy_from_user() case I'd like to fix is when there is an
uncorrected in memory. This can happen to regular DDR3/DDR4 memory
just as it can happen to NVDIMM. When a user process directly reads
the location with the uncorrected error the h/w triggers a machine
check and if the error is marked as recoverable the kernel will SIGBUS
the process. See mm/memory-failure.c

We do have an issue in current processor implementations that access
using rep mov generates a fatal machine check. This is a gap and will
at some point be fixed.

Currently if the user passes the address of the location with the
uncorrected error to the kernel via a system call like write(2) the
kernel will do the access, and we will crash.  I'd like to fix that
and SIGBUS the user, just like would happen if they touched the memory
themselves.To do that the copy_from_user() needs to be able to tell
that there was a machine check (and probably take action inline, as
fixing every place that we call copy_from_user() sounds like a silly
idea).

Maybe this won't be ready for inclusion in the kernel until rep mov
generates recoverable machine checks ... otherwise we have a bigger
heap of possible copy routines to choose from as we'd have to look not
only an the speed of various copy algorithms, but also at the behavior
during exceptions ... in particular users may have to choose between
speed (rep mov) and recoverability on current generation cpus (and
given how rare uncorrected errors are, they might all choose speed).

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-24 19:27               ` Tony Luck
@ 2016-02-24 19:37                 ` Linus Torvalds
  2016-02-25  8:56                   ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2016-02-24 19:37 UTC (permalink / raw)
  To: Tony Luck
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Morton, H. Peter Anvin

On Wed, Feb 24, 2016 at 11:27 AM, Tony Luck <tony.luck@gmail.com> wrote:
>
> This isn't just about NVDIMMs. It is about uncorrected errors in any
> type of memory.
>
> The copy_from_user() case I'd like to fix is when there is an
> uncorrected in memory. This can happen to regular DDR3/DDR4 memory
> just as it can happen to NVDIMM. When a user process directly reads
> the location with the uncorrected error the h/w triggers a machine
> check and if the error is marked as recoverable the kernel will SIGBUS
> the process. See mm/memory-failure.c

But the way to fix that is absolutely *not* to introduce some new concept.

You'd just extend the _existing_ get_user() and put_user() functions
to return a different error code for a memory error. Instead of
-EFAULT, maybe they can return -EMEMERR or something.

I do not see how it could possibly ever make sense to introduce a new
name and then make old users use that new name instead.

                        Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-24 19:37                 ` Linus Torvalds
@ 2016-02-25  8:56                   ` Ingo Molnar
  2016-02-25 19:33                     ` Luck, Tony
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-02-25  8:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tony Luck, Thomas Gleixner, Peter Zijlstra, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Morton, H. Peter Anvin,
	Andy Lutomirski


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Feb 24, 2016 at 11:27 AM, Tony Luck <tony.luck@gmail.com> wrote:
> >
> > This isn't just about NVDIMMs. It is about uncorrected errors in any
> > type of memory.
> >
> > The copy_from_user() case I'd like to fix is when there is an uncorrected in 
> > memory. This can happen to regular DDR3/DDR4 memory just as it can happen to 
> > NVDIMM. When a user process directly reads the location with the uncorrected 
> > error the h/w triggers a machine check and if the error is marked as 
> > recoverable the kernel will SIGBUS the process. See mm/memory-failure.c
> 
> But the way to fix that is absolutely *not* to introduce some new concept.
> 
> You'd just extend the _existing_ get_user() and put_user() functions to return a 
> different error code for a memory error. Instead of -EFAULT, maybe they can 
> return -EMEMERR or something.
> 
> I do not see how it could possibly ever make sense to introduce a new name and 
> then make old users use that new name instead.

So if we do that we should first fix these hard coded assumptions about uaccess 
return codes:

  triton:~/tip> git grep -E '== -EFAULT' | wc -l
  44

But yes, your suggestion sounds a lot cleaner and a lot more powerful overall.

AFAICS we'll still need this exception table extension commit:

  548acf19234d x86/mm: Expand the exception table logic to allow new handling options

to propagate the error code to actual uaccess methods, right?

Alternatively we could change the exception handling protocol and change _all_ 
uaccess methods all at once, but I think that would be crazy complex and risky. 
I'd rather change key uaccess methods step by step.

So if everyone agrees then I'll keep this commit queued up. The followup patches 
will have to be reworked according to Linus's suggestions.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-25  8:56                   ` Ingo Molnar
@ 2016-02-25 19:33                     ` Luck, Tony
  2016-02-25 20:39                       ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-02-25 19:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	Borislav Petkov, Linux Kernel Mailing List, Andrew Morton,
	H. Peter Anvin, Andy Lutomirski

For reference below is what I'd hoped to be able to do with
copy_from_user() [obviously needs to not just replace that
ALTERNATIVE_2 setup in _copy_from_user ... would have to invent
an ALTERNATIVE_3 to pick the new function for people willing
to sacrifice speed for recoverability]

BUT ... there are a maze of twisty little other places that
also need to be fixed:

1) There is iov_iter_fault_in_readable() that uses fault_in_pages_readable()
   to pre-fault the first (and last) page using a raw __get_user() call.
2) I tried to avoid that by injecting my error to the middle of a page,
   but still ended up in a "rep mov" copy function instead of mine. Not
   sure which one because no output from the core that died. :-(

-Tony

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 982ce34f4a9b..e31d8964ac09 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -38,11 +38,15 @@ ENTRY(_copy_from_user)
 	jc bad_from_user
 	cmpq TI_addr_limit(%rax),%rcx
 	ja bad_from_user
+#if 1
+	jmp copy_from_user_with_mce_check
+#else
 	ALTERNATIVE_2 "jmp copy_user_generic_unrolled",		\
 		      "jmp copy_user_generic_string",		\
 		      X86_FEATURE_REP_GOOD,			\
 		      "jmp copy_user_enhanced_fast_string",	\
 		      X86_FEATURE_ERMS
+#endif
 ENDPROC(_copy_from_user)
 
 	.section .fixup,"ax"
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0a42327a59d7..c377a70474e0 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -6,6 +6,8 @@
  * Copyright 2002 Andi Kleen <ak@suse.de>
  */
 #include <linux/module.h>
+#include <linux/mm.h>
+#include <asm/traps.h>
 #include <asm/uaccess.h>
 
 /*
@@ -86,3 +88,25 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
 		memset(to, 0, len);
 	return len;
 }
+
+__visible unsigned long
+copy_from_user_with_mce_check(void *to, const void __user *from, unsigned n)
+{
+	struct memcpy_trap_ret r;
+
+	stac();
+	r = memcpy_trap(to, (__force void *)from, n);
+	clac();
+
+	if (likely(r.bytes_left == 0))
+		return 0;
+
+	if (r.trap_nr == X86_TRAP_MC) {
+		volatile void *fault_addr = (volatile void *)from + n - r.bytes_left;
+		phys_addr_t p = virt_to_phys(fault_addr);
+
+		memory_failure(p >> PAGE_SHIFT, MCE_VECTOR, 0);
+	}
+
+	return r.bytes_left;
+}

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-25 19:33                     ` Luck, Tony
@ 2016-02-25 20:39                       ` Linus Torvalds
  2016-02-25 22:11                         ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2016-02-25 20:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Ingo Molnar, Tony Luck, Thomas Gleixner, Peter Zijlstra,
	Borislav Petkov, Linux Kernel Mailing List, Andrew Morton,
	H. Peter Anvin, Andy Lutomirski

On Thu, Feb 25, 2016 at 11:33 AM, Luck, Tony <tony.luck@intel.com> wrote:
> For reference below is what I'd hoped to be able to do with
> copy_from_user() [obviously needs to not just replace that
> ALTERNATIVE_2 setup in _copy_from_user ... would have to invent
> an ALTERNATIVE_3 to pick the new function for people willing
> to sacrifice speed for recoverability]

Why?

I really don't see why you are so obsessed with replacing our current thing.

Just replace the error number in the exception path slow handling. That's *all*.

This should all be entirely about just the exception itself doing that
memory failure accounting, and the actual copying code shouldn't be
different or care about things AT ALL outside of the fact that we need
to replace the -EFAULT with something else.

I will keep on NAK'ing these insane "let's replace the copy code"
patches. I'll do it forever if that is what it takes.

And no, if the issue is that you want to replace "rep movs", then I
will _still_ keep NAK'ing them.

Because if the hardware gets it wrong, then there is absolutely no
point in us special-casing one of the _last_ common memory operations
in the whole system.

The user copying is literally a drop in the ocean. User space does a
lot more memcpy() on its own than we will *ever* copy data from user
space.

Now, if the patch starts looking more like

    -       return -EFAULT;
    +       return memory_access_fault():

where "memory_access_fault()" might do something like
"current_thread_info()->trap_nr == X86_TRAP_MC ? -EBADMEM : -EFAULT",
now *then* we'll be talking.

But doing things like

+       if (r.trap_nr == X86_TRAP_MC) {
+               volatile void *fault_addr = (volatile void *)from + n
- r.bytes_left;
+               phys_addr_t p = virt_to_phys(fault_addr);
+
+               memory_failure(p >> PAGE_SHIFT, MCE_VECTOR, 0);
+       }

in the copying code is insane, because dammit, that should be done by
the codethat sets X86_TRAP_MC in the first place.

And if there is hardware that raises a machine check without actually
telling you why - including the address - then it's laugable to talk
about "recoverability" and "hardening" and things like that. Then the
hardware is just broken.

                      Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-25 20:39                       ` Linus Torvalds
@ 2016-02-25 22:11                         ` Andy Lutomirski
  2016-02-18 19:47                           ` [PATCH v14] x86, mce: Add memcpy_mcsafe() Tony Luck
  2016-02-26  0:58                           ` [PATCH v13] x86, mce: Add memcpy_trap() Linus Torvalds
  0 siblings, 2 replies; 52+ messages in thread
From: Andy Lutomirski @ 2016-02-25 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Tony Luck, Borislav Petkov, Andrew Morton,
	Tony Luck, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, H. Peter Anvin

On Feb 25, 2016 12:39 PM, "Linus Torvalds"
<torvalds@linux-foundation.org> wrote:
>
> But doing things like
>
> +       if (r.trap_nr == X86_TRAP_MC) {
> +               volatile void *fault_addr = (volatile void *)from + n
> - r.bytes_left;
> +               phys_addr_t p = virt_to_phys(fault_addr);
> +
> +               memory_failure(p >> PAGE_SHIFT, MCE_VECTOR, 0);
> +       }
>
> in the copying code is insane, because dammit, that should be done by
> the codethat sets X86_TRAP_MC in the first place.

Impossible as such, I think :(

do_machine_check uses IST, the memory failure code can sleep, and you
can't sleep in IST context.  There's a special escape that lets
memory_failure sleep *if* it came from user mode.

Here's the solution I'd prefer.  Change all the copy string to/from
user code to use the new enhanced fixup code.  Have the new fixup
handler (which can be a short C function!) fix up regs->ip to point to
copy_user_handle_tail and add a new parameter to copy_user_handle_tail
indicating the fault type.  Then put whatever fixup logic is needed in
copy_user_handle_tail -- it knows the failing address (obviously), and
it's running in process context with interrupts on (unless we're in a
pagefault_disable section), and it can do whatever it needs to do.

Linus, it's kind of like yours, except with the trap info explicitly
passed to the fixup handler instead of having the fixup handler fish
it out of some per-thread structure.

Here are different some ideas I don't like.:

1. The machine check does an IPI-to-self and the failure code runs in
IRQ context.

2. The machine check code rewrites the return stack to inject a
function call.  I don't love this.

3. Drop the idea of sending an immediate sigbus and do it with
task_work.  Maybe this is bad for some reason other than code
messiness.

4. Change the entry code so machine check runs on the normal stack if
it hits with IRQs on.

>

> And if there is hardware that raises a machine check without actually
> telling you why - including the address - then it's laugable to talk
> about "recoverability" and "hardening" and things like that. Then the
> hardware is just broken.
>
>                       Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-25 22:11                         ` Andy Lutomirski
  2016-02-18 19:47                           ` [PATCH v14] x86, mce: Add memcpy_mcsafe() Tony Luck
@ 2016-02-26  0:58                           ` Linus Torvalds
  2016-02-26  1:19                             ` Andy Lutomirski
  1 sibling, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2016-02-26  0:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Tony Luck, Borislav Petkov, Andrew Morton,
	Tony Luck, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, H. Peter Anvin

On Thu, Feb 25, 2016 at 2:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> do_machine_check uses IST, the memory failure code can sleep, and you
> can't sleep in IST context.  There's a special escape that lets
> memory_failure sleep *if* it came from user mode.

So?

Just save it away in a list (we've got the NMI-safe lists and
everything), and make sure to cause a work to be scheduled eventually
or something. People do things in NMI's, we can do this in a machine
check.

That's what we always do for any other interrupt-time thing that want
to sleep, it's not even like this is unusual or special.

But at NO point does it make sense to try to handle it from
"copy_from_user()". There are so many reason *not* to do it there, but
the most basic one is that the error doesn't necessarily happen at
copy_to_user() time in the first place. If user space has a piece of
RAM that can cause machine checks, then normal user memory access may
have caused the MC. And that's much _much_ more likely than
"copy_from_user()", which is such a rare little special case.

copy_from_user() is simply not that special. Having it do some odd
magic error handling that nobody else does is just insane.

> Here are different some ideas I don't like.:
>
> 1. The machine check does an IPI-to-self and the failure code runs in
> IRQ context.

I really think that's way too over-engineered.

> 2. The machine check code rewrites the return stack to inject a
> function call.  I don't love this.

We already *have* this, for special cases like user accesses. That's
what our exception tables are all about.

So for our bog-standard normal copy_from_user(), we already can abort
the copy in the middle. Absolutely no new infrastructure needed. No
change to copy_from_user(), no nothing.

For other cases, when we don't have an exception table entry? Do the
same thing the page fault code does. If it's in the kernel, we're
going to be in trouble. If it's in user space, send a SIGBUS.

But again, none of that has anything to do with changing the existing
copy_from_user(). The only change I see that makes sense is to have
some way to change the error code.

> 3. Drop the idea of sending an immediate sigbus and do it with
> task_work.  Maybe this is bad for some reason other than code
> messiness.

For copy_from_user(), we get the "immediate" reaction exactly thanks
to the whole exception table mechanism that that function already
uses.

For everything else, it's going to have to be delayed or killed some
way. And yes, the SIGBUS will be delayed until it gets back to user
space.

The kernel getting a machine check while it's accessing kernel data
structures? It's going to be messy. It's going to fail. Tough. That's
kind of inevitable. There is no sane way to recover. The only way to
recover is to have reliable hardware, or just killing the machine
entirely and starting again (in a data center model). In many other
situations, you're likely going to just force an oops and kill the
machine. Or maybe force an oops, and decide to just hope for the best,
and ignore the error entirely.

Because what choice do you have?

None of this is new.

> 4. Change the entry code so machine check runs on the normal stack if
> it hits with IRQs on.

I really think you concentrate too much on the small details. I don't
see why an IST couldn't use the exception handling mechanism, for
example. I don't see why an IST couldn't just add the list entry and
make some later thing happen.  We use ist's for debug traps already,
it's not like sending a signal or looking up the exception table is
fundamentally hard there, or queueing a list or whatever.

              Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-26  0:58                           ` [PATCH v13] x86, mce: Add memcpy_trap() Linus Torvalds
@ 2016-02-26  1:19                             ` Andy Lutomirski
  2016-02-26  2:38                               ` Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2016-02-26  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Tony Luck, Borislav Petkov, Andrew Morton,
	Tony Luck, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, H. Peter Anvin

On Thu, Feb 25, 2016 at 4:58 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Feb 25, 2016 at 2:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> do_machine_check uses IST, the memory failure code can sleep, and you
>> can't sleep in IST context.  There's a special escape that lets
>> memory_failure sleep *if* it came from user mode.
>
> So?
>

[...]

Then let's answer the API question instead of the implementation question.

If a user program accesses a bad virtual address directly, it gets
SIGSEGV or SIGBUS depending on the nature of the error.  This is
long-established practice.  The SIGSEGV case is programmer error and
the SIGBUS case might be an IO error.

If a user program accesses a bad virtual address by passing the
address to a syscall, it gets EFAULT.  This may be programmer error or
and underlying IO error, and the program can't tell.

If a user program accesses a bad address on an NVDIMM via mmap, what
should happen?  If mmaped NVDIMM (or other DAX space) is the same as
existing poisoned memory, the program gets SIGBUS.  This still makes
sense.

The question here: what happens if a program accesses a bad NVDIMM
address by passing a pointer to a syscall?  With Tony's patches as
written, I think the program gets SIGBUS via memory_failure.  Do we
want that behavior?  If we take your suggestion and change only the
error code, then the program will *not* get SIGBUS.  Instead it will
get -EFAULT or -ESOMETHINGELSE.  Is that okay?  If it is, then
everything is straightforward and nothing in my previous email is
relevant.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v13] x86, mce: Add memcpy_trap()
  2016-02-26  1:19                             ` Andy Lutomirski
@ 2016-02-26  2:38                               ` Linus Torvalds
  0 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2016-02-26  2:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Tony Luck, Borislav Petkov, Andrew Morton,
	Tony Luck, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, H. Peter Anvin

On Thu, Feb 25, 2016 at 5:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Then let's answer the API question instead of the implementation question.
>
> If a user program accesses a bad virtual address directly, it gets
> SIGSEGV or SIGBUS depending on the nature of the error.  This is
> long-established practice.  The SIGSEGV case is programmer error and
> the SIGBUS case might be an IO error.

Yes. I don't think there's much question there.

It would be a SIGBUS, and then we should fill in something sane in the
siginfo. So we'd have a si_code that makes sense.

We already have BUS_MCEERR_AR and BUS_MCEERR_AO ("action required" vs
"action optional"), and that together with si_addr is presumably
sufficient.

So I don't think the signal delivery has any questionable issues, it's
fairly obvious what the ABI already is.

> If a user program accesses a bad virtual address by passing the
> address to a syscall, it gets EFAULT.  This may be programmer error or
> and underlying IO error, and the program can't tell.
>
> If a user program accesses a bad address on an NVDIMM via mmap, what
> should happen?  If mmaped NVDIMM (or other DAX space) is the same as
> existing poisoned memory, the program gets SIGBUS.  This still makes
> sense.

Agreed.

> The question here: what happens if a program accesses a bad NVDIMM
> address by passing a pointer to a syscall?  With Tony's patches as
> written, I think the program gets SIGBUS via memory_failure.  Do we
> want that behavior?

I do think we want that behavior.

The traditional UNIX EFAULT behavior is an odd case, and the fact that
we do *not* send a SIGSEGV for faults that the kernel notices is
strange.

In fact, if you read POSIX, you'll notice that the standard often says
that it's implementation-defined, and the reason is that a lot of
system calls aren't "native" - they are wrappers in various libraries,
and depending on just how things work, the actual access to a bad
address may be done by the kernel (-EFAULT) or by the library wrapper
(SIGSEGV).

So the EFAULT behavior is actually pretty nasty. I don't think we
should necessarily use that as "this is how things should be done".

So I'd much rather say that "if you get a machine check on a user
address, we'll always queue a SIGBUS". Obviously, if it happens in a
system call, the system call itself will also return an error. And
also obviously, there's a limit to the queueing, so if you take
*multiple* machine checks in one system call, maybe you'd get just one
signal.

>  If we take your suggestion and change only the
> error code, then the program will *not* get SIGBUS.  Instead it will
> get -EFAULT or -ESOMETHINGELSE.  Is that okay?  If it is, then
> everything is straightforward and nothing in my previous email is
> relevant.

So I'd suggest that the *signal* be sent by the machine check handler,
and that it be done independently of the system call error we choose.

And in fact, if user mode is happy with that, it would certainly be
easier for us to just always return -EFAULT for the error, and a user
app that cares about machine checks will do all the error handling in
their signal handler.

There would be advantages to that not just for kernel simplicity: if
we start using a *new* error code, existing programs might be
confused.  But that's a pretty small advantage.

And there are certainly also advantages to coming up with a new system
call error return value.

My gut feel is that people would probably prefer something other than
EFAULT. Especially if we already have an error code that libraries
already have a reasonable error string for. The extra complexity in
the kernel to add another error case isn't _that_ big.

Of course, if users actually come and say "we don't care at all about
the error code if we always get a signal, and we want the signal
anyway in order to get the address that is bad", then we just
shouldn't bother.

So I don't really have any very strong opinions there.

My strong opinions here really seem to be limited to "I don't think it
makes sense to have a special magical copy_to/from_user()
implementation".

[ Side note: *within* the confines of purely the kernel, and
libnvdimm, I do think that something like "memcpy_fault()" absolutely
makes sense.

  So to me, "memcpy_fault()" is not a bad idea: that's a "this is
explicitly a recoverable memory copy operation within the kernel", and
that's a completely separate thing from "copy_to/from_user()" that is
also recoverable.

  What a "memcpy_fault()" (or whatever it would be called) means is
that the kernel is doing its own copies, but knows that there is some
fragility involved, and wants to have a recovery mechanism that isn't
"oops, we got a machine check in the kernel, now we need to kill the
machine".

  That "memcpy_fault()" would just use the same exception handling
mechanism to not make it a fatal error. We already do that for
possible speculative page-crossers in kernel memory for
"get_unaligned_zeropad()", for example, which is a pure in-kernel
fetch that we just know might fault.

  So it's literally just "copy_to/from_user()" that I don't think
should have special cases outside of perhaps error returns ]

                  Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-02-18 19:47                           ` [PATCH v14] x86, mce: Add memcpy_mcsafe() Tony Luck
@ 2016-03-02 20:47                             ` Luck, Tony
  2016-03-08 17:37                             ` [tip:ras/core] x86/mm, x86/mce: " tip-bot for Tony Luck
  2016-03-10 19:26                             ` [PATCH v14] x86, mce: " Mika Penttilä
  2 siblings, 0 replies; 52+ messages in thread
From: Luck, Tony @ 2016-03-02 20:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Thomas Gleixner, Borislav Petkov,
	Andrew Morton, Tony Luck, Peter Zijlstra, H. Peter Anvin,
	Andy Lutomirski

On Thu, Feb 18, 2016 at 11:47:26AM -0800, Tony Luck wrote:
> Make use of the EXTABLE_FAULT exception table entries to write
> a kernel copy routine that doesn't crash the system if it
> encounters a machine check. Prime use case for this is to copy
> from large arrays of non-volatile memory used as storage.
> 
> We have to use an unrolled copy loop for now because current
> hardware implementations treat a machine check in "rep mov"
> as fatal. When that is fixed we can simplify.

Ping.

Anything more needed for this?  In his last message Linus
seemed OK with a *kernel* copy function that avoided death
by machine check.  He said:

   What a "memcpy_fault()" (or whatever it would be called) means is
   that the kernel is doing its own copies, but knows that there is some
   fragility involved, and wants to have a recovery mechanism that isn't
   "oops, we got a machine check in the kernel, now we need to kill the
   machine".

The only things left to argue are the name, and the return value.

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [tip:ras/core] x86/mm, x86/mce: Add memcpy_mcsafe()
  2016-02-18 19:47                           ` [PATCH v14] x86, mce: Add memcpy_mcsafe() Tony Luck
  2016-03-02 20:47                             ` Luck, Tony
@ 2016-03-08 17:37                             ` tip-bot for Tony Luck
  2016-03-10 19:26                             ` [PATCH v14] x86, mce: " Mika Penttilä
  2 siblings, 0 replies; 52+ messages in thread
From: tip-bot for Tony Luck @ 2016-03-08 17:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, tony.luck, hpa, luto, bp, linux-kernel, tglx, tony.luck,
	torvalds, mingo, peterz, a.p.zijlstra

Commit-ID:  92b0729c34cab1f46d89aace3e66015f0bb4a682
Gitweb:     http://git.kernel.org/tip/92b0729c34cab1f46d89aace3e66015f0bb4a682
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Thu, 18 Feb 2016 11:47:26 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 8 Mar 2016 17:54:38 +0100

x86/mm, x86/mce: Add memcpy_mcsafe()

Make use of the EXTABLE_FAULT exception table entries to write
a kernel copy routine that doesn't crash the system if it
encounters a machine check. Prime use case for this is to copy
from large arrays of non-volatile memory used as storage.

We have to use an unrolled copy loop for now because current
hardware implementations treat a machine check in "rep mov"
as fatal. When that is fixed we can simplify.

Return type is a "bool". True means that we copied OK, false means
that it didn't.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@gmail.com>
Link: http://lkml.kernel.org/r/a44e1055efc2d2a9473307b22c91caa437aa3f8b.1456439214.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/string_64.h |  13 +++++
 arch/x86/kernel/x8664_ksyms_64.c |   2 +
 arch/x86/lib/memcpy_64.S         | 117 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ff8b9a1..ca6ba36 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -78,6 +78,19 @@ int strcmp(const char *cs, const char *ct);
 #define memset(s, c, n) __memset(s, c, n)
 #endif
 
+/**
+ * memcpy_mcsafe - copy memory with indication if a machine check happened
+ *
+ * @dst:	destination address
+ * @src:	source address
+ * @cnt:	number of bytes to copy
+ *
+ * Low level memory copy function that catches machine checks
+ *
+ * Return true for success, false for fail
+ */
+bool memcpy_mcsafe(void *dst, const void *src, size_t cnt);
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index a0695be..cd05942 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
 EXPORT_SYMBOL(_copy_from_user);
 EXPORT_SYMBOL(_copy_to_user);
 
+EXPORT_SYMBOL_GPL(memcpy_mcsafe);
+
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 16698bb..7d37641 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -177,3 +177,120 @@ ENTRY(memcpy_orig)
 .Lend:
 	retq
 ENDPROC(memcpy_orig)
+
+#ifndef CONFIG_UML
+/*
+ * memcpy_mcsafe - memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ */
+ENTRY(memcpy_mcsafe)
+	cmpl $8, %edx
+	/* Less than 8 bytes? Go to byte copy loop */
+	jb .L_no_whole_words
+
+	/* Check for bad alignment of source */
+	testl $7, %esi
+	/* Already aligned */
+	jz .L_8byte_aligned
+
+	/* Copy one byte at a time until source is 8-byte aligned */
+	movl %esi, %ecx
+	andl $7, %ecx
+	subl $8, %ecx
+	negl %ecx
+	subl %ecx, %edx
+.L_copy_leading_bytes:
+	movb (%rsi), %al
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_leading_bytes
+
+.L_8byte_aligned:
+	/* Figure out how many whole cache lines (64-bytes) to copy */
+	movl %edx, %ecx
+	andl $63, %edx
+	shrl $6, %ecx
+	jz .L_no_whole_cache_lines
+
+	/* Loop copying whole cache lines */
+.L_cache_w0: movq (%rsi), %r8
+.L_cache_w1: movq 1*8(%rsi), %r9
+.L_cache_w2: movq 2*8(%rsi), %r10
+.L_cache_w3: movq 3*8(%rsi), %r11
+	movq %r8, (%rdi)
+	movq %r9, 1*8(%rdi)
+	movq %r10, 2*8(%rdi)
+	movq %r11, 3*8(%rdi)
+.L_cache_w4: movq 4*8(%rsi), %r8
+.L_cache_w5: movq 5*8(%rsi), %r9
+.L_cache_w6: movq 6*8(%rsi), %r10
+.L_cache_w7: movq 7*8(%rsi), %r11
+	movq %r8, 4*8(%rdi)
+	movq %r9, 5*8(%rdi)
+	movq %r10, 6*8(%rdi)
+	movq %r11, 7*8(%rdi)
+	leaq 64(%rsi), %rsi
+	leaq 64(%rdi), %rdi
+	decl %ecx
+	jnz .L_cache_w0
+
+	/* Are there any trailing 8-byte words? */
+.L_no_whole_cache_lines:
+	movl %edx, %ecx
+	andl $7, %edx
+	shrl $3, %ecx
+	jz .L_no_whole_words
+
+	/* Copy trailing words */
+.L_copy_trailing_words:
+	movq (%rsi), %r8
+	mov %r8, (%rdi)
+	leaq 8(%rsi), %rsi
+	leaq 8(%rdi), %rdi
+	decl %ecx
+	jnz .L_copy_trailing_words
+
+	/* Any trailing bytes? */
+.L_no_whole_words:
+	andl %edx, %edx
+	jz .L_done_memcpy_trap
+
+	/* Copy trailing bytes */
+	movl %edx, %ecx
+.L_copy_trailing_bytes:
+	movb (%rsi), %al
+	movb %al, (%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz .L_copy_trailing_bytes
+
+	/* Copy successful. Return true */
+.L_done_memcpy_trap:
+	xorq %rax, %rax
+	ret
+ENDPROC(memcpy_mcsafe)
+
+	.section .fixup, "ax"
+	/* Return false for any failure */
+.L_memcpy_mcsafe_fail:
+	mov	$1, %rax
+	ret
+
+	.previous
+
+	_ASM_EXTABLE_FAULT(.L_copy_leading_bytes, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w0, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w1, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w3, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w3, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w4, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w5, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w6, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_cache_w7, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_words, .L_memcpy_mcsafe_fail)
+	_ASM_EXTABLE_FAULT(.L_copy_trailing_bytes, .L_memcpy_mcsafe_fail)
+#endif

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-02-18 19:47                           ` [PATCH v14] x86, mce: Add memcpy_mcsafe() Tony Luck
  2016-03-02 20:47                             ` Luck, Tony
  2016-03-08 17:37                             ` [tip:ras/core] x86/mm, x86/mce: " tip-bot for Tony Luck
@ 2016-03-10 19:26                             ` Mika Penttilä
  2016-03-10 19:37                               ` Luck, Tony
  2 siblings, 1 reply; 52+ messages in thread
From: Mika Penttilä @ 2016-03-10 19:26 UTC (permalink / raw)
  To: Tony Luck, Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Thomas Gleixner, Borislav Petkov,
	Andrew Morton, Tony Luck, Peter Zijlstra, H. Peter Anvin,
	Andy Lutomirski



On 18.02.2016 21:47, Tony Luck wrote:
> Make use of the EXTABLE_FAULT exception table entries to write
> a kernel copy routine that doesn't crash the system if it
> encounters a machine check. Prime use case for this is to copy
> from large arrays of non-volatile memory used as storage.
>
> We have to use an unrolled copy loop for now because current
> hardware implementations treat a machine check in "rep mov"
> as fatal. When that is fixed we can simplify.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>
> Is this what we want now?  Return type is a "bool". True means
> that we copied OK, false means that it didn't (this is all that
> Dan says that he needs).  Dropped all the complex code to figure
> out how many bytes we didn't copy as Linus says this isn't the
> right place to do this (and besides we should just make "rep mov"
>
> +
> +	/* Copy successful. Return true */
> +.L_done_memcpy_trap:
> +	xorq %rax, %rax
> +	ret
> +ENDPROC(memcpy_mcsafe)
> +
> +	.section .fixup, "ax"
> +	/* Return false for any failure */
> +.L_memcpy_mcsafe_fail:
> +	mov	$1, %rax
> +	ret
> +
>
But you return 0 == false for success and 1 == true for failure.

--Mika

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-03-10 19:26                             ` [PATCH v14] x86, mce: " Mika Penttilä
@ 2016-03-10 19:37                               ` Luck, Tony
  2016-03-11 22:10                                 ` Tony Luck
  0 siblings, 1 reply; 52+ messages in thread
From: Luck, Tony @ 2016-03-10 19:37 UTC (permalink / raw)
  To: Mika Penttilä, Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Thomas Gleixner, Borislav Petkov,
	Andrew Morton, Tony Luck, Peter Zijlstra, H. Peter Anvin,
	Andy Lutomirski, Williams, Dan J

> But you return 0 == false for success and 1 == true for failure.

Aaargh!  -ETOOMUCHSHELLSCRIPTPROGRAMMING

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-03-10 19:37                               ` Luck, Tony
@ 2016-03-11 22:10                                 ` Tony Luck
  2016-03-11 22:14                                   ` Dan Williams
  2016-03-12 17:16                                   ` Ingo Molnar
  0 siblings, 2 replies; 52+ messages in thread
From: Tony Luck @ 2016-03-11 22:10 UTC (permalink / raw)
  To: Mika Penttilä, Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Thomas Gleixner, Borislav Petkov,
	Andrew Morton, Tony Luck, Peter Zijlstra, H. Peter Anvin,
	Andy Lutomirski, Williams, Dan J

On Thu, Mar 10, 2016 at 11:37 AM, Luck, Tony <tony.luck@intel.com> wrote:
>> But you return 0 == false for success and 1 == true for failure.
>
> Aaargh!  -ETOOMUCHSHELLSCRIPTPROGRAMMING
>
> -Tony

Options to fix this:
1) Just change the comments in the code.
     This seems like it would confuse people as I thing most people
     would expect the "true" return to mean the copy succeeded.
2) Reverse the return values.
     Better that option 1 - but doesn't leave scope to return a count
     if some future user does want to know where the copy failed.
3) Change the return type back from "bool" to "int"
     0 == success, non-zero == fail (with option to put the non-copied
byte count in later).
4) Something else

-Tony

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-03-11 22:10                                 ` Tony Luck
@ 2016-03-11 22:14                                   ` Dan Williams
  2016-03-12 17:16                                   ` Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: Dan Williams @ 2016-03-11 22:14 UTC (permalink / raw)
  To: Tony Luck
  Cc: Mika Penttilä,
	Ingo Molnar, linux-kernel, Linus Torvalds, Thomas Gleixner,
	Borislav Petkov, Andrew Morton, Peter Zijlstra, H. Peter Anvin,
	Andy Lutomirski

On Fri, Mar 11, 2016 at 2:10 PM, Tony Luck <tony.luck@gmail.com> wrote:
> On Thu, Mar 10, 2016 at 11:37 AM, Luck, Tony <tony.luck@intel.com> wrote:
>>> But you return 0 == false for success and 1 == true for failure.
>>
>> Aaargh!  -ETOOMUCHSHELLSCRIPTPROGRAMMING
>>
>> -Tony
>
> Options to fix this:
> 1) Just change the comments in the code.
>      This seems like it would confuse people as I thing most people
>      would expect the "true" return to mean the copy succeeded.
> 2) Reverse the return values.
>      Better that option 1 - but doesn't leave scope to return a count
>      if some future user does want to know where the copy failed.
> 3) Change the return type back from "bool" to "int"
>      0 == success, non-zero == fail (with option to put the non-copied
> byte count in later).
> 4) Something else
>

You could return 0 for success and -EIO for failure (positive value
for 'location' if that support ever resurfaces).

That would at least let me drop the ternary conversion I have in
memcpy_from_pmem() [1] to convert bool to an error code.

[1]: https://patchwork.kernel.org/patch/8559521/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-03-11 22:10                                 ` Tony Luck
  2016-03-11 22:14                                   ` Dan Williams
@ 2016-03-12 17:16                                   ` Ingo Molnar
  2016-03-13  1:12                                     ` Linus Torvalds
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-03-12 17:16 UTC (permalink / raw)
  To: Tony Luck
  Cc: Mika Penttilä,
	linux-kernel, Linus Torvalds, Thomas Gleixner, Borislav Petkov,
	Andrew Morton, Peter Zijlstra, H. Peter Anvin, Andy Lutomirski,
	Williams, Dan J


* Tony Luck <tony.luck@gmail.com> wrote:

> On Thu, Mar 10, 2016 at 11:37 AM, Luck, Tony <tony.luck@intel.com> wrote:
> >> But you return 0 == false for success and 1 == true for failure.
> >
> > Aaargh!  -ETOOMUCHSHELLSCRIPTPROGRAMMING
> >
> > -Tony
> 
> Options to fix this:
> 1) Just change the comments in the code.
>      This seems like it would confuse people as I thing most people
>      would expect the "true" return to mean the copy succeeded.
> 2) Reverse the return values.
>      Better that option 1 - but doesn't leave scope to return a count
>      if some future user does want to know where the copy failed.
> 3) Change the return type back from "bool" to "int"
>      0 == success, non-zero == fail (with option to put the non-copied
> byte count in later).
> 4) Something else

Please use the copy_*_user() memory copying API semantics, which are: return 
negative code (-EFAULT) on error, 0 on success.

Don't return 1 and please don't use bools for any memory copy library 
functionality.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-03-12 17:16                                   ` Ingo Molnar
@ 2016-03-13  1:12                                     ` Linus Torvalds
  2016-03-13  9:25                                       ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2016-03-13  1:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tony Luck, Mika Penttilä,
	linux-kernel, Thomas Gleixner, Borislav Petkov, Andrew Morton,
	Peter Zijlstra, H. Peter Anvin, Andy Lutomirski, Williams, Dan J

On Sat, Mar 12, 2016 at 9:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Please use the copy_*_user() memory copying API semantics, which are: return
> negative code (-EFAULT) on error, 0 on success.

Those are the get_user/put_user() semantics.

copy_*_user() has those annoying "bytes left uncopied" return values
that I really wouldn't encourage anybody else use unless they really
have to.  There are (good) reasons why it's done that way, but it's
still not something that should be aped without the same kind of major
reasons.

            Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()
  2016-03-13  1:12                                     ` Linus Torvalds
@ 2016-03-13  9:25                                       ` Ingo Molnar
  2016-03-14 22:33                                         ` [PATCH] x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe() Tony Luck
  0 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2016-03-13  9:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tony Luck, Mika Penttilä,
	linux-kernel, Thomas Gleixner, Borislav Petkov, Andrew Morton,
	Peter Zijlstra, H. Peter Anvin, Andy Lutomirski, Williams, Dan J


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Mar 12, 2016 at 9:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Please use the copy_*_user() memory copying API semantics, which are: return
> > negative code (-EFAULT) on error, 0 on success.
> 
> Those are the get_user/put_user() semantics.
> 
> copy_*_user() has those annoying "bytes left uncopied" return values
> that I really wouldn't encourage anybody else use unless they really
> have to. [...]

Indeed, and I even grepped for it because I tend to get it wrong ... I suck!

> [...]  There are (good) reasons why it's done that way, but it's still not 
> something that should be aped without the same kind of major reasons.

Yeah, so IIRC the main reason being security and the ability to know which bits of 
a possibly uninitialized area are not yet copied on.

But with primary memcpy_mcsafe() usage being in-kernel (MM)IO space management I 
don't think we have the 'uninitialized' problem in nearly as marked a fashion, and 
we very much have the (largely uncompensated) cost of more complex assembly.

Nevertheless, wrt. error handling, the bit I feel very strongly about is the 
following:

We have the existing 'mempcy API check for failure' pattern of:

	if (put_user()) {
		... it didn't succeed ...
	}

	if (copy_from_user()) {
		... it didn't succeed ...
	}

... which should be kept IMHO:

	if (memcpy_mcsafe()) {
		... it didn't succeed ...
	}

... and what I most disagree with most is one of the suggestions in the original 
mail I replied to:

    >> 2) Reverse the return values.

    (== return 1 on success, 0 on failure)

... that's crazy IMHO and reverses the pattern of almost every memcpy-with-check 
(or syscall-success) API we have!

So I still think we should have the regular '0 on success, negative error value on 
failure' semantics - but I'd not be hugely against a '0 on success, bytes left 
uncopied on failure' variant either in principle, except that I think it unduly 
complicates the assembly side to recover the bytes.

The '0/1' current implementation is really neither of these but still fits the 
common error checking pattern, but it's still much better than the '1/0' 
suggestion which is crazy!

So I think we should move from 0/1 to 0/-EFAULT for now, and maybe (maybe) move to 
a 0/bytes_left return code in the future, if there comes a strong usecase. Most 
error checks will have to be converted in that case anyway to make use of the 
return code, so it's not like we introduce an extra cost here by going to 
0/-EFAULT.

Also, by going to 0/-EFAULT we have the option to also allow other negative error 
codes, which would allow the distinction of #MC traps from regular page fault 
traps for example. (a driver might want to log them differently.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe()
  2016-03-13  9:25                                       ` Ingo Molnar
@ 2016-03-14 22:33                                         ` Tony Luck
  2016-03-16  8:06                                           ` [tip:x86/urgent] " tip-bot for Tony Luck
  0 siblings, 1 reply; 52+ messages in thread
From: Tony Luck @ 2016-03-14 22:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, linux-kernel, Borislav Petkov,
	Andrew Morton, Peter Zijlstra, H. Peter Anvin, Andy Lutomirski,
	Dan Williams, mika.penttila

Returning a 'bool' was very unpopular. Doubly so because the
code was just wrong (returning zero for true, one for false;
great for shell programming, not so good for C).

Change return type to "int". Keep zero as the success indicator
because it matches other similar code and people may be more
comfortable writing:

	if (memcpy_mcsafe(to, from, count)) {
		printk("Sad panda, copy failed\n");
		...
	}

Make the failure return value -EFAULT for now.

Reported by: Mika Penttilä <mika.penttila@nextfour.com>
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/string_64.h | 4 ++--
 arch/x86/lib/memcpy_64.S         | 7 ++++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ca6ba3607705..90dbbd9666d4 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -87,9 +87,9 @@ int strcmp(const char *cs, const char *ct);
  *
  * Low level memory copy function that catches machine checks
  *
- * Return true for success, false for fail
+ * Return 0 for success, -EFAULT for fail
  */
-bool memcpy_mcsafe(void *dst, const void *src, size_t cnt);
+int memcpy_mcsafe(void *dst, const void *src, size_t cnt);
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index cbb8ee5830ff..2ec0b0abbfaa 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -1,6 +1,7 @@
 /* Copyright 2002 Andi Kleen */
 
 #include <linux/linkage.h>
+#include <asm/errno.h>
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 
@@ -268,16 +269,16 @@ ENTRY(memcpy_mcsafe)
 	decl %ecx
 	jnz .L_copy_trailing_bytes
 
-	/* Copy successful. Return true */
+	/* Copy successful. Return zero */
 .L_done_memcpy_trap:
 	xorq %rax, %rax
 	ret
 ENDPROC(memcpy_mcsafe)
 
 	.section .fixup, "ax"
-	/* Return false for any failure */
+	/* Return -EFAULT for any failure */
 .L_memcpy_mcsafe_fail:
-	mov	$1, %rax
+	mov	$-EFAULT, %rax
 	ret
 
 	.previous
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [tip:x86/urgent] x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe()
  2016-03-14 22:33                                         ` [PATCH] x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe() Tony Luck
@ 2016-03-16  8:06                                           ` tip-bot for Tony Luck
  0 siblings, 0 replies; 52+ messages in thread
From: tip-bot for Tony Luck @ 2016-03-16  8:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: a.p.zijlstra, bp, tglx, peterz, mingo, dan.j.williams, torvalds,
	tony.luck, linux-kernel, akpm, hpa, luto

Commit-ID:  cbf8b5a2b649a501758291cb4d4ba1e5711771ba
Gitweb:     http://git.kernel.org/tip/cbf8b5a2b649a501758291cb4d4ba1e5711771ba
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Mon, 14 Mar 2016 15:33:39 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Mar 2016 09:02:18 +0100

x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe()

Returning a 'bool' was very unpopular. Doubly so because the
code was just wrong (returning zero for true, one for false;
great for shell programming, not so good for C).

Change return type to "int". Keep zero as the success indicator
because it matches other similar code and people may be more
comfortable writing:

	if (memcpy_mcsafe(to, from, count)) {
		printk("Sad panda, copy failed\n");
		...
	}

Make the failure return value -EFAULT for now.

Reported by: Mika Penttilä <mika.penttila@nextfour.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mika.penttila@nextfour.com
Fixes: 92b0729c34ca ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Link: http://lkml.kernel.org/r/695f14233fa7a54fcac4406c706d7fec228e3f4c.1457993040.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/string_64.h | 4 ++--
 arch/x86/lib/memcpy_64.S         | 7 ++++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index ca6ba36..90dbbd9 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -87,9 +87,9 @@ int strcmp(const char *cs, const char *ct);
  *
  * Low level memory copy function that catches machine checks
  *
- * Return true for success, false for fail
+ * Return 0 for success, -EFAULT for fail
  */
-bool memcpy_mcsafe(void *dst, const void *src, size_t cnt);
+int memcpy_mcsafe(void *dst, const void *src, size_t cnt);
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index cbb8ee5..2ec0b0abb 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -1,6 +1,7 @@
 /* Copyright 2002 Andi Kleen */
 
 #include <linux/linkage.h>
+#include <asm/errno.h>
 #include <asm/cpufeatures.h>
 #include <asm/alternative-asm.h>
 
@@ -268,16 +269,16 @@ ENTRY(memcpy_mcsafe)
 	decl %ecx
 	jnz .L_copy_trailing_bytes
 
-	/* Copy successful. Return true */
+	/* Copy successful. Return zero */
 .L_done_memcpy_trap:
 	xorq %rax, %rax
 	ret
 ENDPROC(memcpy_mcsafe)
 
 	.section .fixup, "ax"
-	/* Return false for any failure */
+	/* Return -EFAULT for any failure */
 .L_memcpy_mcsafe_fail:
-	mov	$1, %rax
+	mov	$-EFAULT, %rax
 	ret
 
 	.previous

^ permalink raw reply related	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2016-03-16  8:31 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-17 18:20 [PATCH v11 0/4] Machine check recovery when kernel accesses poison Tony Luck
2016-02-17 18:20 ` [PATCH v11 1/4] x86: Expand exception table to allow new handling options Tony Luck
2016-02-18 10:19   ` [tip:ras/core] x86/mm: Expand the exception table logic " tip-bot for Tony Luck
2016-02-17 18:20 ` [PATCH v11 4/4] x86: Create a new synthetic cpu capability for machine check recovery Tony Luck
2016-02-18 10:19   ` [tip:x86/asm] x86/cpufeature: " tip-bot for Tony Luck
2016-02-17 18:20 ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Tony Luck
2016-02-18  8:21   ` Ingo Molnar
2016-02-18  9:59     ` Peter Zijlstra
2016-02-18 10:19       ` Ingo Molnar
2016-02-18 10:29         ` Borislav Petkov
2016-02-18 10:35         ` Peter Zijlstra
2016-02-18 14:59           ` Luck, Tony
2016-02-19  7:58       ` Ingo Molnar
2016-02-19  8:43         ` Peter Zijlstra
2016-02-19  9:51           ` Ingo Molnar
2016-02-18 10:29     ` Borislav Petkov
2016-02-18 10:34       ` Ingo Molnar
2016-02-18 10:36         ` Borislav Petkov
2016-02-18 18:48           ` Ingo Molnar
2016-02-18 21:14     ` [PATCH v12] x86, mce: Add memcpy_trap() Luck, Tony
2016-02-19  9:10       ` Ingo Molnar
2016-02-19 17:53         ` [PATCH v13] " Luck, Tony
2016-02-24 17:38           ` Tony Luck
2016-02-24 18:35             ` Linus Torvalds
2016-02-24 19:27               ` Tony Luck
2016-02-24 19:37                 ` Linus Torvalds
2016-02-25  8:56                   ` Ingo Molnar
2016-02-25 19:33                     ` Luck, Tony
2016-02-25 20:39                       ` Linus Torvalds
2016-02-25 22:11                         ` Andy Lutomirski
2016-02-18 19:47                           ` [PATCH v14] x86, mce: Add memcpy_mcsafe() Tony Luck
2016-03-02 20:47                             ` Luck, Tony
2016-03-08 17:37                             ` [tip:ras/core] x86/mm, x86/mce: " tip-bot for Tony Luck
2016-03-10 19:26                             ` [PATCH v14] x86, mce: " Mika Penttilä
2016-03-10 19:37                               ` Luck, Tony
2016-03-11 22:10                                 ` Tony Luck
2016-03-11 22:14                                   ` Dan Williams
2016-03-12 17:16                                   ` Ingo Molnar
2016-03-13  1:12                                     ` Linus Torvalds
2016-03-13  9:25                                       ` Ingo Molnar
2016-03-14 22:33                                         ` [PATCH] x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe() Tony Luck
2016-03-16  8:06                                           ` [tip:x86/urgent] " tip-bot for Tony Luck
2016-02-26  0:58                           ` [PATCH v13] x86, mce: Add memcpy_trap() Linus Torvalds
2016-02-26  1:19                             ` Andy Lutomirski
2016-02-26  2:38                               ` Linus Torvalds
2016-02-18 18:12   ` [PATCH v11 3/4] x86, mce: Add __mcsafe_copy() Linus Torvalds
2016-02-18 18:51     ` Ingo Molnar
2016-02-18 18:52     ` Luck, Tony
2016-02-18 20:14       ` Ingo Molnar
2016-02-18 21:33         ` Dan Williams
2016-02-17 18:20 ` [PATCH v11 2/4] x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries Tony Luck
2016-02-18 10:19   ` [tip:ras/core] x86/mce: " tip-bot for Tony Luck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).