* [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
@ 2020-04-07  1:12 Keno Fischer
  2020-04-07  3:57 ` Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Keno Fischer @ 2020-04-07  1:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Kyle Huey,
	Robert O'Callahan

This is a follow-up to my patch from two years ago [1]. I've been using
rebased versions of that patch locally for my needs, but I've had
more people ask me about this patch recently, so I figured I'd
have another go at this. Even if this doesn't make it into mainline,
I figure I can at least improve the quality of the patch for others
who want to use it. That said, here's how this version is
different from v1:

The major flaw in v1 was that the modified XCR0 would change the
layout of the in-kernel xsave state, which was not considered
acceptable. What this patch does instead is to restore XCR0 before
any kernel xsave(s)/xrstor(s) operation. XCR0 is then restored together
with the FPU state on return to userspace if necessary.
Of course, this code path is extremely hot, so I additionally added
a static_key flag that disables the extra check for XCR0-restoring
unless there is a process in the system that has requested a modified
XCR0. To summarize, the performance impact in the hot path is:

a) In the general case (no thread running with modified XCR0), an extra
   5-byte nop for every xstate operation. This should be essentially free.
b) If some process in the system is running with a modified XCR0, an extra
   xgetbv and conditional branch. My understanding is that xgetbv is quite
   cheap at around 10 cycles or so (though obviously more expensive than a
   nop), so it's used to guard the xsetbv call for the common case that the
   XCR0 was not modified. If there is some situation that I did not think of
   in which xgetbv is expensive, this could instead look at the value in the
   thread_struct, but I thought the xgetbv would probably be cheaper.
   Additionally, we incur one additional conditional branch on return to
   user space if the FPU state needs to be reloaded.
c) For the process that's running with the modified XCR0, we incur an
   extra xsetbv for every kernel xstate operation.

Obviously point c) means that processes with active XCR0 modification will
incur noticeable overhead (an xsetbv may be expensive, or even incur a vmexit
if running under virtual hardware), even when switching between two threads
with the same (modified) XCR0 value. Note though that this only happens if
the xstate gets moved to memory. For simple round trips through the kernel
(e.g. simple syscalls or signals) that do not use the FPU from the kernel
and that do not cause a context switch, no additional instruction is executed
compared to the usual case.

I think these performance characteristics are the right shape for this kind
of feature, shifting the performance impact to users of the feature as much
as possible. I had hoped it might perhaps be possible to indicate
non-supervisor state components in the IA32_XSS MSR, but those bits are documented
as faulting if set on current-generation CPUs. If that were changed to allow such
user components, it could avoid the necessity of the extra xsetbv operations.
That said, given the hardware we have today, the scheme I implemented here
is the best I could come up with.
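
To make the intended usage concrete, here is a minimal, hypothetical
userspace sketch (not part of the patch) of how a tracer like rr could
invoke the new prctl; the xfeature mask values are just the standard
XCR0 bit definitions:

  /*
   * Hypothetical usage sketch: request a reduced XCR0 for this thread.
   * Assumes a kernel with this patch applied.
   */
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef ARCH_SET_XCR0
  #define ARCH_SET_XCR0 0x1021            /* from asm/prctl.h below */
  #endif

  #define XFEATURE_MASK_FP  0x1           /* x87 */
  #define XFEATURE_MASK_SSE 0x2
  #define XFEATURE_MASK_YMM 0x4           /* AVX */

  int main(void)
  {
          /* Ask for x87+SSE+AVX only, hiding e.g. the AVX-512 components. */
          unsigned long mask = XFEATURE_MASK_FP | XFEATURE_MASK_SSE |
                               XFEATURE_MASK_YMM;

          if (syscall(SYS_arch_prctl, ARCH_SET_XCR0, mask))
                  perror("ARCH_SET_XCR0");
          return 0;
  }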

Thank you in advance for any review comments - I thought the comments
on v1 were quite helpful. I understand this is a tall ask for mainline
inclusion, but the feature is just too useful for me to give up ;).

[1] https://lkml.org/lkml/2018/6/16/134

Signed-off-by: Keno Fischer <keno@juliacomputing.com>
---
 arch/x86/include/asm/fpu/internal.h |  65 ++++++++++-------
 arch/x86/include/asm/fpu/xstate.h   |   2 +
 arch/x86/include/asm/processor.h    |   2 +
 arch/x86/include/uapi/asm/prctl.h   |   2 +
 arch/x86/kernel/fpu/core.c          |   5 ++
 arch/x86/kernel/fpu/xstate.c        |  11 ++-
 arch/x86/kernel/process.c           | 106 ++++++++++++++++++++++++++--
 7 files changed, 161 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 44c48e34d799..2db28f084206 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -15,6 +15,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
+#include <linux/jump_label_ratelimit.h>
 
 #include <asm/user.h>
 #include <asm/fpu/api.h>
@@ -314,6 +315,41 @@ static inline void copy_kernel_to_xregs_booting(struct xregs_state *xstate)
 	WARN_ON_FPU(err);
 }
 
+/*
+ * MXCSR and XCR definitions:
+ */
+
+extern unsigned int mxcsr_feature_mask;
+
+#define XCR_XFEATURE_ENABLED_MASK	0x00000000
+
+static inline u64 xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax + ((u64)edx << 32);
+}
+
+static inline void xsetbv(u32 index, u64 value)
+{
+	u32 eax = value;
+	u32 edx = value >> 32;
+
+	asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
+		     : : "a" (eax), "d" (edx), "c" (index));
+}
+
+static inline void reset_xcr0(u64 value)
+{
+	if (static_branch_unlikely(&xcr0_switching_active.key)) {
+		if (unlikely(xgetbv(XCR_XFEATURE_ENABLED_MASK) != value))
+			xsetbv(XCR_XFEATURE_ENABLED_MASK, value);
+	}
+}
+
 /*
  * Save processor xstate to xsave area.
  */
@@ -326,6 +362,7 @@ static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
 
 	WARN_ON_FPU(!alternatives_patched);
 
+	reset_xcr0(xfeatures_mask);
 	XSTATE_XSAVE(xstate, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */
@@ -340,6 +377,7 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 
+	reset_xcr0(xfeatures_mask);
 	XSTATE_XRESTORE(xstate, lmask, hmask);
 }
 
@@ -615,31 +653,4 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 	__write_pkru(pkru_val);
 }
 
-/*
- * MXCSR and XCR definitions:
- */
-
-extern unsigned int mxcsr_feature_mask;
-
-#define XCR_XFEATURE_ENABLED_MASK	0x00000000
-
-static inline u64 xgetbv(u32 index)
-{
-	u32 eax, edx;
-
-	asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
-		     : "=a" (eax), "=d" (edx)
-		     : "c" (index));
-	return eax + ((u64)edx << 32);
-}
-
-static inline void xsetbv(u32 index, u64 value)
-{
-	u32 eax = value;
-	u32 edx = value >> 32;
-
-	asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
-		     : : "a" (eax), "d" (edx), "c" (index));
-}
-
 #endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index c6136d79f8c0..521ca7a5d2d2 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -43,11 +43,13 @@
 
 extern u64 xfeatures_mask;
 extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
+extern struct static_key_false_deferred xcr0_switching_active;
 
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
 void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+int xfeature_size(int xfeature_nr);
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int offset, unsigned int size);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3bcf27caf6c9..28caba95d553 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -521,6 +521,8 @@ struct thread_struct {
 	unsigned long           debugreg6;
 	/* Keep track of the exact dr7 value set by the user */
 	unsigned long           ptrace_dr7;
+	/* Keep track of the exact XCR0 set by the user */
+	unsigned long           xcr0;
 	/* Fault info: */
 	unsigned long		cr2;
 	unsigned long		trap_nr;
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..e9ca8facf1f6 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,8 @@
 #define ARCH_GET_CPUID		0x1011
 #define ARCH_SET_CPUID		0x1012
 
+#define ARCH_SET_XCR0		0x1021
+
 #define ARCH_MAP_VDSO_X32	0x2001
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 12c70840980e..c66ae73988d5 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -341,6 +341,11 @@ void switch_fpu_return(void)
 		return;
 
 	__fpregs_load_activate();
+
+	if (static_branch_unlikely(&xcr0_switching_active.key)) {
+		if (unlikely(current->thread.xcr0))
+			reset_xcr0(current->thread.xcr0);
+	}
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 32b153d38748..506d78dda9cb 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -57,6 +57,14 @@ static short xsave_cpuid_features[] __initdata = {
  * Mask of xstate features supported by the CPU and the kernel:
  */
 u64 xfeatures_mask __read_mostly;
+EXPORT_SYMBOL_GPL(xfeatures_mask);
+
+/*
+ * If true, some threads are running with a modified xcr0, so the current value
+ * of xcr0 should be checked before performing kernel xstate operations
+ */
+DEFINE_STATIC_KEY_DEFERRED_FALSE(xcr0_switching_active, HZ);
+EXPORT_SYMBOL_GPL(xcr0_switching_active);
 
 static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
@@ -448,7 +456,7 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
 	return ebx;
 }
 
-static int xfeature_size(int xfeature_nr)
+int xfeature_size(int xfeature_nr)
 {
 	u32 eax, ebx, ecx, edx;
 
@@ -456,6 +464,7 @@ static int xfeature_size(int xfeature_nr)
 	cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
 	return eax;
 }
+EXPORT_SYMBOL_GPL(xfeature_size);
 
 /*
  * 'XSAVES' implies two different things:
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9da70b279dad..30da1b923b1d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -92,6 +92,9 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	dst->thread.vm86 = NULL;
 #endif
 
+	if (unlikely(dst->thread.xcr0))
+		static_branch_deferred_inc(&xcr0_switching_active);
+
 	return fpu__copy(dst, src);
 }
 
@@ -108,6 +111,9 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	if (unlikely(t->xcr0))
+		static_branch_slow_dec_deferred(&xcr0_switching_active);
+
 	fpu__drop(fpu);
 }
 
@@ -980,15 +986,107 @@ unsigned long get_wchan(struct task_struct *p)
 	return ret;
 }
 
+static int xcr0_is_legal(unsigned long xcr0)
+{
+	/* Conservatively disallow anything above bit 9,
+	 * to avoid accidentally allowing the disabling of
+	 * new features without updating these checks
+	 */
+	if (xcr0 & ~((1 << 10) - 1))
+		return 0;
+	if (!(xcr0 & XFEATURE_MASK_FP))
+		return 0;
+	if ((xcr0 & XFEATURE_MASK_YMM) && !(xcr0 & XFEATURE_MASK_SSE))
+		return 0;
+	if ((!(xcr0 & XFEATURE_MASK_BNDREGS)) !=
+		(!(xcr0 & XFEATURE_MASK_BNDCSR)))
+		return 0;
+	if (xcr0 & XFEATURE_MASK_AVX512) {
+		if (!(xcr0 & XFEATURE_MASK_YMM))
+			return 0;
+		if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
+			return 0;
+	}
+	return 1;
+}
+
+static int xstate_is_initial(unsigned long mask)
+{
+	int i, j;
+	unsigned long max_bit = __ffs(mask);
+
+	for (i = 0; i < max_bit; ++i) {
+		if (mask & (1 << i)) {
+			char *xfeature_addr = (char *)get_xsave_addr(
+				&current->thread.fpu.state.xsave,
+				1 << i);
+			unsigned long feature_size = xfeature_size(i);
+
+			for (j = 0; j < feature_size; ++j) {
+				if (xfeature_addr[j] != 0)
+					return 0;
+			}
+		}
+	}
+	return 1;
+}
+
 long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long cpuid_enabled)
+			  unsigned long arg2)
 {
 	switch (option) {
 	case ARCH_GET_CPUID:
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
-		return set_cpuid_mode(task, cpuid_enabled);
-	}
+		return set_cpuid_mode(task, arg2);
+	case ARCH_SET_XCR0: {
+		if (!use_xsave())
+			return -ENODEV;
+
+		/* Do not allow enabling xstate components that the kernel doesn't
+		 * know about
+		 */
+		if (arg2 & ~xfeatures_mask)
+			return -ENODEV;
+
+		/* Do not allow xcr0 values that the hardware would refuse to load later
+		 */
+		if (!xcr0_is_legal(arg2))
+			return -EINVAL;
 
-	return -EINVAL;
+		/*
+		* We require that any state components being disabled by
+		* this prctl be currently in their initial state.
+		*/
+		if (!xstate_is_initial(arg2))
+			return -EPERM;
+
+		if (arg2 == xfeatures_mask) {
+			if (!current->thread.xcr0)
+				return 0;
+
+			/* A value of zero here means to use the kernel's xfeature mask,
+			 * so restore that value here
+			 */
+			current->thread.xcr0 = 0;
+			xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
+			static_branch_slow_dec_deferred(&xcr0_switching_active);
+		} else {
+			if (arg2 == current->thread.xcr0)
+				return 0;
+			if (!current->thread.xcr0)
+				static_branch_deferred_inc(&xcr0_switching_active);
+
+			current->thread.xcr0 = arg2;
+			/* Ask to restore the FPU on userspace exit, which will restore our
+			 * requested XCR0 value
+			 */
+			set_thread_flag(TIF_NEED_FPU_LOAD);
+		}
+
+		return 0;
+	}
+	default:
+		return -EINVAL;
+	}
 }
-- 
2.25.1



* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  1:12 [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread Keno Fischer
@ 2020-04-07  3:57 ` Andy Lutomirski
  2020-04-07  4:44   ` Keno Fischer
  2020-04-07 12:21 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2020-04-07  3:57 UTC (permalink / raw)
  To: Keno Fischer
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Kyle Huey,
	Robert O'Callahan


> On Apr 6, 2020, at 6:13 PM, Keno Fischer <keno@juliacomputing.com> wrote:
> 
> This is a follow-up to my patch from two years ago [1].

Your changelog is missing an explanation of why this is useful.  Why would a user program want to change XCR0?




* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  3:57 ` Andy Lutomirski
@ 2020-04-07  4:44   ` Keno Fischer
  2020-04-07  4:53     ` Kyle Huey
  0 siblings, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-07  4:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, x86,
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Mon, Apr 6, 2020 at 11:58 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
> > On Apr 6, 2020, at 6:13 PM, Keno Fischer <keno@juliacomputing.com> wrote:
> >
> > This is a follow-up to my patch from two years ago [1].
>
> Your changelog is missing an explanation of why this is useful.  Why would a user program want to change XCR0?

Ah, sorry - I wasn't sure what the convention was around repeating the
applicable parts from the v1 changelog in this email.
Here's the description from the v1 patch:

> The rr (http://rr-project.org/) debugger provides user space
> record-and-replay functionality by carefully controlling the process
> environment in order to ensure completely deterministic execution
> of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
> allows rr to move traces across (Intel) machines, by allowing cpuid
> invocations to be reliably recorded and replayed. This works very
> well, with one catch: It is currently not possible to replay a
> recording from a machine supporting a smaller set of XCR0 state
> components on one supporting a larger set. This is because the
> value of XCR0 is observable in userspace (either by explicit
> xgetbv or by looking at the result of xsave) and since glibc
> does observe this value, replay divergence is almost immediate.
> I also suspect that people interested in process (or container)
> live-migration may eventually care about this if a migration happens
> in between a userspace xsave and a corresponding xrstor.
>
> We encounter this problem quite frequently since most of our users
> are using pre-Skylake systems (and thus don't support the AVX512
> state components), while we recently upgraded our main development
> machines to Skylake.

Basically, for rr to work, we need to tightly control any user-visible CPU
behavior, either by putting the CPU in the right state or by trapping and
emulating (as we do for rdtsc, cpuid, etc). XCR0 controls a bunch of
user-visible CPU behavior, namely:
1) The size of the xsave region if xsave is passed an all-ones mask
(which is fairly common)
2) The return value of xgetbv
3) Whether instructions making use of the relevant xstate components trap
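
To make (2) concrete, here is an illustrative sketch (not rr's actual
code; real code would also check the OSXSAVE cpuid bit first) of the
kind of XCR0 probe userspace performs - e.g. glibc's dynamic linker
when selecting ifunc variants:

  /* Illustrative sketch of a userspace XCR0 probe. */
  #include <stdint.h>
  #include <stdio.h>

  static uint64_t read_xcr0(void)
  {
          uint32_t eax, edx;

          /* xgetbv with ecx = 0 reads XCR0 */
          asm volatile(".byte 0x0f, 0x01, 0xd0"
                       : "=a" (eax), "=d" (edx) : "c" (0));
          return eax | ((uint64_t)edx << 32);
  }

  int main(void)
  {
          /*
           * If this value differs between the record and replay machines
           * (say, in the AVX-512 bits), the replayed execution diverges.
           */
          printf("XCR0 = %#llx\n", (unsigned long long)read_xcr0());
          return 0;
  }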

In the v1 review, it was raised that user space could be adjusted to deal
with these issues by always checking support in cpuid first (which is
already emulatable). Unfortunately, we don't control the environment on the
record side (rr supports recording on any Intel CPU from the past decade -
with the exception of a few that have microarchitectural bugs causing
problems - and kernel versions back to 3.11), so trying to patch user space
is unfortunately a no-go for us (and of course, being a debugging tool, we
want to be able to help users debug if they get uses of these instructions
wrong).

Another suggestion in the v1 review was to use a VM instead with an
appropriate XCR0 value. That does mostly work, but has some problems:
1) The performance is quite a bit worse (particularly if we're already
replaying in a virtualized environment)
2) We may want to simultaneously replay tasks with different XCR0 values.
This comes into play e.g. when recording a distributed system where
different nodes in the system are on hosts with different hardware
configurations (the reason you want to replay them jointly rather than
node-by-node is that this way you can avoid recording any inter-node
communication, since you can just recompute it from the trace).

As a result, doing this with fully-featured VMs isn't an attractive
proposition. I had looked into doing something more light-weight using the
raw KVM API, or something analogous to what project dune did
(http://dune.scs.stanford.edu/ - basically implementing Linux user space,
but where the threads run in guest CPL0 rather than host CPL3). My
conclusion was that this approach too would require significant kernel
modification to work well (as well as having the noted performance problems
in virtualized environments).

Sorry if this is too much of an info dump, but I hope this gives some color.

Keno


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  4:44   ` Keno Fischer
@ 2020-04-07  4:53     ` Kyle Huey
  2020-04-07 12:33       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Kyle Huey @ 2020-04-07  4:53 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Andy Lutomirski, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Mon, Apr 6, 2020 at 9:45 PM Keno Fischer <keno@juliacomputing.com> wrote:
>
> On Mon, Apr 6, 2020 at 11:58 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> > > On Apr 6, 2020, at 6:13 PM, Keno Fischer <keno@juliacomputing.com> wrote:
> > >
> > > This is a follow-up to my patch from two years ago [1].
> >
> > Your changelog is missing an explanation of why this is useful.  Why would a user program want to change XCR0?
>
> Ah, sorry - I wasn't sure what the convention was around repeating the
> applicable parts from the v1 changelog in this email.
> Here's the description from the v1 patch:
>
> > The rr (http://rr-project.org/) debugger provides user space
> > record-and-replay functionality by carefully controlling the process
> > environment in order to ensure completely deterministic execution
> > of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
> > allows rr to move traces across (Intel) machines, by allowing cpuid
> > invocations to be reliably recorded and replayed. This works very
> > well, with one catch: It is currently not possible to replay a
> > recording from a machine supporting a smaller set of XCR0 state
> > components on one supporting a larger set. This is because the
> > value of XCR0 is observable in userspace (either by explicit
> > xgetbv or by looking at the result of xsave) and since glibc
> > does observe this value, replay divergence is almost immediate.
> > I also suspect that people interested in process (or container)
> > live-migration may eventually care about this if a migration happens
> > in between a userspace xsave and a corresponding xrstor.
> >
> > We encounter this problem quite frequently since most of our users
> > are using pre-Skylake systems (and thus don't support the AVX512
> > state components), while we recently upgraded our main development
> > machines to Skylake.
>
> Basically, for rr to work, we need to tightly control any user-visible
> CPU behavior,
> either by putting the CPU in the right state or by trapping and emulating
> (as we do for rdtsc, cpuid, etc). XCR0 controls a bunch of
> user-visible CPU behavior,
> namely:
> 1) The size of the xsave region if xsave is passed an all-ones mask
> (which is fairly common)
> 2) The return value of xgetbv

It's mentioned elsewhere, but I want to emphasize that the return
value of xgetbv is the big one because the dynamic linker uses this.
rr trace portability is essentially limited to machines with identical
xcr0 values because of it.

- Kyle



* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  1:12 [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread Keno Fischer
  2020-04-07  3:57 ` Andy Lutomirski
@ 2020-04-07 12:21 ` Peter Zijlstra
  2020-04-07 14:06   ` Dave Hansen
  2020-04-07 13:14 ` Dave Hansen
  2020-04-07 14:20 ` Andi Kleen
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2020-04-07 12:21 UTC (permalink / raw)
  To: Keno Fischer
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Kyle Huey,
	Robert O'Callahan

On Mon, Apr 06, 2020 at 09:12:59PM -0400, Keno Fischer wrote:
> This is a follow-up to my patch from two years ago [1]. I've been using
> rebased versions of that patch locally for my needs, but I've had
> more people ask me about this patch recently, so I figured I'd
> have another go at this. Even if this doesn't make it into mainline,
> I figure I can at least improve the quality of the patch for others
> who want to use it. That said, here's how this version is
> different from v1:

Aside from having an inconsistent comment style and whitespace damage, it
adds exports without a module user.

But my main reason for replying is asking: 'What the heck for?'

You had a fairly long changelog detailing what the patch does, but I've
failed to find a single word on _WHY_ we want to do any of that.


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  4:53     ` Kyle Huey
@ 2020-04-07 12:33       ` Peter Zijlstra
  2020-04-07 13:52         ` Keno Fischer
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2020-04-07 12:33 UTC (permalink / raw)
  To: Kyle Huey
  Cc: Keno Fischer, Andy Lutomirski, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Mon, Apr 06, 2020 at 09:53:40PM -0700, Kyle Huey wrote:
> > Basically, for rr to work, we need to tightly control any user-visible
> > CPU behavior,
> > either by putting the CPU in the right state or by trapping and emulating
> > (as we do for rdtsc, cpuid, etc). XCR0 controls a bunch of
> > user-visible CPU behavior,
> > namely:
> > 1) The size of the xsave region if xsave is passed an all-ones mask
> > (which is fairly common)
> > 2) The return value of xgetbv
> 
> It's mentioned elsewhere, but I want to emphasize that the return
> value of xgetbv is the big one because the dynamic linker uses this.
> rr trace portability is essentially limited to machines with identical
> xcr0 values because of it.

I'm thinking just exposing that value is doable in a much less
objectionable fashion, no?


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  1:12 [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread Keno Fischer
  2020-04-07  3:57 ` Andy Lutomirski
  2020-04-07 12:21 ` Peter Zijlstra
@ 2020-04-07 13:14 ` Dave Hansen
       [not found]   ` <CABV8kRw1TQsqs+z43bSfZ5isctuFGMB4g_ztDYihiiXHcy4nVA@mail.gmail.com>
  2020-04-07 14:20 ` Andi Kleen
  3 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2020-04-07 13:14 UTC (permalink / raw)
  To: Keno Fischer, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Kyle Huey,
	Robert O'Callahan

I didn't review the first attempt at this, but this looks like a really
bad idea to me.  I don't 100% buy the arguments that this shouldn't be
done in a VM.  The x86 virtualization architecture was literally
designed to hide hardware details like this.

I can't imagine ever merging this until using VMs or at least the KVM
API has been completely ruled out.

I don't doubt that this approach works in a controlled environment.
But, I can't imagine that we could ever wrap sane semantics around such
a beast.

For instance, what would ARCH_SET_XCR0 do in a signal handler?  The
components are mostly in their init states so xstate_is_initial() will
pass.  But the kernel's XRSTOR might #GP because it might try to restore
states that are not enabled in XCR0 now.

Also, this has to be done before XCR0's state is observed by userspace,
which the kernel can't easily do once the program is off and running.
This tells me that it really needs to be done at execve() time, even
before the first program instruction runs.

These are usually the kinds of things that you figure out when you try
to go write a manpage for one of these suckers.

How does this work with things like xstateregs_[gs]et() where the format
of the kernel buffer and thus the kernel XCR0 is exposed as part of our
ABI?  With this patch, wouldn't a debugger app see a state buffer that
looks invalid?

Where else in our ABI is the format of the XSAVE buffer exposed?

I'm extra wary of a v2 that's missing the CodingStyle and changelog
basics as well.

> +static int xcr0_is_legal(unsigned long xcr0)
> +{
> +	/* Conservatively disallow anything above bit 9,
> +	 * to avoid accidentally allowing the disabling of
> +	 * new features without updating these checks
> +	 */
> +	if (xcr0 & ~((1 << 10) - 1))
> +		return 0;

Yay, magic numbers!

This would be much better as a BUILD_BUG_ON().
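
Something like this (an untested sketch, just to illustrate the idea)
would keep the cutoff but fail the build as soon as the kernel learns
about an xfeature above bit 9:

	BUILD_BUG_ON(XFEATURE_MAX > 10);
	if (xcr0 & GENMASK_ULL(63, 10))
		return 0;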

> +	if (!(xcr0 & XFEATURE_MASK_FP))
> +		return 0;
> +	if ((xcr0 & XFEATURE_MASK_YMM) && !(xcr0 & XFEATURE_MASK_SSE))
> +		return 0;
> +	if ((!(xcr0 & XFEATURE_MASK_BNDREGS)) !=
> +		(!(xcr0 & XFEATURE_MASK_BNDCSR)))
> +		return 0;
> +	if (xcr0 & XFEATURE_MASK_AVX512) {
> +		if (!(xcr0 & XFEATURE_MASK_YMM))
> +			return 0;
> +		if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
> +			return 0;
> +	}
> +	return 1;
> +}

This appears to copy (instead of refactoring) code from __kvm_set_xcr(),
yet manages to get the indentation different and wrong.



* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 12:33       ` Peter Zijlstra
@ 2020-04-07 13:52         ` Keno Fischer
  0 siblings, 0 replies; 26+ messages in thread
From: Keno Fischer @ 2020-04-07 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kyle Huey, Andy Lutomirski, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

> > It's mentioned elsewhere, but I want to emphasize that the return
> > value of xgetbv is the big one because the dynamic linker uses this.
> > rr trace portability is essentially limited to machines with identical
> > xcr0 values because of it.
>
> I'm thinking just exposing that value is doable in a much less
> objectionable fashion, no?

Hi Peter,

I'm not sure I understand what you're asking,
but let me attempt to provide an answer anyway.
If I'm off the mark in what you would like to know,
please let me know and I'll try my best to get back
to you.

rr's operating principle relies upon every instruction having
deterministic and reproducible behavior, every time it is executed and
across machines. That means literally bitwise-identical updates to the
x86 register state. Most instructions do that given identical register
state - of course some don't by design, like rdtsc. Those instructions
get trapped and emulated (we're very lucky that doing so is possible
for all such instructions of practical interest on Intel hardware).
xcr0 puts us in a bit of a bind here, because it modifies the
user-visible behavior of instructions (in the three ways I mentioned).
The xgetbv behavior is indeed the most problematic. If there were a way
to selectively trap xgetbv/xsave/xrstor and emulate it, that would
likely prove sufficient (even just xgetbv may be sufficient, but I'd
have to do further work to validate that). However, I don't think it's
possible to trap these instructions without also disabling the
corresponding xstate components, which we do not want, since those
instructions do actually need to get executed.

Thanks,
Keno


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 12:21 ` Peter Zijlstra
@ 2020-04-07 14:06   ` Dave Hansen
  2020-04-07 14:16     ` Andy Lutomirski
  2020-04-07 16:29     ` Kyle Huey
  0 siblings, 2 replies; 26+ messages in thread
From: Dave Hansen @ 2020-04-07 14:06 UTC (permalink / raw)
  To: Peter Zijlstra, Keno Fischer
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Kyle Huey,
	Robert O'Callahan

On 4/7/20 5:21 AM, Peter Zijlstra wrote:
> You had a fairly long changelog detailing what the patch does, but I've
> failed to find a single word on _WHY_ we want to do any of that.

The goal in these record/replay systems is to be able to recreate the
exact same program state on two systems at two different times.  To make
it reasonably fast, they try to minimize the number of snapshots they
have to take and avoid things like single stepping.

So, there are some windows where they just let the CPU run and don't
bother with taking any snapshots of register state, for instance.  Let's
say you read a word from shared memory, multiply it and shift it around
some registers, then stick it back in shared memory.  Most of these
things will just record a snapshot at the memory read and assume
that all the instructions in the middle execute deterministically.  That
eliminates a ton of snapshots.

But what if an instruction in the middle isn't deterministic between
two machines?  Let's say you record a trace on a Broadwell system,
then try to replay it on a Skylake, and one of the non-snapshotted
instructions is xgetbv.  Skylake added MPX, so xgetbv will return
different values.  Your replay diverges from what was "recorded", and
life sucks.

Same problem exists for CPUID, but that was hacked around in another set.

I'm also trying to think of what kinds of things CPU companies add to
their architectures that would break this stuff.  I can't recall ever
having a discussion with folks at Intel where we're designing a CPU
feature and we say, "Can't do that, it would break record/replay".  I
suspect there are more of these landmines around and I bet that we're
building more of them into CPUs every day.


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 14:06   ` Dave Hansen
@ 2020-04-07 14:16     ` Andy Lutomirski
  2020-04-07 18:30       ` Keno Fischer
  2020-04-07 16:29     ` Kyle Huey
  1 sibling, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2020-04-07 14:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Keno Fischer, linux-kernel, Thomas Gleixner,
	Ingo Molnar, x86, H. Peter Anvin, Borislav Petkov, Dave Hansen,
	Andi Kleen, Kyle Huey, Robert O'Callahan



> On Apr 7, 2020, at 7:07 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/7/20 5:21 AM, Peter Zijlstra wrote:
>> You had a fairly long changelog detailing what the patch does, but I've
>> failed to find a single word on _WHY_ we want to do any of that.
> 
> The goal in these record/replay systems is to be able to recreate the
> exact same program state on two systems at two different times.  To make
> it reasonably fast, they try to minimize the number of snapshots they
> have to take and avoid things like single stepping.
> 
> So, there are some windows where they just let the CPU run and don't
> bother with taking any snapshots of register state, for instance.  Let's
> say you read a word from shared memory, multiply it and shift it around
> some registers, then stick it back in shared memory.  Most of these
> things will just record a snapshot at the memory read and assume
> that all the instructions in the middle execute deterministically.  That
> eliminates a ton of snapshots.
> 
> But what if an instruction in the middle isn't deterministic between
> two machines?  Let's say you record a trace on a Broadwell system,
> then try to replay it on a Skylake, and one of the non-snapshotted
> instructions is xgetbv.  Skylake added MPX, so xgetbv will return
> different values.  Your replay diverges from what was "recorded", and
> life sucks.
> 
> Same problem exists for CPUID, but that was hacked around in another set.
> 
> I'm also trying to think of what kinds of things CPU companies add to
> their architectures that would break this stuff.  I can't recall ever
> having a discussion with folks at Intel where we're designing a CPU
> feature and we say, "Can't do that, it would break record/replay".  I
> suspect there are more of these landmines around and I bet that we're
> building more of them into CPUs every day.

TSX!

I think rr should give the raw KVM API at least a try.  It should be possible to fire up a vCPU in CPL3 in the correct state.  No guest kernel required.  I don’t know if there will be issues with the perf API, though.

If we actually do merge this XCR0 hack, I think the rule should be that it has no effect on kernel behavior.  Signals, ptrace, etc reflect the normal XCR0, not the overridden value.


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07  1:12 [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread Keno Fischer
                   ` (2 preceding siblings ...)
  2020-04-07 13:14 ` Dave Hansen
@ 2020-04-07 14:20 ` Andi Kleen
  2020-04-07 18:06   ` Keno Fischer
  3 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2020-04-07 14:20 UTC (permalink / raw)
  To: Keno Fischer
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Kyle Huey,
	Robert O'Callahan

> Thank you in advance for any review comments - I thought the comments
> on v1 were quite helpful. I understand this is a tall ask for mainline
> inclusion, but the feature is just too useful for me to give up ;).

> [1] https://lkml.org/lkml/2018/6/16/134

The rationale from that post should have been in this description.

You can already do what you want with the clearcpuid= boot flag, using the
infrastructure in [1], which is in newer kernels.

So to disable AVX512 you would use clearcpuid=304 and the AVX feature
set should be the same as Haswell. To disable AVX2 you would use
clearcpuid=293 and you get the same feature set as Sandy Bridge.
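
(For reference, those numbers are just the feature bit positions,
word*32 + bit, from arch/x86/include/asm/cpufeatures.h:

	X86_FEATURE_AVX2	=  9*32 +  5 = 293
	X86_FEATURE_AVX512F	=  9*32 + 16 = 304

so a command line hiding AVX-512 would carry e.g. clearcpuid=304.)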

A boot option is not as convenient as a run time setting, but has a lot
less weird corner cases.

-Andi

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kernel/cpu/cpuid-deps.c?id=0b00de857a648dafe7020878c7a27cf776f5edf4

-Andi


* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
       [not found]   ` <CABV8kRw1TQsqs+z43bSfZ5isctuFGMB4g_ztDYihiiXHcy4nVA@mail.gmail.com>
@ 2020-04-07 16:27     ` Dave Hansen
  2020-04-07 17:55       ` Keno Fischer
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2020-04-07 16:27 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

>> How does this work with things like xstateregs_[gs]et() where the format
>> of the kernel buffer and thus the kernel XCR0 is exposed as part of our
>> ABI?  With this patch, wouldn't a debugger app see a state buffer that
>> looks invalid?
> 
> Since those operate on the in-kernel buffer of these, which
> in this patch always uses the unmodified XCR0, ptracers
> should not observe a difference.

Those operate on *BOTH* kernel and userspace buffers.  They copy between
them.  That's kinda the point. :)

But I don't see any modifications to copy_xstate_to_user() or
user_regset_copyout() in your patch.

I suspect the patch thus far is only the tip of the iceberg.  I'd really
suggest doing some more thorough audits of all of the locations in the
kernel that touch the fpu buffer *or* that call XSAVE/XRSTOR.  I'm
pretty sure that audit hasn't been done or the ptrace example would have
been found already.

>> I'm also trying to think of what kinds of things CPU companies add to
>> their architectures that would break this stuff.  I can't recall ever
>> having a discussion with folks at Intel where we're designing a CPU
>> feature and we say, "Can't do that, it would break record/replay".
> 
> Heh, I'm having these discussions for you - ask me which Intel
> microarchitectures have interesting bugs here ;). The fact that rr works
> is pretty much the only reason we buy Intel hardware these days, so
> there is at least a good reason for Intel folks to care. I think the evil
> plan is to make rr so good that everybody is using it, so you'll
> start having these conversations more :).

Having reverse execution is a laudable goal.  I've been using this:

	https://www.windriver.com/products/simics/

to do kernel (and occasional app) debugging the last few years, and its
reverse execution is invaluable for certain kinds of debugging.  But,
it's also not my daily go-to for debugging.

I'm just far from convinced that your problem is worth solving,
especially in the place you're proposing to solve it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 14:06   ` Dave Hansen
  2020-04-07 14:16     ` Andy Lutomirski
@ 2020-04-07 16:29     ` Kyle Huey
  1 sibling, 0 replies; 26+ messages in thread
From: Kyle Huey @ 2020-04-07 16:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Keno Fischer, open list, Thomas Gleixner,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Tue, Apr 7, 2020 at 7:07 AM Dave Hansen <dave.hansen@intel.com> wrote:
> I'm also trying to think of what kinds of things CPU companies add to
> their architectures that would break this stuff.  I can't recall ever
> having a discussion with folks at Intel where we're designing a CPU
> feature and we say, "Can't do that, it would break record/replay".  I
> suspect there are more of these landmines around and I bet that we're
> building more of them into CPUs every day.

With the ability to control the userspace view of CPUID, which we have
had for a couple years in the kernel now, this is much less of a
concern than you might immediately imagine. Most features that get
added are either entirely deterministic (like all the fancy SIMD
stuff) or can be disabled solely by lying to userspace about the CPUID
flags (like RDRAND or TSX). XGETBV is tricky for rr because it depends
on the kernel state as well, but that is rather unusual.
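(For reference, the mechanism for the CPUID part is the ARCH_SET_CPUID
arch_prctl: the supervisor turns the CPUID instruction off in the tracee
and then answers the resulting faults itself. A rough sketch, assuming
the 0x1012 value from asm/prctl.h and a CPU that supports CPUID
faulting:)

#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_SET_CPUID
#define ARCH_SET_CPUID 0x1012	/* from asm/prctl.h */
#endif

/* After this, CPUID in the calling task raises SIGSEGV; the tracer
 * catches the fault via ptrace and emulates the instruction with
 * whatever feature bits it wants the tracee to see. */
static int disable_cpuid(void)
{
	return syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0);
}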

I lose far more sleep worrying about Intel introducing a new
microarchitecture that breaks our assumptions about performance
counter determinism than I do about new CPU features that are tricky
to handle.

- Kyle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 16:27     ` Dave Hansen
@ 2020-04-07 17:55       ` Keno Fischer
  2020-04-07 20:21         ` Dave Hansen
  0 siblings, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-07 17:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Tue, Apr 7, 2020 at 12:27 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> >> How does this work with things like xstateregs_[gs]et() where the format
> >> of the kernel buffer and thus the kernel XCR0 is exposed as part of our
> >> ABI?  With this patch, wouldn't a debugger app see a state buffer that
> >> looks invalid?
> >
> > Since those operate on the in-kernel buffer of these, which
> > in this patch always uses the unmodified XCR0, ptracers
> > should not observe a difference.
>
> Those operate on *BOTH* kernel and userspace buffers.  They copy between
> them.  That's kinda the point. :)

Right, what I meant was that in this patch the kernel level
xsaves that populates the struct fpu always runs with
an unmodified XCR0, so the contents of the xsave area
in struct fpu does not change layout (this is the major
change in this patch over v1). Are you referring to a ptracer
which runs with a modified XCR0, and assumes that the
value it gets back from ptrace will have an XSTATE_BV
equal to its own observed XCR0 and thus get confused
about the layout of the buffer (or potentially have not copied
all of the relevant xstate because it calculated a wrong buffer
size)? If so, I think that's just a buggy ptracer. The kernel's
xfeature mask is available via ptrace and a well-behaved
ptracer should use that (e.g. gdb does, though it looks like
it then also assumes that the xstate has no holes, so it
potentially gets the layout wrong anyway).
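(To spell out what "should use that" means, something along these lines -
a minimal sketch with the buffer size and error handling simplified; the
xcr0 word sits in the software-reserved area at bytes 464..471 of the
xstate regset, per the comment in arch/x86/include/asm/user.h:)

#include <elf.h>		/* NT_X86_XSTATE */
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Read the kernel's xfeature mask out of the tracee's xstate regset
 * instead of assuming it matches the tracer's own XCR0. */
static uint64_t kernel_xfeature_mask(pid_t pid)
{
	unsigned char xstate[4096];
	struct iovec iov = { .iov_base = xstate, .iov_len = sizeof(xstate) };
	uint64_t mask = 0;

	if (ptrace(PTRACE_GETREGSET, pid, (void *)NT_X86_XSTATE, &iov) == 0)
		memcpy(&mask, xstate + 464, sizeof(mask));
	return mask;
}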

In general, I don't really want the modified XCR0 to affect
anything other than the particular instructions that depend
on it and maybe the signal frame (though as I said before,
I'm open to either here).

If I misunderstood what you were trying to say, I apologize.

> I suspect the patch thus far is only the tip of the iceberg.  I'd really
> suggest doing some more thorough audits of all of the locations in the
> kernel that touch the fpu buffer *or* that call XSAVE/XRSTOR.  I'm
> pretty sure that audit hasn't been done or the ptrace example would have
> been found already.

Yes, good idea. I will do this again. That said, I'm hoping that
the general principle of using the kernel layout (except perhaps
in the signal frame) will hold and thus not require modification.

> But, it's also not my daily go-to for debugging.

Luckily rr is fast enough (after much work) that there's almost
never a reason not to use it. Our developers essentially use
it exclusively (rather than raw gdb), and our users send us rr
traces as bug reports. It's basically become our one-stop-shop
for all things debugging on Linux.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 14:20 ` Andi Kleen
@ 2020-04-07 18:06   ` Keno Fischer
  0 siblings, 0 replies; 26+ messages in thread
From: Keno Fischer @ 2020-04-07 18:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Kyle Huey,
	Robert O'Callahan

Hi Andi,

> The rationale from that post should have been in this description.

Yes, sorry.

> You can already do what you want with the clearcpuid= boot flag, using the
> infrastructure in [1], which is in newer kernels.

Yes, that is useful, but it doesn't solve the problem where
we're trying to jointly replay traces with differing XCR0 values
(as mentioned before, this is useful for recordings from multiple
nodes of a distributed system). Even for the single-trace
case, having to reboot the system or boot a virtual machine
manually for every bug report I get would be operationally
annoying. A potential solution to the operational problem would
be using the raw KVM API to get a very lightweight VM
with a modified XCR0, but that has performance
and complexity concerns.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 14:16     ` Andy Lutomirski
@ 2020-04-07 18:30       ` Keno Fischer
  2020-04-14 23:20         ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-07 18:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

> TSX!

Yes, it's problematic, but luckily turns out to
be ok in practice if masked off in cpuid.

> I think rr should give the raw KVM API at least a try.  It should be possible to fire up a vCPU in CPL3 in the correct state.  No guest kernel required.  I don’t know if there will be issues with the perf API, though.

Yes, I've looked into it, but stopped short of doing a
complete implementation. Using KVM to solve it
for replay would probably be feasible with a moderate
amount of engineering work, since rr does very few
syscalls during replay. I'm a bit afraid of the
performance implications, but I don't have numbers on this.

Record and diversions are a lot harder though, because
in this mode the tracee is a live process and able to do
syscalls (and needs to receive signals and all that good
stuff associated with being a real process). For diversions,
performance isn't super important, so we could probably
emulate this, but for record, performance is quite critical.
I assume it would be possible to add a feature to KVM
where it forwards syscalls made in guest CPL3 to the real
kernel without round-trip through userspace, but I'm just
seeing myself back here asking
for a weird KVM feature that nobody but me wants ;)
(well almost nobody, as I mentioned, there's an
academic project that tried this with a custom kernel
plugin - http://dune.scs.stanford.edu/).

Admittedly, the use case for this feature during record is
less pressing, since in our (operational) case
the replay machines tend to be much newer than
the record machines, but I wouldn't be surprised if I got
bit by this as soon as the next user xstate component gets
added and users start sending me those kinds of traces,
even if we mask off the feature in CPUID (which rr already
supports for record for similar reasons).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 17:55       ` Keno Fischer
@ 2020-04-07 20:21         ` Dave Hansen
  2020-04-07 21:42           ` Andy Lutomirski
  2020-04-07 22:15           ` Keno Fischer
  0 siblings, 2 replies; 26+ messages in thread
From: Dave Hansen @ 2020-04-07 20:21 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On 4/7/20 10:55 AM, Keno Fischer wrote:
> On Tue, Apr 7, 2020 at 12:27 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>>>> How does this work with things like xstateregs_[gs]et() where the format
>>>> of the kernel buffer and thus the kernel XCR0 is exposed as part of our
>>>> ABI?  With this patch, wouldn't a debugger app see a state buffer that
>>>> looks invalid?
>>>
>>> Since those operate on the in-kernel buffer of these, which
>>> in this patch always uses the unmodified XCR0, ptracers
>>> should not observe a difference.
>>
>> Those operate on *BOTH* kernel and userspace buffers.  They copy between
>> them.  That's kinda the point. :)
> 
> Right, what I meant was that in this patch the kernel level
> xsaves that populates the struct fpu always runs with
> an unmodified XCR0, so the contents of the xsave area
> in struct fpu does not change layout (this is the major
> change in this patch over v1).

The userspace buffer is... a userspace buffer.  It is not and should not
be tied to the format of the kernel buffer.

> Are you referring to a ptracer which runs with a modified XCR0, and
> assumes that the value it gets back from ptrace will have an
> XSTATE_BV equal to its own observed XCR0 and thus get confused about
> the layout of the buffer (or potentially have not copied all of the
> relevant xstate because it calculated a wrong buffer size)?

I don't think it's insane for a process to assume that it can XRSTOR a
buffer that it gets back from ptrace.  That seems like something that
could clearly be an ABI that apps depend on.

Also, let's look at the comment about where XCR0 shows up in the ABI
(arch/x86/include/asm/user.h):

>  * For now, only the first 8 bytes of the software usable bytes[464..471] will
>  * be used and will be set to OS enabled xstate mask (which is same as the
>  * 64bit mask returned by the xgetbv's xCR0).

That also makes it sound like we expect there to be a *SINGLE* value
across the entire system.  It also makes me wonder *which* xgetbv is
expected to match USER_XSTATE_XCR0_WORD.  It can't be the ptracee since
we expect them to change XCR0.  It can't be the ptracer because they can
use this new prctl too.  So does it refer to the kernel?  Or, should the
new prctl() *disable* future ptrace()s?

> If so, I think that's just a buggy ptracer. The kernel's xfeature
> mask is available via ptrace and a well-behaved ptracer should use
> that (e.g. gdb does, though it looks like it then also assumes that
> the xstate has no holes, so it potentially gets the layout wrong
> anyway).

I'm trying to figure out what the semantics are of this whole thing.  It
can't be "don't let userspace observe the real XCR0" because ptrace
exposes that.  Is it, "make memory images portable, unless it's a memory
image from ptrace"?

> In general, I don't really want the modified XCR0 to affect
> anything other than the particular instructions that depend
> on it and maybe the signal frame (though as I said before,
> I'm open to either here).

Just remember that, in the end, we don't get to say what a good ptracer
or bad ptracer is.  If they're expecting semantics that we've kept
constant for 10 years, we change the semantics, and the app breaks, the
kernel is in the wrong.

I also don't feel like I have a good handle on what ptracers *do* with
their XSAVE buffers that they get/set.  How many apps in a distro do
something with this interface?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 20:21         ` Dave Hansen
@ 2020-04-07 21:42           ` Andy Lutomirski
  2020-04-07 22:15           ` Keno Fischer
  1 sibling, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2020-04-07 21:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Keno Fischer, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan



> On Apr 7, 2020, at 1:21 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/7/20 10:55 AM, Keno Fischer wrote:
>>> On Tue, Apr 7, 2020 at 12:27 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>>>> How does this work with things like xstateregs_[gs]et() where the format
>>>>> of the kernel buffer and thus the kernel XCR0 is exposed as part of our
>>>>> ABI?  With this patch, wouldn't a debugger app see a state buffer that
>>>>> looks invalid?
>>>> 
>>>> Since those operate on the in-kernel buffer of these, which
>>>> in this patch always uses the unmodified XCR0, ptracers
>>>> should not observe a difference.
>>> 
>>> Those operate on *BOTH* kernel and userspace buffers.  They copy between
>>> them.  That's kinda the point. :)
>> 
>> Right, what I meant was that in this patch the kernel level
>> xsaves that populates the struct fpu always runs with
>> an unmodified XCR0, so the contents of the xsave area
>> in struct fpu does not change layout (this is the major
>> change in this patch over v1).
> 
> The userspace buffer is... a userspace buffer.  It is not and should not
> be tied to the format of the kernel buffer.
> 
>> Are you referring to a ptracer which runs with a modified XCR0, and
>> assumes that the value it gets back from ptrace will have an
>> XSTATE_BV equal to its own observed XCR0 and thus get confused about
>> the layout of the buffer (or potentially have not copied all of the
>> relevant xstate because it calculated a wrong buffer size)?
> 
> I don't think it's insane for a process to assume that it can XRSTOR a
> buffer that it gets back from ptrace.  That seems like something that
> could clearly be an ABI that apps depend on.
> 
> Also, let's look at the comment about where XCR0 shows up in the ABI
> (arch/x86/include/asm/user.h):
> 
>> * For now, only the first 8 bytes of the software usable bytes[464..471] will
>> * be used and will be set to OS enabled xstate mask (which is same as the
>> * 64bit mask returned by the xgetbv's xCR0).
> 
> That also makes it sound like we expect there to be a *SINGLE* value
> across the entire system.  It also makes me wonder *which* xgetbv is
> expected to match USER_XSTATE_XCR0_WORD.  It can't be the ptracee since
> we expect them to change XCR0.  It can't be the ptracer because they can
> use this new prctl too.  So does it refer to the kernel?  Or, should the
> new prctl() *disable* future ptrace()s?
> 
>> If so, I think that's just a buggy ptracer. The kernel's xfeature
>> mask is available via ptrace and a well-behaved ptracer should use
>> that (e.g. gdb does, though it looks like it then also assumes that
>> the xstate has no holes, so it potentially gets the layout wrong
>> anyway).
> 
> I'm trying to figure out what the semantics are of this whole thing.  It
> can't be "don't let userspace observe the real XCR0" because ptrace
> exposes that.  Is it, "make memory images portable, unless it's a memory
> image from ptrace"?
> 
>> In general, I don't really want the modified XCR0 to affect
>> anything other than the particular instructions that depend
>> on it and maybe the signal frame (though as I said before,
>> I'm open to either here).
> 
> Just remember that, in the end, we don't get to say what a good ptracer
> or bad ptracer is.  If they're expecting semantics that we've kept
> constant for 10 years, we change the semantics, and the app breaks, the
> kernel is in the wrong.
> 
> I also don't feel like I have a good handle on what ptracers *do* with
> their XSAVE buffers that they get/set.  How many apps in a distro do
> something with this interface?

Most of them treat it as bytes to be blindly stuck back into the tracee.  Some of them will display some registers for the user’s benefit.  Maybe a couple that I don’t know about do something odd.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 20:21         ` Dave Hansen
  2020-04-07 21:42           ` Andy Lutomirski
@ 2020-04-07 22:15           ` Keno Fischer
  2020-04-14 19:55             ` Keno Fischer
  1 sibling, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-07 22:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

> The userspace buffer is... a userspace buffer.  It is not and should not
> be tied to the format of the kernel buffer.

I don't think I disagree with you. However, I should point out that in
the kernel currently, the layout of this userspace buffer changes depending
on the setting of the kernel XCR0. (Is this a good idea? Probably not -
certainly all the users I've looked at get this wrong and assume the
layout is fixed, just variably sized - but that is how the kernel currently
works, so we're probably stuck with it.)

> > Are you referring to a ptracer which runs with a modified XCR0, and
> > assumes that the value it gets back from ptrace will have an
> > XSTATE_BV equal to its own observed XCR0 and thus get confused about
> > the layout of the buffer (or potentially have not copied all of the
> > relevant xstate because it calculated a wrong buffer size)?
>
> I don't think it's insane for a process to assume that it can XRSTOR a
> buffer that it gets back from ptrace.  That seems like something that
> could clearly be an ABI that apps depend on.

Yes, that's fair, but I'm less worried about the case where
the ptracer itself is running with a modified XCR0,
because something had to explicitly opt that
process into that behavior. Whatever did the opting-in can verify
that the application works fine, or alternatively fix up
the system calls to emulate whatever behavior
that user application wants to build from this primitive.
I also believe the issue I mentioned above - that the ptrace
xstate buffer is compacted when the kernel XCR0 has holes - will
cause such a buffer to fail a plain XRSTOR, and it won't work
with XRSTORS either, because we don't write XCOMP_BV.

> Also, let's look at the comment about where XCR0 shows up in the ABI
> (arch/x86/include/asm/user.h):
>
> >  * For now, only the first 8 bytes of the software usable bytes[464..471] will
> >  * be used and will be set to OS enabled xstate mask (which is same as the
> >  * 64bit mask returned by the xgetbv's xCR0).
>
> That also makes it sound like we expect there to be a *SINGLE* value
> across the entire system.  It also makes me wonder *which* xgetbv is
> expected to match USER_XSTATE_XCR0_WORD.  It can't be the ptracee since
> we expect them to change XCR0.  It can't be the ptracer because they can
> use this new prctl too.  So does it refer to the kernel?  Or, should the
> new prctl() *disable* future ptrace()s?

It can't be the ptracee's XCR0, because that would be
a breaking change for ptracers, which haven't opted into any
XCR0 modification. I don't think it should be the ptracer's
XCR0 either, because that would make a ptracer with a
modified XCR0 unable to trace a tracee with the full XCR0,
which, while potentially an acceptable trade-off, seems unnecessary.

> > If so, I think that's just a buggy ptracer. The kernel's xfeature
> > mask is available via ptrace and a well-behaved ptracer should use
> > that (e.g. gdb does, though it looks like it then also assumes that
> > the xstate has no holes, so it potentially gets the layout wrong
> > anyway).
>
> I'm trying to figure out what the semantics are of this whole thing.  It
> can't be "don't let userspace observe the real XCR0" because ptrace
> exposes that.  Is it, "make memory images portable, unless it's a memory
> image from ptrace"?

The semantics I want are "make userspace instructions behave
in the way that they would if XCR0 was this value". I'm open to
additionally extending this to sigframes, because I can't think of
a situation in which writing all those extra zeros would be useful,
but I'm also ok with the argument that it shouldn't affect any kernel
behavior whatsoever as Andy was suggesting earlier in the thread.

You can build something with more complete semantics on top of it,
as rr would, to more fully emulate the environment. You're probably
gonna be needing to trap CPUID anyway to mask off the relevant
features. I would prefer if the kernel didn't make assumptions here,
and just gave me the minimal primitive to build on top of.
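(To make the primitive concrete, the intended usage is roughly the
following - note the ARCH_SET_XCR0 number below is just a placeholder
for illustration; the real value is whatever this patch ends up
assigning in asm/prctl.h:)

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_SET_XCR0
#define ARCH_SET_XCR0 0x1021	/* placeholder, see note above */
#endif

int main(void)
{
	/* Make this thread look like a pre-AVX-512 part: x87 | SSE | AVX. */
	uint64_t xcr0 = (1ULL << 0) | (1ULL << 1) | (1ULL << 2);

	return syscall(SYS_arch_prctl, ARCH_SET_XCR0, xcr0) != 0;
}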

> > In general, I don't really want the modified XCR0 to affect
> > anything other than the particular instructions that depend
> > on it and maybe the signal frame (though as I said before,
> > I'm open to either here).
>
> Just remember that, in the end, we don't get to say what a good ptracer
> or bad ptracer is.  If they're expecting semantics that we've kept
> constant for 10 years, we change the semantics, and the app breaks, the
> kernel is in the wrong.

Every ptracer is a bad ptracer - using this API "correctly" is essentially
impossible ;). I certainly agree that we can't change the behavior for
ptracers that haven't opted into XCR0 modification, but I don't think
that imposes any restriction on what a ptracer with a modified XCR0
does (well, there are three possible options, and two reasonable ones,
so it should be one of those).

> I also don't feel like I have a good handle on what ptracers *do* with
> their XSAVE buffers that they get/set.  How many apps in a distro do
> something with this interface?

Well, anything that's a debugger would: gdb, lldb, rr.
CRIU's compel tool does too.
However, most ptracers probably don't touch the fpu state.
strace doesn't, proot doesn't.
I did some grepping around and the only other ptracer
I could find that was using this interface is
... linux (in um mode).

As a quick survey, gdb reads the kernel
XCR0 from the xstate and uses that to compute
the size of the xsave area (which I think is the most correct approach).
lldb and rr use cpuid, which should be a
fine upper bound (overallocating is fine for ptrace)
assuming nobody is messing with cpuid *cough*,
though of course what happens then is that they
just won't see the extra state. criu always uses 4096 bytes.
linux um requests the maximum it knows about.
None of them (including linux-um, oops) are aware
that the layout of the ptrace buffer is compressed if
XCR0 has holes in it (I'll leave it up to you to decide
whether that means we should fix it while hardware
with holes in its XCR0 is still rare - though you can
get into this situation with cmdline flags). All those
applications will have incorrect behavior if the kernel
XCR0 has a hole in it (they don't care about the
user XCR0 though, so that's somewhat unrelated
to this patch).

I don't see any compelling reason to hide additional
xstate from a ptracer with modified xcr0 if it asks for it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 22:15           ` Keno Fischer
@ 2020-04-14 19:55             ` Keno Fischer
  0 siblings, 0 replies; 26+ messages in thread
From: Keno Fischer @ 2020-04-14 19:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan, Andy Lutomirski,
	Peter Zijlstra

Hi everyone,

I'd like to continue this discussion along two directions:

1) In this patch, what should happen to signal frames?

I continue to think that it would be good for these to observe
the process' XCR0, but I understand the argument that we
should not let the XCR0 setting modify any kernel behavior
whatsoever. Andy, I would in particular appreciate your views
on this since I believe you thought it should do the latter.

2) What would a solution based on the raw KVM API look like?

I'm still afraid that going down the KVM route would just end up
back in the same situation as we're in right now, but I'd like to
explore this further, so here's my current thinking: Particularly for
recording, the process does need to look very much like a regular
linux process, so we can get recording of syscalls and signal state right.
I don't have enough of an intuition for the performance implications
of this. For example, suppose we added a way for the kernel to
directly take syscalls from guest CPL3 - what would the cost
of incurring a vmexit for every syscall be? I suppose another
idea would be to build a minimal linux kernel that sits in guest
CPL0 and emulates at least the process state and other high
frequency syscalls, but forwards the rest to the host kernel.
Seems potentially doable, but a bit brittle - is there prior art
here I should be aware of, e.g. from people looking at securing
containers? As I mentioned, I had looked at Project Dune
before (http://dune.scs.stanford.edu/), which does seem to
do a lot of the things I would need, though it doesn't appear
to currently be handling signals at all, and of course it's also
not really KVM based, but rather
KVM-but-copy-pasted-and-manually-hacked-up-in-a-separate.ko
based.

I may also be missing a completely obvious way to do this -
my apologies if so. I would certainly appreciate any insight on
how to achieve the set of requirements here (multiple tracees
with potentially differing XCR0 values, faithful and performant
provision of syscalls/signals to the tracees) on top of KVM.

If we can figure out a good way forward with KVM, I'd be quite
interested in it, since I think there may be additional performance
games that could be played by having part of rr be in guest CPL0,
I'm just unsure that KVM is really the right abstraction here, so
I'd like to think through it a bit.
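(For what it's worth, the XCR0 piece of the KVM route is already there:
the tracer controls the guest's XCR0 through the existing KVM_SET_XCRS
ioctl. A minimal sketch, with VM creation, memory setup and the CPL3
entry state all elided:)

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Set the vCPU's XCR0 from the tracer side; the rest of the vCPU state
 * would be set up via KVM_SET_SREGS/KVM_SET_REGS as usual. */
static int set_guest_xcr0(int vcpu_fd, __u64 xcr0)
{
	struct kvm_xcrs xcrs = {
		.nr_xcrs = 1,
		.xcrs    = { { .xcr = 0, .value = xcr0 } },
	};

	return ioctl(vcpu_fd, KVM_SET_XCRS, &xcrs);
}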

Keno

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-07 18:30       ` Keno Fischer
@ 2020-04-14 23:20         ` Andy Lutomirski
  2020-04-15  0:09           ` Keno Fischer
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2020-04-14 23:20 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Dave Hansen, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Tue, Apr 7, 2020 at 11:30 AM Keno Fischer <keno@juliacomputing.com> wrote:
>
> > TSX!
>
> Yes, it's problematic, but luckily turns out to
> be ok in practice if masked off in cpuid.
>
> > I think rr should give the raw KVM API at least a try.  It should be possible to fire up a vCPU in CPL3 in the correct state.  No guest kernel required.  I don’t know if there will be issues with the perf API, though.
>
> Yes, I've looked into it, but stopped short of doing a
> complete implementation. Using KVM to solve it
> for replay would probably be feasible with a moderate
> amount of engineering work, since rr does very few
> syscalls during replay. I'm a bit afraid of the
> performance implications, but I don't have numbers on this.
>
> Record and diversions are a lot harder though, because
> in this mode the tracee is a live process and able to do
> syscalls (and needs to receive signals and all that good
> stuff associated with being a real process). For diversions,
> performance isn't super important, so we could probably
> emulate this, but for record, performance is quite critical.
> I assume it would be possible to add a feature to KVM
> where it forwards syscalls made in guest CPL3 to the real
> kernel without round-trip through userspace, but I'm just
> seeing myself back here asking
> for a weird KVM feature that nobody but me wants ;)
> (well almost nobody, as I mentioned, there's an
> academic project that tried this with a custom kernel
> plugin - http://dune.scs.stanford.edu/).
>
> Admittedly, the use case for this feature during record is
> less pressing, since in our (operational) case
> the replay machines tend to be much newer than
> the record machines, but I wouldn't be surprised if I got
> bit by this as soon as the next user xstate component gets
> added and users start sending me those kinds of traces,
> even if we mask off the feature in CPUID (which rr already
> supports for record for similar reasons).

I'm imagining that rr would do recording the usual way with a normal XCR0
(why would you want to record with an unusual XCR0?) and replay would
use KVM.  I'm not sure about diversions.  This way KVM wouldn't need
to deal with syscalls.

Would this work?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-14 23:20         ` Andy Lutomirski
@ 2020-04-15  0:09           ` Keno Fischer
  2020-04-16  1:07             ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-15  0:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

> (why would you want to record with an unusual XCR0?)

I was indeed primarily thinking about the replay case when I
originally wrote this patch, but that is mainly because the machine
I tend to replay on is fairly new, so the traces we get tend
to be compatible. I have also encountered the opposite situation,
where I wanted to send somebody a trace for them to debug, but
their machine didn't support AVX512, so I wanted to disable it.
A while ago we also encountered a similar situation where
PKRU became available on AWS so we couldn't
replay traces from there anymore.
We do of course have CPUID faulting here (which well-behaved
userspace software tends to respect), but the differing return
value from xgetbv messed up the recording nonetheless. In
order to actually make that work, masking those bits out from
XCR0 would have been required. I'm also thinking about the
future where there may be more diversity in XCR0 user state
components and what chips support what.
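(The leak is the instruction itself: xgetbv with ecx=0 returns the real
XCR0 no matter what we do to CPUID, so recording and replay machines
with different enabled feature sets read different values. Roughly, a
sketch of what well-behaved userspace does to size its xsave buffers -
and the reason CPUID faulting alone isn't enough for us:)

#include <stdint.h>

static inline uint64_t xgetbv0(void)
{
	uint32_t eax, edx;

	/* XGETBV with ECX = 0 reads XCR0 into EDX:EAX. */
	__asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
	return ((uint64_t)edx << 32) | eax;
}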

I assume the objection here will be that we can have our users
reboot with a different kernel command line, which is true enough
and what I will recommend to people for the time being. That said,
part of the appeal for users of rr is that it doesn't require privilege,
so it works in locked down environments like HPC clusters where
users can't easily change the kernel parameters (even KVM can
be an ask here - but is in my experience easier to negotiate with
the sysadmins).

> and replay would use KVM.

Yes, there is still a bit of state management left (e.g. we currently
rely on the kernel for mmap, fork/clone semantics and signal
delivery for deterministic signals) - though of course emulating
all that is much easier than the record side.

> I'm not sure about diversions.

Diversions (terminology note for those not familiar with record
and replay systems - a diversion is a replay that gets turned
back into a real process for the purpose of investigating the
state - think invoking a printing function from the debugger)
are similar to record in that they do handle syscalls (e.g.
for open/read/write for printing, mmap if the process allocates
during a diversion, etc.). We do have more performance
leeway for diversions than we do for records, so we might
be okay without getting too deep into KVM hacking.

I think my overall take-away is that KVM is probably feasible
for replay and diversions, but would require large patches
to KVM to be feasible for record. Unfortunately, that's
also the opposite of where getting a patch like this
into the mainline kernel would be most useful for us.
We do tend to have control over the replay machines,
so if we need to carry a patch like this, it's feasible (as
indeed we have been for the past two years), though
of course we'd like to enable this also for people who
don't want to build their own kernel (as well as for people
for whom that assumption doesn't hold - the assumption that
we control the replay machine is probably quite specific to
how we in particular use this technology).

Hope that helps,
Keno

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-15  0:09           ` Keno Fischer
@ 2020-04-16  1:07             ` Andy Lutomirski
  2020-04-16  1:14               ` Keno Fischer
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2020-04-16  1:07 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Tue, Apr 14, 2020 at 5:10 PM Keno Fischer <keno@juliacomputing.com> wrote:
>
> > (why would you want to record with an unusual XCR0?)
>

Hmm.  I don't personally have a strong objection to allowing XCR0 to
be overridden as long as it's done reasonably sanely.  And maybe it
should require a sysctl to be enabled.

Would it make matters easier if tasks with nonstandard XCR0 were not
allowed to use ptrace() at all?  And if ARCH_SET_XCR0 were disallowed
if the caller is tracing anyone?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-16  1:07             ` Andy Lutomirski
@ 2020-04-16  1:14               ` Keno Fischer
  2020-04-16  1:16                 ` Keno Fischer
  0 siblings, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-16  1:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

> Would it make matters easier if tasks with nonstandard XCR0 were not
> allowed to use ptrace() at all?  And if ARCH_SET_XCR0 were disallowed
> if the caller is tracing anyone?

That would be fine by me (as long as you're still allowed to ptrace them of
course). I do think that using the kernel XCR0 is the best choice, but since
I don't really have a use case for it, I'm happy to disallow that and
let anybody who does have a use case come back here and
argue for it one way or the other.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-16  1:14               ` Keno Fischer
@ 2020-04-16  1:16                 ` Keno Fischer
  2020-04-16  1:22                   ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Keno Fischer @ 2020-04-16  1:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Wed, Apr 15, 2020 at 9:14 PM Keno Fischer <keno@juliacomputing.com> wrote:
>
> > Would it make matters easier if tasks with nonstandard XCR0 were not
> > allowed to use ptrace() at all?  And if ARCH_SET_XCR0 were disallowed
> > if the caller is tracing anyone?
>
> That would be fine by me (as long as you're still allowed to ptrace them of
> course).

Sorry, I realized after I had hit send that this wording may not be clear.
What I meant was that it would still need to be possible for an external
ptracer (with an unmodified XCR0) to attach to the task, even if the task
had modified its XCR0.
I don't think you were suggesting that that wouldn't be possible,
but I just wanted to make sure.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread
  2020-04-16  1:16                 ` Keno Fischer
@ 2020-04-16  1:22                   ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2020-04-16  1:22 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen,
	Kyle Huey, Robert O'Callahan

On Wed, Apr 15, 2020 at 6:17 PM Keno Fischer <keno@juliacomputing.com> wrote:
>
> On Wed, Apr 15, 2020 at 9:14 PM Keno Fischer <keno@juliacomputing.com> wrote:
> >
> > > Would it make matters easier if tasks with nonstandard XCR0 were not
> > > allowed to use ptrace() at all?  And if ARCH_SET_XCR0 were disallowed
> > > if the caller is tracing anyone?
> >
> > That would be fine by me (as long as you're still allowed to ptrace them of
> > course).
>
> Sorry, I realized after I had hit send that this wording may not be clear.
> What I meant was that it would still need to be possible for an external
> ptracer (with an unmodified XCR0) to attach to the task, even if the task
> had modified its XCR0.
> I don't think you were suggesting that that wouldn't be possible,
> but I just wanted to make sure.

Yes, exactly.  Just to make sure we're on the same page, I suggest:

If a process modifies XCR0, then it cannot use ptrace().  Signal
delivery and sigreturn use the modified XCR0.  If you modify your XCR0
from within a signal handler, you get to keep both pieces.  If you
ptrace() a process with a modified XCR0, you see the full regset.
Among other things, this means that you could ptrace() a task with a
reduced XCR0, poke a value in one of the disabled register sets with
ptrace(), and read that same value back out again with ptrace().

Before you implement this, you might want to make sure that at least
one other x86 maintainer agrees with me. :)

I'm sure the CRIU people will notice this and want to find a way to
make ptrace() work from a modified-XCR0 process.  They are welcome to
propose semantics, since neither of the obvious ways to handle it
actually seem correct.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-04-16  1:23 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-07  1:12 [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread Keno Fischer
2020-04-07  3:57 ` Andy Lutomirski
2020-04-07  4:44   ` Keno Fischer
2020-04-07  4:53     ` Kyle Huey
2020-04-07 12:33       ` Peter Zijlstra
2020-04-07 13:52         ` Keno Fischer
2020-04-07 12:21 ` Peter Zijlstra
2020-04-07 14:06   ` Dave Hansen
2020-04-07 14:16     ` Andy Lutomirski
2020-04-07 18:30       ` Keno Fischer
2020-04-14 23:20         ` Andy Lutomirski
2020-04-15  0:09           ` Keno Fischer
2020-04-16  1:07             ` Andy Lutomirski
2020-04-16  1:14               ` Keno Fischer
2020-04-16  1:16                 ` Keno Fischer
2020-04-16  1:22                   ` Andy Lutomirski
2020-04-07 16:29     ` Kyle Huey
2020-04-07 13:14 ` Dave Hansen
     [not found]   ` <CABV8kRw1TQsqs+z43bSfZ5isctuFGMB4g_ztDYihiiXHcy4nVA@mail.gmail.com>
2020-04-07 16:27     ` Dave Hansen
2020-04-07 17:55       ` Keno Fischer
2020-04-07 20:21         ` Dave Hansen
2020-04-07 21:42           ` Andy Lutomirski
2020-04-07 22:15           ` Keno Fischer
2020-04-14 19:55             ` Keno Fischer
2020-04-07 14:20 ` Andi Kleen
2020-04-07 18:06   ` Keno Fischer
