* [PATCH 00/19] x86, mpx updates for 4.2 (take 9)
@ 2015-06-07 18:37 Dave Hansen
  2015-06-07 18:37 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
                   ` (18 more replies)
  0 siblings, 19 replies; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen


Changes from take 8 / v23:
 * Respond to some of Ingo's comments about get_xsave_field_ptr():
   1. add const in return type, and fix up the cascading need for
      'const' in other parts of the MPX code and tracepoints.
   2. add a local 'fpu' var
   3. Update patch description

Changes from take 7 / v22:
 * Add Thomas's reviewed-by
 * merge with tip/x86/fpu changes
 * Fix tiny spelling nit

Changes from take 6 / v21:
 * Address a bunch of Thomas's review comments.

Changes from take 5 / v20:

 * Fix get_xsave_addr() to consult xstate_bv in anticipation
   of fixes to xsave code.
 * Bug fix for when a VMA being unmapped has neighbors which
   are bounds tables.
 * Rewrite unmapping code.  I didn't do this lightly. It was
   not originally my own code, and I resisted changing it
   because it worked.  But, I started bug chasing and decided
   it was unmaintainable.  The rewrite ended up removing
   about 20% of the unmapping code and made it much simpler.

Changes from take 4 / v19:

 * Do not pass a task_struct around when we are
   really just going to operate on current

Changes from take 3 / v18 (all minor):

 * use DECLARE_EVENT_CLASS()/DEFINE_EVENT() for
   the ranged tracepoints to save 10 lines of code.

Changes from take 2 / v17 (all minor):

 * fix a couple of whitespace borkages caught by checkpatch,
   and a spelling error or two.
 * replace printk with pr_info() for boot disable
 * change trace print format for address intervals
 * fix up variable name in tsk_get_xsave_addr() comment
 * remove tsk_get_xsave_field() GPL export
 * fix up Qiaowei's From:

--

Hi x86 maintainers,

There are a few basic things going on here:
1. Make FPU/xsave code cleaner and less error-prone
2. Add trace points to make kernel and app debugging easier
3. Add a boot-time disable for mpx
4. Rewrite the unmapping code.
5. Support running 32-bit binaries on 64-bit kernels

This set is also available against tip/x86/fpu (8c05f05edb) in git:

  git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx.git mpx-v24

Dave Hansen (19):
      x86, mpx, xsave: Fix up bad get_xsave_addr() assumptions
      x86, fpu: Wrap get_xsave_addr() to make it safer
      x86, mpx: Use new get_xsave_field_ptr()
      x86, mpx: Cleanup: Do not pass task around when unnecessary
      x86, mpx: remove redundant MPX_BNDCFG_ADDR_MASK
      x86, mpx: Restrict mmap size check to bounds tables
      x86, mpx: boot-time disable
      x86, mpx: trace #BR exceptions
      x86, mpx: trace entry to bounds exception paths
      x86, mpx: Trace the attempts to find bounds tables
      x86, mpx: trace allocation of new bounds tables
      x86: make is_64bit_mm() widely available
      x86, mpx: Add temporary variable to reduce masking
      x86, mpx: new directory entry to addr helper
      x86, mpx: do 32-bit-only cmpxchg for 32-bit apps
      x86, mpx: support 32-bit binaries on 64-bit kernel
      x86, mpx: rewrite unmap code
      x86, mpx: do not count MPX VMAs as neighbors when unmapping
      x86, mpx: allow mixed binaries again

 Documentation/kernel-parameters.txt |   4 +
 arch/x86/include/asm/fpu/xstate.h   |   1 +
 arch/x86/include/asm/mmu_context.h  |  13 +
 arch/x86/include/asm/mpx.h          |  74 ++--
 arch/x86/include/asm/processor.h    |  12 +-
 arch/x86/include/asm/trace/mpx.h    | 132 +++++++
 arch/x86/kernel/cpu/common.c        |  16 +
 arch/x86/kernel/fpu/xstate.c        |  77 ++++-
 arch/x86/kernel/traps.c             |  20 +-
 arch/x86/kernel/uprobes.c           |  10 +-
 arch/x86/mm/mpx.c                   | 514 +++++++++++++++++-----------
 kernel/sys.c                        |   8 +-
 12 files changed, 610 insertions(+), 271 deletions(-)



* [PATCH 01/19] x86, mpx, xsave: Fix up bad get_xsave_addr() assumptions
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
  2015-06-07 18:37 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:30   ` [tip:x86/fpu] x86/fpu/xstate: " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 03/19] x86, mpx: Use new get_xsave_field_ptr() Dave Hansen
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

get_xsave_addr() assumes that if an xsave bit is present in the
hardware (pcntxt_mask) that it is present in a given xsave
buffer.  Due to a bug in the xsave code on all of the systems
that have MPX (and thus all the users of this code), that has
been a true assumption.

But, the bug is getting fixed, so our assumption is not going
to hold any more.

It's quite possible (and normal) for an enabled state to be
present on 'pcntxt_mask', but *not* in 'xstate_bv'.  We need
to consult 'xstate_bv'.
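
As a sketch of what this means for callers (illustrative only; the
XSTATE_BNDCSR usage matches the MPX code later in this series, and the
surrounding error handling is made up):

	struct bndcsr *bndcsr;

	bndcsr = get_xsave_addr(&current->thread.fpu.state.xsave,
				XSTATE_BNDCSR);
	if (!bndcsr) {
		/* field not present in the buffer: treat as 'init state' */
		return;
	}
	/* bndcsr->bndcfgu / bndcsr->bndstatus are valid to read here */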

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/kernel/fpu/xstate.c |   45 +++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff -puN arch/x86/kernel/fpu/xstate.c~consullt-xstate_bv arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~consullt-xstate_bv	2015-06-01 10:24:03.025676699 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-06-01 10:24:03.029676880 -0700
@@ -382,19 +382,48 @@ void fpu__resume_cpu(void)
  * This is the API that is called to get xstate address in either
  * standard format or compacted format of xsave area.
  *
+ * Note that if there is no data for the field in the xsave buffer
+ * this will return NULL.
+ *
  * Inputs:
- *	xsave: base address of the xsave area;
- *	xstate: state which is defined in xsave.h (e.g. XSTATE_FP, XSTATE_SSE,
- *	etc.)
+ *	xstate: the thread's storage area for all FPU data
+ *	xstate_feature: state which is defined in xsave.h (e.g.
+ *	XSTATE_FP, XSTATE_SSE, etc...)
  * Output:
- *	address of the state in the xsave area.
+ *	address of the state in the xsave area, or NULL if the
+ *	field is not present in the xsave buffer.
  */
-void *get_xsave_addr(struct xregs_state *xsave, int xstate)
+void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 {
-	int feature = fls64(xstate) - 1;
-	if (!test_bit(feature, (unsigned long *)&xfeatures_mask))
+	int feature_nr = fls64(xstate_feature) - 1;
+	/*
+	 * Do we even *have* xsave state?
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVE))
+		return NULL;
+
+	xsave = &current->thread.fpu.state.xsave;
+	/*
+	 * We should not ever be requesting features that we
+	 * have not enabled.  Remember that pcntxt_mask is
+	 * what we write to the XCR0 register.
+	 */
+	WARN_ONCE(!(xfeatures_mask & xstate_feature),
+		  "get of unsupported state");
+	/*
+	 * This assumes the last 'xsave*' instruction to
+	 * have requested that 'xstate_feature' be saved.
+	 * If it did not, we might be seeing an old value
+	 * of the field in the buffer.
+	 *
+	 * This can happen because the last 'xsave' did not
+	 * request that this feature be saved (unlikely)
+	 * or because the "init optimization" caused it
+	 * to not be saved.
+	 */
+	if (!(xsave->header.xfeatures & xstate_feature))
 		return NULL;
 
-	return (void *)xsave + xstate_comp_offsets[feature];
+	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
_


* [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:31   ` [tip:x86/fpu] x86/fpu/xstate: " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 01/19] x86, mpx, xsave: Fix up bad get_xsave_addr() assumptions Dave Hansen
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code is calling a low-level FPU function
(copy_fpregs_to_fpstate()).  This function cannot be
called safely in all contexts, although it is safe to call
directly in some cases.

Although probably correct, the current code is ugly and
potentially error-prone.  So, add a wrapper that calls
the (slightly) higher-level fpu__save() (which is preempt-
safe) and also ensures that we even *have* an FPU context
(in the case that this was called when in lazy FPU mode).

Ingo had this to say about the details about when we need
preemption disabled:

> it's indeed generally unsafe to access/copy FPU registers with preemption enabled,
> for two reasons:
>
>   - on older systems that use FSAVE the instruction destroys FPU register
>     contents, which has to be handled carefully
>
>   - even on newer systems if we copy to FPU registers (which this code doesn't)
>     then we don't want a context switch to occur in the middle of it, because a
>     context switch will write to the fpstate, potentially overwriting our new data
>     with old FPU state.
>
> But it's safe to access FPU registers with preemption enabled in a couple of
> special cases:
>
>   - potentially destructively saving FPU registers: the signal handling code does
>     this in copy_fpstate_to_sigframe(), because it can rely on the signal restore
>     side to restore the original FPU state.
>
>   - reading FPU registers on modern systems: we don't do this anywhere at the
>     moment, mostly to keep symmetry with older systems where FSAVE is
>     destructive.
>
>   - initializing FPU registers on modern systems: fpu__clear() does this. Here
>     it's safe because we don't copy from the fpstate.
>
>   - directly writing FPU registers from user-space memory (!). We do this in
>     fpu__restore_sig(), and it's safe because neither context switches nor
>     irq-handler FPU use can corrupt the source context of the copy (which is
>     user-space memory).
>
> Note that the MPX code's current use of copy_fpregs_to_fpstate() was safe I think,
> because:
>
>  - MPX is predicated on eagerfpu, so the destructive F[N]SAVE instruction won't be
>    used.
>
>  - the code was only reading FPU registers, and was doing it only in places that
>    guaranteed that an FPU state was already active (i.e. didn't do it in
>    kthreads)
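
For reference, a minimal sketch of how a caller is expected to use the
new wrapper (this mirrors the MPX conversion later in this series; the
-EINVAL error handling is only illustrative):

	const struct bndcsr *bndcsr;

	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
	if (!bndcsr)
		return -EINVAL;	/* no FPU state, or field in 'init state' */
	/* bndcsr->bndcfgu and bndcsr->bndstatus can now be read */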

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: the arch/x86 maintainers <x86@kernel.org>

---

Changes from take 8 / v23:
 * Add const specifier for get_xsave_field_ptr() return type
 * Add temporary 'fpu' variable to shorten up the code and
   make it look more consistent with the other FPU code.

Changes from v21:
 * add comments about preemption
 * rename helper to get_xsave_field_ptr()

Changes from "v19":
 * remove 'tsk' argument to get_xsave_addr() since the code
   can only realistically work on 'current', and fix up the
   comment a bit to match.

Changes from "v17":
 * fix s/xstate/xsave_field/ in the function comment
 * remove EXPORT_SYMBOL_GPL()

---

 b/arch/x86/include/asm/fpu/xstate.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c      |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff -puN arch/x86/include/asm/fpu/xstate.h~tsk_get_xsave_addr arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~tsk_get_xsave_addr	2015-06-01 10:24:03.427694831 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h	2015-06-01 10:24:03.432695056 -0700
@@ -41,5 +41,6 @@ extern u64 xstate_fx_sw_bytes[USER_XSTAT
 extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
 
 void *get_xsave_addr(struct xregs_state *xsave, int xstate);
+const void *get_xsave_field_ptr(int xstate_field);
 
 #endif
diff -puN arch/x86/kernel/fpu/xstate.c~tsk_get_xsave_addr arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~tsk_get_xsave_addr	2015-06-01 10:24:03.429694921 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-06-01 10:24:03.433695102 -0700
@@ -427,3 +427,35 @@ void *get_xsave_addr(struct xregs_state
 	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+/*
+ * This wraps up the common operations that need to occur when retrieving
+ * data from xsave state.  It first ensures that the current task was
+ * using the FPU and retrieves the data into a buffer.  It then calculates
+ * the offset of the requested field in the buffer.
+ *
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xsave_state: state which is defined in xsave.h (e.g. XSTATE_FP,
+ *	XSTATE_SSE, etc...)
+ * Output:
+ *	address of the state in the xsave area or NULL if the state
+ *	is not present or is in its 'init state'.
+ */
+const void *get_xsave_field_ptr(int xsave_state)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	if (!fpu->fpstate_active)
+		return NULL;
+	/*
+	 * fpu__save() takes the CPU's xstate registers
+	 * and saves them off to the 'fpu' memory buffer.
+	 */
+	fpu__save(fpu);
+
+	return get_xsave_addr(&fpu->state.xsave, xsave_state);
+}
_


* [PATCH 03/19] x86, mpx: Use new get_xsave_field_ptr()
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
  2015-06-07 18:37 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
  2015-06-07 18:37 ` [PATCH 01/19] x86, mpx, xsave: Fix up bad get_xsave_addr() assumptions Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:31   ` [tip:x86/fpu] x86/mpx: Use the new get_xsave_field_ptr()API tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 05/19] x86, mpx: remove redundant MPX_BNDCFG_ADDR_MASK Dave Hansen
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX registers (bndcsr/bndcfgu/bndstatus) are not directly
accessible via normal instructions.  They essentially act as
if they were floating point registers and are saved/restored
along with those registers.

There are two main paths in the MPX code where we care about
the contents of these registers:
	1. #BR (bounds) faults
	2. the prctl() code where we are setting MPX up

Both of those paths _might_ be called without the FPU having
been used.  That means that 'tsk->thread.fpu.state' might
never be allocated.

Also, copy_fpregs_to_fpstate() is not preempt-safe.  It was a
bug to call it without disabling preemption.  The new
get_xsave_field_ptr() calls fpu__save() instead, which properly
handles preemption.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>

---

Changes from v21:
 * rename get_xsave_field() to get_xsave_field_ptr()
---

 b/arch/x86/include/asm/mpx.h |    8 ++++----
 b/arch/x86/kernel/traps.c    |   17 ++++++++---------
 b/arch/x86/mm/mpx.c          |   30 +++++++++++++++---------------
 3 files changed, 27 insertions(+), 28 deletions(-)

diff -puN arch/x86/include/asm/mpx.h~use-new-tsk_get_xsave_addr arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~use-new-tsk_get_xsave_addr	2015-06-01 10:24:03.804711835 -0700
+++ b/arch/x86/include/asm/mpx.h	2015-06-01 10:24:03.810712106 -0700
@@ -60,8 +60,8 @@
 
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct xregs_state *xsave_buf);
-int mpx_handle_bd_fault(struct xregs_state *xsave_buf);
+				struct task_struct *tsk);
+int mpx_handle_bd_fault(struct task_struct *tsk);
 static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
 {
 	return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
@@ -78,11 +78,11 @@ void mpx_notify_unmap(struct mm_struct *
 		      unsigned long start, unsigned long end);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-					      struct xregs_state *xsave_buf)
+					      struct task_struct *tsk)
 {
 	return NULL;
 }
-static inline int mpx_handle_bd_fault(struct xregs_state *xsave_buf)
+static inline int mpx_handle_bd_fault(struct task_struct *tsk)
 {
 	return -EINVAL;
 }
diff -puN arch/x86/kernel/traps.c~use-new-tsk_get_xsave_addr arch/x86/kernel/traps.c
--- a/arch/x86/kernel/traps.c~use-new-tsk_get_xsave_addr	2015-06-01 10:24:03.805711880 -0700
+++ b/arch/x86/kernel/traps.c	2015-06-01 10:24:03.810712106 -0700
@@ -59,6 +59,7 @@
 #include <asm/fixmap.h>
 #include <asm/mach_traps.h>
 #include <asm/alternative.h>
+#include <asm/fpu/xstate.h>
 #include <asm/mpx.h>
 
 #ifdef CONFIG_X86_64
@@ -371,9 +372,8 @@ dotraplinkage void do_double_fault(struc
 dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 {
 	struct task_struct *tsk = current;
-	struct xregs_state *xsave_buf;
 	enum ctx_state prev_state;
-	struct bndcsr *bndcsr;
+	const struct bndcsr *bndcsr;
 	siginfo_t *info;
 
 	prev_state = exception_enter();
@@ -392,12 +392,11 @@ dotraplinkage void do_bounds(struct pt_r
 
 	/*
 	 * We need to look at BNDSTATUS to resolve this exception.
-	 * It is not directly accessible, though, so we need to
-	 * do an xsave and then pull it out of the xsave buffer.
+	 * A NULL here might mean that it is in its 'init state',
+	 * which is all zeros and indicates that MPX was not
+	 * responsible for the exception.
 	 */
-	copy_fpregs_to_fpstate(&tsk->thread.fpu);
-	xsave_buf = &(tsk->thread.fpu.state.xsave);
-	bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);
+	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
 		goto exit_trap;
 
@@ -408,11 +407,11 @@ dotraplinkage void do_bounds(struct pt_r
 	 */
 	switch (bndcsr->bndstatus & MPX_BNDSTA_ERROR_CODE) {
 	case 2:	/* Bound directory has invalid entry. */
-		if (mpx_handle_bd_fault(xsave_buf))
+		if (mpx_handle_bd_fault(tsk))
 			goto exit_trap;
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
-		info = mpx_generate_siginfo(regs, xsave_buf);
+		info = mpx_generate_siginfo(regs, tsk);
 		if (IS_ERR(info)) {
 			/*
 			 * We failed to decode the MPX instruction.  Act as if
diff -puN arch/x86/mm/mpx.c~use-new-tsk_get_xsave_addr arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~use-new-tsk_get_xsave_addr	2015-06-01 10:24:03.807711971 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:03.811712151 -0700
@@ -272,9 +272,9 @@ bad_opcode:
  * The caller is expected to kfree() the returned siginfo_t.
  */
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct xregs_state *xsave_buf)
+				struct task_struct *tsk)
 {
-	struct bndreg *bndregs, *bndreg;
+	const struct bndreg *bndregs, *bndreg;
 	siginfo_t *info = NULL;
 	struct insn insn;
 	uint8_t bndregno;
@@ -294,8 +294,8 @@ siginfo_t *mpx_generate_siginfo(struct p
 		err = -EINVAL;
 		goto err_out;
 	}
-	/* get the bndregs _area_ of the xsave structure */
-	bndregs = get_xsave_addr(xsave_buf, XSTATE_BNDREGS);
+	/* get bndregs field from current task's xsave area */
+	bndregs = get_xsave_field_ptr(XSTATE_BNDREGS);
 	if (!bndregs) {
 		err = -EINVAL;
 		goto err_out;
@@ -342,7 +342,7 @@ err_out:
 
 static __user void *task_get_bounds_dir(struct task_struct *tsk)
 {
-	struct bndcsr *bndcsr;
+	const struct bndcsr *bndcsr;
 
 	if (!cpu_feature_enabled(X86_FEATURE_MPX))
 		return MPX_INVALID_BOUNDS_DIR;
@@ -357,8 +357,7 @@ static __user void *task_get_bounds_dir(
 	 * The bounds directory pointer is stored in a register
 	 * only accessible if we first do an xsave.
 	 */
-	copy_fpregs_to_fpstate(&tsk->thread.fpu);
-	bndcsr = get_xsave_addr(&tsk->thread.fpu.state.xsave, XSTATE_BNDCSR);
+	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
 		return MPX_INVALID_BOUNDS_DIR;
 
@@ -389,9 +388,10 @@ int mpx_enable_management(struct task_st
 	 * directory into XSAVE/XRSTOR Save Area and enable MPX through
 	 * XRSTOR instruction.
 	 *
-	 * copy_xregs_to_kernel() is expected to be very expensive. Storing the bounds
-	 * directory here means that we do not have to do xsave in the unmap
-	 * path; we can just use mm->bd_addr instead.
+	 * The copy_xregs_to_kernel() beneath get_xsave_field_ptr() is
+	 * expected to be relatively expensive. Storing the bounds
+	 * directory here means that we do not have to do xsave in the
+	 * unmap path; we can just use mm->bd_addr instead.
 	 */
 	bd_base = task_get_bounds_dir(tsk);
 	down_write(&mm->mmap_sem);
@@ -497,12 +497,12 @@ out_unmap:
  * bound table is 16KB. With 64-bit mode, the size of BD is 2GB,
  * and the size of each bound table is 4MB.
  */
-static int do_mpx_bt_fault(struct xregs_state *xsave_buf)
+static int do_mpx_bt_fault(struct task_struct *tsk)
 {
 	unsigned long bd_entry, bd_base;
-	struct bndcsr *bndcsr;
+	const struct bndcsr *bndcsr;
 
-	bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);
+	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
 		return -EINVAL;
 	/*
@@ -525,7 +525,7 @@ static int do_mpx_bt_fault(struct xregs_
 	return allocate_bt((long __user *)bd_entry);
 }
 
-int mpx_handle_bd_fault(struct xregs_state *xsave_buf)
+int mpx_handle_bd_fault(struct task_struct *tsk)
 {
 	/*
 	 * Userspace never asked us to manage the bounds tables,
@@ -534,7 +534,7 @@ int mpx_handle_bd_fault(struct xregs_sta
 	if (!kernel_managing_mpx_tables(current->mm))
 		return -EINVAL;
 
-	if (do_mpx_bt_fault(xsave_buf)) {
+	if (do_mpx_bt_fault(tsk)) {
 		force_sig(SIGSEGV, current);
 		/*
 		 * The force_sig() is essentially "handling" this
_


* [PATCH 07/19] x86, mpx: boot-time disable
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (5 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 04/19] x86, mpx: Cleanup: Do not pass task around when unnecessary Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:32   ` [tip:x86/fpu] x86/mpx: Introduce a boot-time disable flag tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 09/19] x86, mpx: trace entry to bounds exception paths Dave Hansen
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

MPX has the _potential_ to cause some issues.  Say part of your init
system tried to protect one of its components from buffer overflows
with MPX.  If there were a false positive, it's possible that MPX
could keep a system from booting.

MPX could also potentially cause performance issues since it is
present in hot paths like the unmap path.

Allow it to be disabled at boot time.
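
For example (the kernel image name and root device below are just
placeholders), disabling MPX amounts to appending the new parameter to
the kernel command line and checking the boot log:

	linux /boot/vmlinuz-4.2 root=/dev/sda1 ro nompx

	$ dmesg | grep nompx
	nompx: Intel Memory Protection Extensions (MPX) disabled

Note that the pr_info() line only shows up when the CPU actually
supports MPX; on CPUs without the feature the option is silently
accepted.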

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/Documentation/kernel-parameters.txt |    4 ++++
 b/arch/x86/kernel/cpu/common.c        |   16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~x86-mpx-disable-boot-time arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x86-mpx-disable-boot-time	2015-06-01 10:24:05.614793473 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-06-01 10:24:05.620793744 -0700
@@ -144,6 +144,22 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_p
 } };
 EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
 
+static int __init x86_mpx_setup(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (strlen(s))
+		return 0;
+
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_MPX))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_MPX);
+	pr_info("nompx: Intel Memory Protection Extensions (MPX) disabled\n");
+	return 1;
+}
+__setup("nompx", x86_mpx_setup);
+
 #ifdef CONFIG_X86_32
 static int cachesize_override = -1;
 static int disable_x86_serial_nr = 1;
diff -puN Documentation/kernel-parameters.txt~x86-mpx-disable-boot-time Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~x86-mpx-disable-boot-time	2015-06-01 10:24:05.616793563 -0700
+++ b/Documentation/kernel-parameters.txt	2015-06-01 10:24:05.621793789 -0700
@@ -937,6 +937,10 @@ bytes respectively. Such letter suffixes
 			Enable debug messages at boot time.  See
 			Documentation/dynamic-debug-howto.txt for details.
 
+	nompx		[X86] Disables Intel Memory Protection Extensions.
+			See Documentation/x86/intel_mpx.txt for more
+			information about the feature.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
_


* [PATCH 05/19] x86, mpx: remove redundant MPX_BNDCFG_ADDR_MASK
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (2 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 03/19] x86, mpx: Use new get_xsave_field_ptr() Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:32   ` [tip:x86/fpu] x86/mpx: Remove " tip-bot for Qiaowei Ren
  2015-06-07 18:37 ` [PATCH 06/19] x86, mpx: Restrict mmap size check to bounds tables Dave Hansen
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, qiaowei.ren, dave.hansen


From: Qiaowei Ren <qiaowei.ren@intel.com>

MPX_BNDCFG_ADDR_MASK is defined twice, so this patch removes the
redundant definition.

Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/mpx.h |    1 -
 1 file changed, 1 deletion(-)

diff -puN arch/x86/include/asm/mpx.h~0001-x86-mpx-remove-redundant-MPX_BNDCFG_ADDR_MASK arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~0001-x86-mpx-remove-redundant-MPX_BNDCFG_ADDR_MASK	2015-06-01 10:24:04.792756398 -0700
+++ b/arch/x86/include/asm/mpx.h	2015-06-01 10:24:04.795756533 -0700
@@ -45,7 +45,6 @@
 #define MPX_BNDSTA_TAIL		2
 #define MPX_BNDCFG_TAIL		12
 #define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
-#define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BT_ADDR_MASK	(~((1UL<<MPX_BD_ENTRY_TAIL)-1))
 
 #define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
_


* [PATCH 06/19] x86, mpx: Restrict mmap size check to bounds tables
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (3 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 05/19] x86, mpx: remove redundant MPX_BNDCFG_ADDR_MASK Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:32   ` [tip:x86/fpu] x86/mpx: Restrict the mmap() " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 04/19] x86, mpx: Cleanup: Do not pass task around when unnecessary Dave Hansen
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The comment and code here are confusing.  We do not currently
allocate the bounds directory in the kernel.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/mpx.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/mpx.c~x86-mpx-we-do-not-allocate-the-bounds-directory arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~x86-mpx-we-do-not-allocate-the-bounds-directory	2015-06-01 10:24:05.203774935 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:05.207775116 -0700
@@ -46,8 +46,8 @@ static unsigned long mpx_mmap(unsigned l
 	vm_flags_t vm_flags;
 	struct vm_area_struct *vma;
 
-	/* Only bounds table and bounds directory can be allocated here */
-	if (len != MPX_BD_SIZE_BYTES && len != MPX_BT_SIZE_BYTES)
+	/* Only bounds table can be allocated here */
+	if (len != MPX_BT_SIZE_BYTES)
 		return -EINVAL;
 
 	down_write(&mm->mmap_sem);
_


* [PATCH 04/19] x86, mpx: Cleanup: Do not pass task around when unnecessary
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (4 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 06/19] x86, mpx: Restrict mmap size check to bounds tables Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:31   ` [tip:x86/fpu] x86/mpx: Clean up the code by not passing a task pointer " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 07/19] x86, mpx: boot-time disable Dave Hansen
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code can only work on the current task.  You can not, for
instance, enable MPX management in another process or thread.
You can also not handle a fault for another process or thread.

Despite this, we pass a task_struct around prolifically.  This
patch removes all of the task struct passing for code paths where
the code can not deal with another task (which turns out to be
all of them).

This has no functional changes.  It's just a cleanup.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>
---

 b/arch/x86/include/asm/mpx.h       |   10 ++++------
 b/arch/x86/include/asm/processor.h |   12 ++++++------
 b/arch/x86/kernel/traps.c          |    5 ++---
 b/arch/x86/mm/mpx.c                |   19 +++++++++----------
 b/kernel/sys.c                     |    8 ++++----
 5 files changed, 25 insertions(+), 29 deletions(-)

diff -puN arch/x86/include/asm/mpx.h~x86-mpx-dont-pass-current-around arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~x86-mpx-dont-pass-current-around	2015-06-01 10:24:04.227730914 -0700
+++ b/arch/x86/include/asm/mpx.h	2015-06-01 10:24:04.238731410 -0700
@@ -59,9 +59,8 @@
 		MPX_BT_ENTRY_MASK) << MPX_BT_ENTRY_SHIFT)
 
 #ifdef CONFIG_X86_INTEL_MPX
-siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct task_struct *tsk);
-int mpx_handle_bd_fault(struct task_struct *tsk);
+siginfo_t *mpx_generate_siginfo(struct pt_regs *regs);
+int mpx_handle_bd_fault(void);
 static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
 {
 	return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
@@ -77,12 +76,11 @@ static inline void mpx_mm_init(struct mm
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
 #else
-static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-					      struct task_struct *tsk)
+static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
 	return NULL;
 }
-static inline int mpx_handle_bd_fault(struct task_struct *tsk)
+static inline int mpx_handle_bd_fault(void)
 {
 	return -EINVAL;
 }
diff -puN arch/x86/include/asm/processor.h~x86-mpx-dont-pass-current-around arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86-mpx-dont-pass-current-around	2015-06-01 10:24:04.229731004 -0700
+++ b/arch/x86/include/asm/processor.h	2015-06-01 10:24:04.239731455 -0700
@@ -802,18 +802,18 @@ extern int get_tsc_mode(unsigned long ad
 extern int set_tsc_mode(unsigned int val);
 
 /* Register/unregister a process' MPX related resource */
-#define MPX_ENABLE_MANAGEMENT(tsk)	mpx_enable_management((tsk))
-#define MPX_DISABLE_MANAGEMENT(tsk)	mpx_disable_management((tsk))
+#define MPX_ENABLE_MANAGEMENT()	mpx_enable_management()
+#define MPX_DISABLE_MANAGEMENT()	mpx_disable_management()
 
 #ifdef CONFIG_X86_INTEL_MPX
-extern int mpx_enable_management(struct task_struct *tsk);
-extern int mpx_disable_management(struct task_struct *tsk);
+extern int mpx_enable_management(void);
+extern int mpx_disable_management(void);
 #else
-static inline int mpx_enable_management(struct task_struct *tsk)
+static inline int mpx_enable_management(void)
 {
 	return -EINVAL;
 }
-static inline int mpx_disable_management(struct task_struct *tsk)
+static inline int mpx_disable_management(void)
 {
 	return -EINVAL;
 }
diff -puN arch/x86/kernel/traps.c~x86-mpx-dont-pass-current-around arch/x86/kernel/traps.c
--- a/arch/x86/kernel/traps.c~x86-mpx-dont-pass-current-around	2015-06-01 10:24:04.231731095 -0700
+++ b/arch/x86/kernel/traps.c	2015-06-01 10:24:04.239731455 -0700
@@ -371,7 +371,6 @@ dotraplinkage void do_double_fault(struc
 
 dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 {
-	struct task_struct *tsk = current;
 	enum ctx_state prev_state;
 	const struct bndcsr *bndcsr;
 	siginfo_t *info;
@@ -407,11 +406,11 @@ dotraplinkage void do_bounds(struct pt_r
 	 */
 	switch (bndcsr->bndstatus & MPX_BNDSTA_ERROR_CODE) {
 	case 2:	/* Bound directory has invalid entry. */
-		if (mpx_handle_bd_fault(tsk))
+		if (mpx_handle_bd_fault())
 			goto exit_trap;
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
-		info = mpx_generate_siginfo(regs, tsk);
+		info = mpx_generate_siginfo(regs);
 		if (IS_ERR(info)) {
 			/*
 			 * We failed to decode the MPX instruction.  Act as if
diff -puN arch/x86/mm/mpx.c~x86-mpx-dont-pass-current-around arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~x86-mpx-dont-pass-current-around	2015-06-01 10:24:04.233731185 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:04.240731501 -0700
@@ -271,8 +271,7 @@ bad_opcode:
  *
  * The caller is expected to kfree() the returned siginfo_t.
  */
-siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct task_struct *tsk)
+siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
 	const struct bndreg *bndregs, *bndreg;
 	siginfo_t *info = NULL;
@@ -340,7 +339,7 @@ err_out:
 	return ERR_PTR(err);
 }
 
-static __user void *task_get_bounds_dir(struct task_struct *tsk)
+static __user void *mpx_get_bounds_dir(void)
 {
 	const struct bndcsr *bndcsr;
 
@@ -376,10 +375,10 @@ static __user void *task_get_bounds_dir(
 		(bndcsr->bndcfgu & MPX_BNDCFG_ADDR_MASK);
 }
 
-int mpx_enable_management(struct task_struct *tsk)
+int mpx_enable_management(void)
 {
 	void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
-	struct mm_struct *mm = tsk->mm;
+	struct mm_struct *mm = current->mm;
 	int ret = 0;
 
 	/*
@@ -393,7 +392,7 @@ int mpx_enable_management(struct task_st
 	 * directory here means that we do not have to do xsave in the
 	 * unmap path; we can just use mm->bd_addr instead.
 	 */
-	bd_base = task_get_bounds_dir(tsk);
+	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
 	mm->bd_addr = bd_base;
 	if (mm->bd_addr == MPX_INVALID_BOUNDS_DIR)
@@ -403,7 +402,7 @@ int mpx_enable_management(struct task_st
 	return ret;
 }
 
-int mpx_disable_management(struct task_struct *tsk)
+int mpx_disable_management(void)
 {
 	struct mm_struct *mm = current->mm;
 
@@ -497,7 +496,7 @@ out_unmap:
  * bound table is 16KB. With 64-bit mode, the size of BD is 2GB,
  * and the size of each bound table is 4MB.
  */
-static int do_mpx_bt_fault(struct task_struct *tsk)
+static int do_mpx_bt_fault(void)
 {
 	unsigned long bd_entry, bd_base;
 	const struct bndcsr *bndcsr;
@@ -525,7 +524,7 @@ static int do_mpx_bt_fault(struct task_s
 	return allocate_bt((long __user *)bd_entry);
 }
 
-int mpx_handle_bd_fault(struct task_struct *tsk)
+int mpx_handle_bd_fault(void)
 {
 	/*
 	 * Userspace never asked us to manage the bounds tables,
@@ -534,7 +533,7 @@ int mpx_handle_bd_fault(struct task_stru
 	if (!kernel_managing_mpx_tables(current->mm))
 		return -EINVAL;
 
-	if (do_mpx_bt_fault(tsk)) {
+	if (do_mpx_bt_fault()) {
 		force_sig(SIGSEGV, current);
 		/*
 		 * The force_sig() is essentially "handling" this
diff -puN kernel/sys.c~x86-mpx-dont-pass-current-around kernel/sys.c
--- a/kernel/sys.c~x86-mpx-dont-pass-current-around	2015-06-01 10:24:04.235731275 -0700
+++ b/kernel/sys.c	2015-06-01 10:24:04.241731546 -0700
@@ -92,10 +92,10 @@
 # define SET_TSC_CTL(a)		(-EINVAL)
 #endif
 #ifndef MPX_ENABLE_MANAGEMENT
-# define MPX_ENABLE_MANAGEMENT(a)	(-EINVAL)
+# define MPX_ENABLE_MANAGEMENT()	(-EINVAL)
 #endif
 #ifndef MPX_DISABLE_MANAGEMENT
-# define MPX_DISABLE_MANAGEMENT(a)	(-EINVAL)
+# define MPX_DISABLE_MANAGEMENT()	(-EINVAL)
 #endif
 #ifndef GET_FP_MODE
 # define GET_FP_MODE(a)		(-EINVAL)
@@ -2230,12 +2230,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 	case PR_MPX_ENABLE_MANAGEMENT:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = MPX_ENABLE_MANAGEMENT(me);
+		error = MPX_ENABLE_MANAGEMENT();
 		break;
 	case PR_MPX_DISABLE_MANAGEMENT:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = MPX_DISABLE_MANAGEMENT(me);
+		error = MPX_DISABLE_MANAGEMENT();
 		break;
 	case PR_SET_FP_MODE:
 		error = SET_FP_MODE(me, arg2);
_


* [PATCH 08/19] x86, mpx: trace #BR exceptions
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (7 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 09/19] x86, mpx: trace entry to bounds exception paths Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: Trace " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 10/19] x86, mpx: Trace the attempts to find bounds tables Dave Hansen
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This is the first in a series of MPX tracing patches.
I've found these extremely useful in the process of
debugging applications and the kernel code itself.

This tracepoint hooks into the bounds (#BR) exception
very early and captures the key registers that
influence how the exception is handled.

Note that bndcfgu/bndstatus are technically still
64-bit registers even in 32-bit mode.
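
As a usage note (the path assumes the usual debugfs mount point at
/sys/kernel/debug), the new tracepoint can be enabled and read like
any other event:

	# echo 1 > /sys/kernel/debug/tracing/events/mpx/bounds_exception_mpx/enable
	# cat /sys/kernel/debug/tracing/trace_pipe

Each #BR that reaches do_bounds() then shows up with the bndcfgu and
bndstatus values from the TP_printk() format in the patch below.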

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/trace/mpx.h |   50 +++++++++++++++++++++++++++++++++++++
 b/arch/x86/kernel/traps.c          |    2 +
 b/arch/x86/mm/mpx.c                |    3 ++
 3 files changed, 55 insertions(+)

diff -puN /dev/null arch/x86/include/asm/trace/mpx.h
--- /dev/null	2015-05-29 14:41:09.563023787 -0700
+++ b/arch/x86/include/asm/trace/mpx.h	2015-06-01 10:24:06.063813725 -0700
@@ -0,0 +1,50 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mpx
+
+#if !defined(_TRACE_MPX_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MPX_H
+
+#include <linux/tracepoint.h>
+
+#ifdef CONFIG_X86_INTEL_MPX
+
+TRACE_EVENT(bounds_exception_mpx,
+
+	TP_PROTO(const struct bndcsr *bndcsr),
+	TP_ARGS(bndcsr),
+
+	TP_STRUCT__entry(
+		__field(u64, bndcfgu)
+		__field(u64, bndstatus)
+	),
+
+	TP_fast_assign(
+		/* need to get rid of the 'const' on bndcsr */
+		__entry->bndcfgu   = (u64)bndcsr->bndcfgu;
+		__entry->bndstatus = (u64)bndcsr->bndstatus;
+	),
+
+	TP_printk("bndcfgu:0x%llx bndstatus:0x%llx",
+		__entry->bndcfgu,
+		__entry->bndstatus)
+);
+
+#else
+
+/*
+ * This gets used outside of MPX-specific code, so we need a stub.
+ */
+static inline void trace_bounds_exception_mpx(const struct bndcsr *bndcsr)
+{
+}
+
+#endif /* CONFIG_X86_INTEL_MPX */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE mpx
+#endif /* _TRACE_MPX_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff -puN arch/x86/kernel/traps.c~x86-br-exception-trace arch/x86/kernel/traps.c
--- a/arch/x86/kernel/traps.c~x86-br-exception-trace	2015-06-01 10:24:06.058813499 -0700
+++ b/arch/x86/kernel/traps.c	2015-06-01 10:24:06.064813770 -0700
@@ -60,6 +60,7 @@
 #include <asm/mach_traps.h>
 #include <asm/alternative.h>
 #include <asm/fpu/xstate.h>
+#include <asm/trace/mpx.h>
 #include <asm/mpx.h>
 
 #ifdef CONFIG_X86_64
@@ -399,6 +400,7 @@ dotraplinkage void do_bounds(struct pt_r
 	if (!bndcsr)
 		goto exit_trap;
 
+	trace_bounds_exception_mpx(bndcsr);
 	/*
 	 * The error code field of the BNDSTATUS register communicates status
 	 * information of a bound range exception #BR or operation involving
diff -puN arch/x86/mm/mpx.c~x86-br-exception-trace arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~x86-br-exception-trace	2015-06-01 10:24:06.060813590 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:06.064813770 -0700
@@ -17,6 +17,9 @@
 #include <asm/processor.h>
 #include <asm/fpu/internal.h>
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/mpx.h>
+
 static const char *mpx_mapping_name(struct vm_area_struct *vma)
 {
 	return "[mpx]";
_


* [PATCH 10/19] x86, mpx: Trace the attempts to find bounds tables
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (8 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 08/19] x86, mpx: trace #BR exceptions Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 13/19] x86, mpx: Add temporary variable to reduce masking Dave Hansen
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There are two different events being traced here.  They do
similar things, so they share a trace "EVENT_CLASS" and are
presented together.

1. Trace when MPX is zapping pages "mpx_unmap_zap":

	When MPX can not free an entire bounds table, it will
	instead try to zap unused parts of a bounds table to free
	the backing memory.  This decreases RSS (resident set
	size) without decreasing the virtual space allocated
	for bounds tables.

2. Trace attempts to find bounds tables "mpx_unmap_search":

	This event traces any time we go looking to unmap a
	bounds table for a given virtual address range.  This is
	useful for comparing how often the kernel "tried" to free
	a bounds table with how often it actually found one.

	It might try and fail if it realized that a table was
	shared with an adjacent VMA which is not being unmapped.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/trace/mpx.h |   32 ++++++++++++++++++++++++++++++++
 b/arch/x86/mm/mpx.c                |    2 ++
 2 files changed, 34 insertions(+)

diff -puN arch/x86/include/asm/trace/mpx.h~mpx-trace_unmap_zap arch/x86/include/asm/trace/mpx.h
--- a/arch/x86/include/asm/trace/mpx.h~mpx-trace_unmap_zap	2015-06-01 10:24:06.928852740 -0700
+++ b/arch/x86/include/asm/trace/mpx.h	2015-06-01 10:24:06.934853011 -0700
@@ -63,6 +63,38 @@ TRACE_EVENT(bounds_exception_mpx,
 		__entry->bndstatus)
 );
 
+DECLARE_EVENT_CLASS(mpx_range_trace,
+
+	TP_PROTO(unsigned long start,
+		 unsigned long end),
+	TP_ARGS(start, end),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->end   = end;
+	),
+
+	TP_printk("[0x%p:0x%p]",
+		(void *)__entry->start,
+		(void *)__entry->end
+	)
+);
+
+DEFINE_EVENT(mpx_range_trace, mpx_unmap_zap,
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end)
+);
+
+DEFINE_EVENT(mpx_range_trace, mpx_unmap_search,
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end)
+);
+
 #else
 
 /*
diff -puN arch/x86/mm/mpx.c~mpx-trace_unmap_zap arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-trace_unmap_zap	2015-06-01 10:24:06.930852830 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:06.934853011 -0700
@@ -668,6 +668,7 @@ static int zap_bt_entries(struct mm_stru
 
 		len = min(vma->vm_end, end) - addr;
 		zap_page_range(vma, addr, len, NULL);
+		trace_mpx_unmap_zap(addr, addr+len);
 
 		vma = vma->vm_next;
 		addr = vma->vm_start;
@@ -840,6 +841,7 @@ static int mpx_unmap_tables(struct mm_st
 	long __user *bd_entry, *bde_start, *bde_end;
 	unsigned long bt_addr;
 
+	trace_mpx_unmap_search(start, end);
 	/*
 	 * "Edge" bounds tables are those which are being used by the region
 	 * (start -> end), but that may be shared with adjacent areas.  If they
_


* [PATCH 09/19] x86, mpx: trace entry to bounds exception paths
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (6 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 07/19] x86, mpx: boot-time disable Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: Trace " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 08/19] x86, mpx: trace #BR exceptions Dave Hansen
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There are two basic things that can happen as the result of
a bounds exception (#BR):

	1. We allocate a new bounds table
	2. We pass up a bounds exception to userspace.

This patch adds a trace point for the case where we are
passing the exception up to userspace with a signal.

We are also explicit that we're printing out the inverse of
the 'upper' that we encounter.  If you want to filter, for
instance, you need to ~ the value first.  The reason we do
this is because of how 'upper' is stored in the bounds table.
If a pointer's range is:

	0x1000 -> 0x2000

it is stored in the bounds table as (32-bits here for brevity):

	lower: 0x00001000
	upper: 0xffffdfff

That is so that an all 0's entry:

	lower: 0x00000000
	upper: 0x00000000

corresponds to the "init" bounds which store a *range* of:

	0x00000000 -> 0xffffffff

That is, by far, the common case, and that lets us use the
zero page, or deduplicate the memory, etc... The 'upper'
stored in the table is gibberish to print by itself, so we
print ~upper to get the *actual*, logical, human-readable
value printed out.
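
As a concrete filtering example (reusing the 0x1000 -> 0x2000 range
above, and assuming the usual debugfs mount point): to catch exceptions
against that pointer's bounds you filter on the stored one's-complement
value, not on the printed one:

	# echo 'upper_bound == 0xffffffffffffdfff' > \
	  /sys/kernel/debug/tracing/events/mpx/mpx_bounds_register_exception/filter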

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/trace/mpx.h |   34 ++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/mpx.c                |    1 +
 2 files changed, 35 insertions(+)

diff -puN arch/x86/include/asm/trace/mpx.h~x86-mpx-trace-1 arch/x86/include/asm/trace/mpx.h
--- a/arch/x86/include/asm/trace/mpx.h~x86-mpx-trace-1	2015-06-01 10:24:06.504833616 -0700
+++ b/arch/x86/include/asm/trace/mpx.h	2015-06-01 10:24:06.509833841 -0700
@@ -8,6 +8,40 @@
 
 #ifdef CONFIG_X86_INTEL_MPX
 
+TRACE_EVENT(mpx_bounds_register_exception,
+
+	TP_PROTO(void *addr_referenced,
+		 const struct bndreg *bndreg),
+	TP_ARGS(addr_referenced, bndreg),
+
+	TP_STRUCT__entry(
+		__field(void *, addr_referenced)
+		__field(u64, lower_bound)
+		__field(u64, upper_bound)
+	),
+
+	TP_fast_assign(
+		__entry->addr_referenced = addr_referenced;
+		__entry->lower_bound = bndreg->lower_bound;
+		__entry->upper_bound = bndreg->upper_bound;
+	),
+	/*
+	 * Note that we are printing out the '~' of the upper
+	 * bounds register here.  It is actually stored in its
+	 * one's complement form so that its 'init' state
+	 * corresponds to all 0's.  But, that looks like
+	 * gibberish when printed out, so print out the 1's
+	 * complement instead of the actual value here.  Note
+	 * though that you still need to specify filters for the
+	 * actual value, not the displayed one.
+	 */
+	TP_printk("address referenced: 0x%p bounds: lower: 0x%llx ~upper: 0x%llx",
+		__entry->addr_referenced,
+		__entry->lower_bound,
+		~__entry->upper_bound
+	)
+);
+
 TRACE_EVENT(bounds_exception_mpx,
 
 	TP_PROTO(const struct bndcsr *bndcsr),
diff -puN arch/x86/mm/mpx.c~x86-mpx-trace-1 arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~x86-mpx-trace-1	2015-06-01 10:24:06.506833706 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:06.510833886 -0700
@@ -335,6 +335,7 @@ siginfo_t *mpx_generate_siginfo(struct p
 		err = -EINVAL;
 		goto err_out;
 	}
+	trace_mpx_bounds_register_exception(info->si_addr, bndreg);
 	return info;
 err_out:
 	/* info might be NULL, but kfree() handles that */
_


* [PATCH 12/19] x86: make is_64bit_mm() widely available
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (11 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 14/19] x86, mpx: new directory entry to addr helper Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:34   ` [tip:x86/fpu] x86: Make " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 11/19] x86, mpx: trace allocation of new bounds tables Dave Hansen
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen, oleg


From: Dave Hansen <dave.hansen@linux.intel.com>

The uprobes code has a nice helper, is_64bit_mm(), that consults both
the runtime and compile-time flags for 32-bit support.  Instead of
reinventing the wheel, pull it into an x86 header so we can use it
for MPX.

I prefer passing the mm around over using test_thread_flag(TIF_IA32)
because it makes it explicit where the context is coming from.
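
As a sketch of the intended use (this mirrors the helper added in the
'new directory entry to addr helper' patch later in this series; the
variable name here is only illustrative):

	if (is_64bit_mm(mm))
		align_to_bytes = 8;	/* 64-bit bounds directory entries */
	else
		align_to_bytes = 4;	/* 32-bit bounds directory entries */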

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/mmu_context.h |   13 +++++++++++++
 b/arch/x86/kernel/uprobes.c          |   10 +---------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~x86-make-is_64bit_mm-available arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~x86-make-is_64bit_mm-available	2015-06-01 10:24:07.787891484 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2015-06-01 10:24:07.793891755 -0700
@@ -142,6 +142,19 @@ static inline void arch_exit_mmap(struct
 	paravirt_arch_exit_mmap(mm);
 }
 
+#ifdef CONFIG_X86_64
+static inline bool is_64bit_mm(struct mm_struct *mm)
+{
+	return	!config_enabled(CONFIG_IA32_EMULATION) ||
+		!(mm->context.ia32_compat == TIF_IA32);
+}
+#else
+static inline bool is_64bit_mm(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
 static inline void arch_bprm_mm_init(struct mm_struct *mm,
 		struct vm_area_struct *vma)
 {
diff -puN arch/x86/kernel/uprobes.c~x86-make-is_64bit_mm-available arch/x86/kernel/uprobes.c
--- a/arch/x86/kernel/uprobes.c~x86-make-is_64bit_mm-available	2015-06-01 10:24:07.789891574 -0700
+++ b/arch/x86/kernel/uprobes.c	2015-06-01 10:24:07.793891755 -0700
@@ -29,6 +29,7 @@
 #include <linux/kdebug.h>
 #include <asm/processor.h>
 #include <asm/insn.h>
+#include <asm/mmu_context.h>
 
 /* Post-execution fixups. */
 
@@ -312,11 +313,6 @@ static int uprobe_init_insn(struct arch_
 }
 
 #ifdef CONFIG_X86_64
-static inline bool is_64bit_mm(struct mm_struct *mm)
-{
-	return	!config_enabled(CONFIG_IA32_EMULATION) ||
-		!(mm->context.ia32_compat == TIF_IA32);
-}
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
@@ -497,10 +493,6 @@ static void riprel_post_xol(struct arch_
 	}
 }
 #else /* 32-bit: */
-static inline bool is_64bit_mm(struct mm_struct *mm)
-{
-	return false;
-}
 /*
  * No RIP-relative addressing on 32-bit
  */
_


* [PATCH 14/19] x86, mpx: new directory entry to addr helper
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (10 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 13/19] x86, mpx: Add temporary variable to reduce masking Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:34   ` [tip:x86/fpu] x86/mpx: Introduce new 'directory entry' to 'addr' helper function tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 12/19] x86: make is_64bit_mm() widely available Dave Hansen
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, to get from a bounds directory entry to the virtual
address of a bounds table, we simply mask off a few low bits.
However, the set of bits we mask off is different for 32 and
64-bit binaries.

This breaks the operation out into a helper function and also
adds a temporary variable to store the result until we are
sure we are returning one.
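
As a tiny worked example (the entry value is made up): for a bounds
directory entry of 0x1005, the valid bit (bit 0) is masked off first;
a 64-bit mm then aligns the result down to 8 bytes, giving a bounds
table address of 0x1000, while a 32-bit mm aligns down to 4 bytes,
giving 0x1004.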

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

---

 b/arch/x86/include/asm/mpx.h |    1 -
 b/arch/x86/mm/mpx.c          |   41 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff -puN arch/x86/include/asm/mpx.h~mpx-new-entry-to-addr-helper arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~mpx-new-entry-to-addr-helper	2015-06-01 10:24:08.603928289 -0700
+++ b/arch/x86/include/asm/mpx.h	2015-06-01 10:24:08.608928515 -0700
@@ -45,7 +45,6 @@
 #define MPX_BNDSTA_TAIL		2
 #define MPX_BNDCFG_TAIL		12
 #define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
-#define MPX_BT_ADDR_MASK	(~((1UL<<MPX_BD_ENTRY_TAIL)-1))
 
 #define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BNDSTA_ERROR_CODE	0x3
diff -puN arch/x86/mm/mpx.c~mpx-new-entry-to-addr-helper arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-new-entry-to-addr-helper	2015-06-01 10:24:08.604928334 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:08.609928560 -0700
@@ -576,29 +576,55 @@ static int mpx_resolve_fault(long __user
 	return 0;
 }
 
+static unsigned long mpx_bd_entry_to_bt_addr(struct mm_struct *mm,
+					     unsigned long bd_entry)
+{
+	unsigned long bt_addr = bd_entry;
+	int align_to_bytes;
+	/*
+	 * Bit 0 in a bt_entry is always the valid bit.
+	 */
+	bt_addr &= ~MPX_BD_ENTRY_VALID_FLAG;
+	/*
+	 * Tables are naturally aligned at 8-byte boundaries
+	 * on 64-bit and 4-byte boundaries on 32-bit.  The
+	 * documentation makes it appear that the low bits
+	 * are ignored by the hardware, so we do the same.
+	 */
+	if (is_64bit_mm(mm))
+		align_to_bytes = 8;
+	else
+		align_to_bytes = 4;
+	bt_addr &= ~(align_to_bytes-1);
+	return bt_addr;
+}
+
 /*
  * Get the base of bounds tables pointed by specific bounds
  * directory entry.
  */
 static int get_bt_addr(struct mm_struct *mm,
-			long __user *bd_entry, unsigned long *bt_addr)
+			long __user *bd_entry_ptr,
+			unsigned long *bt_addr_result)
 {
 	int ret;
 	int valid_bit;
+	unsigned long bd_entry;
+	unsigned long bt_addr;
 
-	if (!access_ok(VERIFY_READ, (bd_entry), sizeof(*bd_entry)))
+	if (!access_ok(VERIFY_READ, (bd_entry_ptr), sizeof(*bd_entry_ptr)))
 		return -EFAULT;
 
 	while (1) {
 		int need_write = 0;
 
 		pagefault_disable();
-		ret = get_user(*bt_addr, bd_entry);
+		ret = get_user(bd_entry, bd_entry_ptr);
 		pagefault_enable();
 		if (!ret)
 			break;
 		if (ret == -EFAULT)
-			ret = mpx_resolve_fault(bd_entry, need_write);
+			ret = mpx_resolve_fault(bd_entry_ptr, need_write);
 		/*
 		 * If we could not resolve the fault, consider it
 		 * userspace's fault and error out.
@@ -607,8 +633,8 @@ static int get_bt_addr(struct mm_struct
 			return ret;
 	}
 
-	valid_bit = *bt_addr & MPX_BD_ENTRY_VALID_FLAG;
-	*bt_addr &= MPX_BT_ADDR_MASK;
+	valid_bit = bd_entry & MPX_BD_ENTRY_VALID_FLAG;
+	bt_addr = mpx_bd_entry_to_bt_addr(mm, bd_entry);
 
 	/*
 	 * When the kernel is managing bounds tables, a bounds directory
@@ -617,7 +643,7 @@ static int get_bt_addr(struct mm_struct
 	 * data in the address field, we know something is wrong. This
 	 * -EINVAL return will cause a SIGSEGV.
 	 */
-	if (!valid_bit && *bt_addr)
+	if (!valid_bit && bt_addr)
 		return -EINVAL;
 	/*
 	 * Do we have an completely zeroed bt entry?  That is OK.  It
@@ -628,6 +654,7 @@ static int get_bt_addr(struct mm_struct
 	if (!valid_bit)
 		return -ENOENT;
 
+	*bt_addr_result = bt_addr;
 	return 0;
 }
 
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 11/19] x86, mpx: trace allocation of new bounds tables
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (12 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 12/19] x86: make is_64bit_mm() widely available Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: Trace " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 15/19] x86, mpx: do 32-bit-only cmpxchg for 32-bit apps Dave Hansen
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Bounds tables are a significant consumer of memory.  It is important
to know when they are being allocated.  Add a trace point that fires
whenever an allocation occurs and records the new table's virtual
address.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

---

 b/arch/x86/include/asm/trace/mpx.h |   16 ++++++++++++++++
 b/arch/x86/mm/mpx.c                |    1 +
 2 files changed, 17 insertions(+)

diff -puN arch/x86/include/asm/trace/mpx.h~trace_mpx_new_bounds_table arch/x86/include/asm/trace/mpx.h
--- a/arch/x86/include/asm/trace/mpx.h~trace_mpx_new_bounds_table	2015-06-01 10:24:07.357872090 -0700
+++ b/arch/x86/include/asm/trace/mpx.h	2015-06-01 10:24:07.362872315 -0700
@@ -95,6 +95,22 @@ DEFINE_EVENT(mpx_range_trace, mpx_unmap_
 	TP_ARGS(start, end)
 );
 
+TRACE_EVENT(mpx_new_bounds_table,
+
+	TP_PROTO(unsigned long table_vaddr),
+	TP_ARGS(table_vaddr),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, table_vaddr)
+	),
+
+	TP_fast_assign(
+		__entry->table_vaddr = table_vaddr;
+	),
+
+	TP_printk("table vaddr:%p", (void *)__entry->table_vaddr)
+);
+
 #else
 
 /*
diff -puN arch/x86/mm/mpx.c~trace_mpx_new_bounds_table arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~trace_mpx_new_bounds_table	2015-06-01 10:24:07.359872180 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:07.363872360 -0700
@@ -483,6 +483,7 @@ static int allocate_bt(long __user *bd_e
 		ret = -EINVAL;
 		goto out_unmap;
 	}
+	trace_mpx_new_bounds_table(bt_addr);
 	return 0;
 out_unmap:
 	vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 13/19] x86, mpx: Add temporary variable to reduce masking
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (9 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 10/19] x86, mpx: Trace the attempts to find bounds tables Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:34   ` [tip:x86/fpu] x86/mpx: " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 14/19] x86, mpx: new directory entry to addr helper Dave Hansen
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

When we allocate a bounds table, we call mmap(), then add a
"valid" bit to the value before storing it into the bounds
directory.

If we fail along the way, we go and mask that valid bit
_back_ out.  That seems a little silly.  Using a separate
temporary variable avoids the extra masking and makes it much
clearer when we have a plain address versus an actual table
_entry_.
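
Schematically, the before/after looks like this (paraphrasing the hunks
below):

  /* before: bt_addr doubles as the directory entry, so the error
   * path has to mask the valid bit back out: */
  bt_addr = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
  ...
  vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);

  /* after: bt_addr stays a plain address and bd_new_entry carries
   * the valid bit, so no masking is needed on the error path: */
  bd_new_entry = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
  ...
  vm_munmap(bt_addr, MPX_BT_SIZE_BYTES);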

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/mpx.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff -puN arch/x86/mm/mpx.c~mpx-remove-unnecessary-masking arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-remove-unnecessary-masking	2015-06-01 10:24:08.222911104 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:08.227911330 -0700
@@ -429,6 +429,7 @@ static int allocate_bt(long __user *bd_e
 	unsigned long expected_old_val = 0;
 	unsigned long actual_old_val = 0;
 	unsigned long bt_addr;
+	unsigned long bd_new_entry;
 	int ret = 0;
 
 	/*
@@ -441,7 +442,7 @@ static int allocate_bt(long __user *bd_e
 	/*
 	 * Set the valid flag (kinda like _PAGE_PRESENT in a pte)
 	 */
-	bt_addr = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+	bd_new_entry = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
 
 	/*
 	 * Go poke the address of the new bounds table in to the
@@ -455,7 +456,7 @@ static int allocate_bt(long __user *bd_e
 	 * of the MPX code that have to pagefault_disable().
 	 */
 	ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
-					   expected_old_val, bt_addr);
+					   expected_old_val, bd_new_entry);
 	if (ret)
 		goto out_unmap;
 
@@ -486,7 +487,7 @@ static int allocate_bt(long __user *bd_e
 	trace_mpx_new_bounds_table(bt_addr);
 	return 0;
 out_unmap:
-	vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
+	vm_munmap(bt_addr, MPX_BT_SIZE_BYTES);
 	return ret;
 }
 
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 15/19] x86, mpx: do 32-bit-only cmpxchg for 32-bit apps
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (13 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 11/19] x86, mpx: trace allocation of new bounds tables Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Use 32-bit-only cmpxchg() " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 16/19] x86, mpx: support 32-bit binaries on 64-bit kernel Dave Hansen
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

user_atomic_cmpxchg_inatomic() actually looks at sizeof(*ptr) to
figure out how many bytes to copy.  If we run it on a 64-bit
kernel with a 64-bit pointer, it will copy a 64-bit bounds
directory entry.  That's fine, except when we have 32-bit
programs with 32-bit bounds directory entries and we only *want*
32 bits.

This patch breaks the cmpxchg operation out into its own
function and performs the 32-bit type swizzling in there.

Note, the "64-bit" version of this code _would_ work on a
32-bit-only kernel.  The issue this patch addresses is only for
when the kernel's 'long' is mismatched from the size of the
bounds directory entry of the process we are working on.

The new helper modifies 'actual_old_val' or returns an error.
But gcc doesn't know this, so it warns that 'actual_old_val'
may be used uninitialized.  Shut it up with an uninitialized_var().
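
The sizeof()-driven access width is easy to see in plain C (a simplified
userspace illustration of why the pointer type matters, not the uaccess
code itself; the 8-byte result assumes an LP64 build):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long *bd_entry_64 = NULL; /* how a 64-bit mm's entry is typed */
  	unsigned int  *bd_entry_32 = NULL; /* how a 32-bit mm's entry is typed */

  	/*
  	 * user_atomic_cmpxchg_inatomic() sizes its userspace access
  	 * from sizeof(*ptr), so the pointer type alone decides how
  	 * many bytes get compared-and-exchanged:
  	 */
  	printf("64-bit entry access: %zu bytes\n", sizeof(*bd_entry_64)); /* 8 on LP64 */
  	printf("32-bit entry access: %zu bytes\n", sizeof(*bd_entry_32)); /* 4 */
  	return 0;
  }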

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

---

Changes from v21:
 * correct type of bd_entry_addr from long -> unsigned long
 * shorten variable names
 * add whitespace after variables
 * unconditionally do assignment
---

 b/arch/x86/mm/mpx.c |   41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff -puN arch/x86/mm/mpx.c~mpx-variable-sized-userspace-pokes arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-variable-sized-userspace-pokes	2015-06-01 10:24:08.926942858 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:08.928942948 -0700
@@ -419,6 +419,35 @@ int mpx_disable_management(void)
 	return 0;
 }
 
+static int mpx_cmpxchg_bd_entry(struct mm_struct *mm,
+		unsigned long *curval,
+		unsigned long __user *addr,
+		unsigned long old_val, unsigned long new_val)
+{
+	int ret;
+	/*
+	 * user_atomic_cmpxchg_inatomic() actually uses sizeof()
+	 * the pointer that we pass to it to figure out how much
+	 * data to cmpxchg.  We have to be careful here not to
+	 * pass a pointer to a 64-bit data type when we only want
+	 * a 32-bit copy.
+	 */
+	if (is_64bit_mm(mm)) {
+		ret = user_atomic_cmpxchg_inatomic(curval,
+				addr, old_val, new_val);
+	} else {
+		u32 uninitialized_var(curval_32);
+		u32 old_val_32 = old_val;
+		u32 new_val_32 = new_val;
+		u32 __user *addr_32 = (u32 __user *)addr;
+
+		ret = user_atomic_cmpxchg_inatomic(&curval_32,
+				addr_32, old_val_32, new_val_32);
+		*curval = curval_32;
+	}
+	return ret;
+}
+
 /*
  * With 32-bit mode, MPX_BT_SIZE_BYTES is 4MB, and the size of each
  * bounds table is 16KB. With 64-bit mode, MPX_BT_SIZE_BYTES is 2GB,
@@ -426,6 +455,7 @@ int mpx_disable_management(void)
  */
 static int allocate_bt(long __user *bd_entry)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned long expected_old_val = 0;
 	unsigned long actual_old_val = 0;
 	unsigned long bt_addr;
@@ -455,8 +485,8 @@ static int allocate_bt(long __user *bd_e
 	 * mmap_sem at this point, unlike some of the other part
 	 * of the MPX code that have to pagefault_disable().
 	 */
-	ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
-					   expected_old_val, bd_new_entry);
+	ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,	bd_entry,
+				   expected_old_val, bd_new_entry);
 	if (ret)
 		goto out_unmap;
 
@@ -710,15 +740,16 @@ static int unmap_single_bt(struct mm_str
 		long __user *bd_entry, unsigned long bt_addr)
 {
 	unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
-	unsigned long actual_old_val = 0;
+	unsigned long uninitialized_var(actual_old_val);
 	int ret;
 
 	while (1) {
 		int need_write = 1;
+		unsigned long cleared_bd_entry = 0;
 
 		pagefault_disable();
-		ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
-						   expected_old_val, 0);
+		ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,
+				bd_entry, expected_old_val, cleared_bd_entry);
 		pagefault_enable();
 		if (!ret)
 			break;
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 16/19] x86, mpx: support 32-bit binaries on 64-bit kernel
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (14 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 15/19] x86, mpx: do 32-bit-only cmpxchg for 32-bit apps Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Support 32-bit binaries on 64-bit kernels tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping Dave Hansen
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Right now, the kernel's MPX table layout can only switch between
the 64-bit and 32-bit formats at compile time. This patch adds
support for 32-bit binaries on 64-bit kernels when the kernel is
built with ia32 emulation.

We essentially choose which set of table sizes to use when doing
arithmetic for the bounds table calculations.

This also uses a different approach for calculating the table
indexes than before.  I think the new one makes it much clearer
what is going on, and allows us to share more code between the
32- and 64-bit cases.
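
As a sanity check, the new 64-bit constants are self-consistent: one
bounds directory entry and one bounds table both cover 1MB of virtual
address space.  A standalone sketch of the arithmetic (the 48-bit
virtual-address width is an assumption for the common x86_64 case,
standing in for boot_cpu_data.x86_virt_bits):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long long virt_space    = 1ULL << 48;          /* assumed 48 virt bits   */
  	unsigned long long bd_entries_64 = (1ULL << 31) / 8;    /* 2GB dir, 8B entries    */
  	unsigned long long bt_entries_64 = (1ULL << 22) / 32;   /* 4MB table, 32B entries */

  	/* virtual space covered by one directory entry */
  	printf("per-BD-entry coverage: %llu bytes\n", virt_space / bd_entries_64);
  	/* virtual space covered by one table (one 8-byte pointer slot per entry) */
  	printf("per-BT coverage:       %llu bytes\n", bt_entries_64 * 8);
  	return 0;
  }

Both print 1048576, which is what lets the new helpers simply divide an
address by the per-directory-entry coverage to find its entry.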

Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

--

Changes from v22:
 * fix spelling nit

Changes from v21:
 * remove old comment
 * add a comment about why the masks are OK
 * use cpu_info->x86_virt_bits
 * add comment about lack of 32-bit hole

Changes from v20:

 * Fix macro confusion between BD and BT
 * Add accessor for bt_entry_size_bytes()

---

 b/arch/x86/include/asm/mpx.h |   66 ++++++++--------
 b/arch/x86/mm/mpx.c          |  170 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 181 insertions(+), 55 deletions(-)

diff -puN arch/x86/include/asm/mpx.h~0002-x86-mpx-support-32bit-binaries-on-64bit-kernel arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~0002-x86-mpx-support-32bit-binaries-on-64bit-kernel	2015-06-01 10:24:09.323960764 -0700
+++ b/arch/x86/include/asm/mpx.h	2015-06-01 10:24:09.329961034 -0700
@@ -13,49 +13,47 @@
 #define MPX_BNDCFG_ENABLE_FLAG	0x1
 #define MPX_BD_ENTRY_VALID_FLAG	0x1
 
-#ifdef CONFIG_X86_64
-
-/* upper 28 bits [47:20] of the virtual address in 64-bit used to
- * index into bounds directory (BD).
+/*
+ * The upper 28 bits [47:20] of the virtual address in 64-bit
+ * are used to index into bounds directory (BD).
+ *
+ * The directory is 2G (2^31) in size, and with 8-byte entries
+ * it has 2^28 entries.
  */
-#define MPX_BD_ENTRY_OFFSET	28
-#define MPX_BD_ENTRY_SHIFT	3
-/* bits [19:3] of the virtual address in 64-bit used to index into
- * bounds table (BT).
+#define MPX_BD_SIZE_BYTES_64	(1UL<<31)
+#define MPX_BD_ENTRY_BYTES_64	8
+#define MPX_BD_NR_ENTRIES_64	(MPX_BD_SIZE_BYTES_64/MPX_BD_ENTRY_BYTES_64)
+
+/*
+ * The 32-bit directory is 4MB (2^22) in size, and with 4-byte
+ * entries it has 2^20 entries.
  */
-#define MPX_BT_ENTRY_OFFSET	17
-#define MPX_BT_ENTRY_SHIFT	5
-#define MPX_IGN_BITS		3
-#define MPX_BD_ENTRY_TAIL	3
-
-#else
-
-#define MPX_BD_ENTRY_OFFSET	20
-#define MPX_BD_ENTRY_SHIFT	2
-#define MPX_BT_ENTRY_OFFSET	10
-#define MPX_BT_ENTRY_SHIFT	4
-#define MPX_IGN_BITS		2
-#define MPX_BD_ENTRY_TAIL	2
-
-#endif
-
-#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
-#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
+#define MPX_BD_SIZE_BYTES_32	(1UL<<22)
+#define MPX_BD_ENTRY_BYTES_32	4
+#define MPX_BD_NR_ENTRIES_32	(MPX_BD_SIZE_BYTES_32/MPX_BD_ENTRY_BYTES_32)
+
+/*
+ * A 64-bit table is 4MB total in size, and an entry is
+ * 4 64-bit pointers in size.
+ */
+#define MPX_BT_SIZE_BYTES_64	(1UL<<22)
+#define MPX_BT_ENTRY_BYTES_64	32
+#define MPX_BT_NR_ENTRIES_64	(MPX_BT_SIZE_BYTES_64/MPX_BT_ENTRY_BYTES_64)
+
+/*
+ * A 32-bit table is 16kB total in size, and an entry is
+ * 4 32-bit pointers in size.
+ */
+#define MPX_BT_SIZE_BYTES_32	(1UL<<14)
+#define MPX_BT_ENTRY_BYTES_32	16
+#define MPX_BT_NR_ENTRIES_32	(MPX_BT_SIZE_BYTES_32/MPX_BT_ENTRY_BYTES_32)
 
 #define MPX_BNDSTA_TAIL		2
 #define MPX_BNDCFG_TAIL		12
 #define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
-
 #define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BNDSTA_ERROR_CODE	0x3
 
-#define MPX_BD_ENTRY_MASK	((1<<MPX_BD_ENTRY_OFFSET)-1)
-#define MPX_BT_ENTRY_MASK	((1<<MPX_BT_ENTRY_OFFSET)-1)
-#define MPX_GET_BD_ENTRY_OFFSET(addr)	((((addr)>>(MPX_BT_ENTRY_OFFSET+ \
-		MPX_IGN_BITS)) & MPX_BD_ENTRY_MASK) << MPX_BD_ENTRY_SHIFT)
-#define MPX_GET_BT_ENTRY_OFFSET(addr)	((((addr)>>MPX_IGN_BITS) & \
-		MPX_BT_ENTRY_MASK) << MPX_BT_ENTRY_SHIFT)
-
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs);
 int mpx_handle_bd_fault(void);
diff -puN arch/x86/mm/mpx.c~0002-x86-mpx-support-32bit-binaries-on-64bit-kernel arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~0002-x86-mpx-support-32bit-binaries-on-64bit-kernel	2015-06-01 10:24:09.325960854 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:09.330961080 -0700
@@ -34,6 +34,22 @@ static int is_mpx_vma(struct vm_area_str
 	return (vma->vm_ops == &mpx_vma_ops);
 }
 
+static inline unsigned long mpx_bd_size_bytes(struct mm_struct *mm)
+{
+	if (is_64bit_mm(mm))
+		return MPX_BD_SIZE_BYTES_64;
+	else
+		return MPX_BD_SIZE_BYTES_32;
+}
+
+static inline unsigned long mpx_bt_size_bytes(struct mm_struct *mm)
+{
+	if (is_64bit_mm(mm))
+		return MPX_BT_SIZE_BYTES_64;
+	else
+		return MPX_BT_SIZE_BYTES_32;
+}
+
 /*
  * This is really a simplified "vm_mmap". it only handles MPX
  * bounds tables (the bounds directory is user-allocated).
@@ -50,7 +66,7 @@ static unsigned long mpx_mmap(unsigned l
 	struct vm_area_struct *vma;
 
 	/* Only bounds table can be allocated here */
-	if (len != MPX_BT_SIZE_BYTES)
+	if (len != mpx_bt_size_bytes(mm))
 		return -EINVAL;
 
 	down_write(&mm->mmap_sem);
@@ -449,13 +465,12 @@ static int mpx_cmpxchg_bd_entry(struct m
 }
 
 /*
- * With 32-bit mode, MPX_BT_SIZE_BYTES is 4MB, and the size of each
- * bounds table is 16KB. With 64-bit mode, MPX_BT_SIZE_BYTES is 2GB,
+ * With 32-bit mode, a bounds directory is 4MB, and the size of each
+ * bounds table is 16KB. With 64-bit mode, a bounds directory is 2GB,
  * and the size of each bounds table is 4MB.
  */
-static int allocate_bt(long __user *bd_entry)
+static int allocate_bt(struct mm_struct *mm, long __user *bd_entry)
 {
-	struct mm_struct *mm = current->mm;
 	unsigned long expected_old_val = 0;
 	unsigned long actual_old_val = 0;
 	unsigned long bt_addr;
@@ -466,7 +481,7 @@ static int allocate_bt(long __user *bd_e
 	 * Carve the virtual space out of userspace for the new
 	 * bounds table:
 	 */
-	bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
+	bt_addr = mpx_mmap(mpx_bt_size_bytes(mm));
 	if (IS_ERR((void *)bt_addr))
 		return PTR_ERR((void *)bt_addr);
 	/*
@@ -517,7 +532,7 @@ static int allocate_bt(long __user *bd_e
 	trace_mpx_new_bounds_table(bt_addr);
 	return 0;
 out_unmap:
-	vm_munmap(bt_addr, MPX_BT_SIZE_BYTES);
+	vm_munmap(bt_addr, mpx_bt_size_bytes(mm));
 	return ret;
 }
 
@@ -536,6 +551,7 @@ static int do_mpx_bt_fault(void)
 {
 	unsigned long bd_entry, bd_base;
 	const struct bndcsr *bndcsr;
+	struct mm_struct *mm = current->mm;
 
 	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
@@ -554,10 +570,10 @@ static int do_mpx_bt_fault(void)
 	 * the directory is.
 	 */
 	if ((bd_entry < bd_base) ||
-	    (bd_entry >= bd_base + MPX_BD_SIZE_BYTES))
+	    (bd_entry >= bd_base + mpx_bd_size_bytes(mm)))
 		return -EINVAL;
 
-	return allocate_bt((long __user *)bd_entry);
+	return allocate_bt(mm, (long __user *)bd_entry);
 }
 
 int mpx_handle_bd_fault(void)
@@ -789,7 +805,115 @@ static int unmap_single_bt(struct mm_str
 	 * avoid recursion, do_munmap() will check whether it comes
 	 * from one bounds table through VM_MPX flag.
 	 */
-	return do_munmap(mm, bt_addr, MPX_BT_SIZE_BYTES);
+	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm));
+}
+
+static inline int bt_entry_size_bytes(struct mm_struct *mm)
+{
+	if (is_64bit_mm(mm))
+		return MPX_BT_ENTRY_BYTES_64;
+	else
+		return MPX_BT_ENTRY_BYTES_32;
+}
+
+/*
+ * Take a virtual address and turn it into the offset in bytes
+ * inside of the bounds table where the bounds table entry
+ * controlling 'addr' can be found.
+ */
+static unsigned long mpx_get_bt_entry_offset_bytes(struct mm_struct *mm,
+		unsigned long addr)
+{
+	unsigned long bt_table_nr_entries;
+	unsigned long offset = addr;
+
+	if (is_64bit_mm(mm)) {
+		/* Bottom 3 bits are ignored on 64-bit */
+		offset >>= 3;
+		bt_table_nr_entries = MPX_BT_NR_ENTRIES_64;
+	} else {
+		/* Bottom 2 bits are ignored on 32-bit */
+		offset >>= 2;
+		bt_table_nr_entries = MPX_BT_NR_ENTRIES_32;
+	}
+	/*
+	 * We know the size of the table in to which we are
+	 * indexing, and we have eliminated all the low bits
+	 * which are ignored for indexing.
+	 *
+	 * Mask out all the high bits which we do not need
+	 * to index in to the table.  Note that the tables
+	 * are always powers of two so this gives us a proper
+	 * mask.
+	 */
+	offset &= (bt_table_nr_entries-1);
+	/*
+	 * We now have an entry offset in terms of *entries* in
+	 * the table.  We need to scale it back up to bytes.
+	 */
+	offset *= bt_entry_size_bytes(mm);
+	return offset;
+}
+
+/*
+ * How much virtual address space does a single bounds
+ * directory entry cover?
+ *
+ * Note, we need a long long because 4GB doesn't fit in
+ * to a long on 32-bit.
+ */
+static inline unsigned long bd_entry_virt_space(struct mm_struct *mm)
+{
+	unsigned long long virt_space = (1ULL << boot_cpu_data.x86_virt_bits);
+	if (is_64bit_mm(mm))
+		return virt_space / MPX_BD_NR_ENTRIES_64;
+	else
+		return virt_space / MPX_BD_NR_ENTRIES_32;
+}
+
+/*
+ * Return an offset in terms of bytes in to the bounds
+ * directory where the bounds directory entry for a given
+ * virtual address resides.
+ *
+ * This has to be in bytes because the directory entries
+ * are different sizes on 64/32 bit.
+ */
+static unsigned long mpx_get_bd_entry_offset(struct mm_struct *mm,
+		unsigned long addr)
+{
+	/*
+	 * There are several ways to derive the bd offsets.  We
+	 * use the following approach here:
+	 * 1. We know the size of the virtual address space
+	 * 2. We know the number of entries in a bounds table
+	 * 3. We know that each entry covers a fixed amount of
+	 *    virtual address space.
+	 * So, we can just divide the virtual address by the
+	 * virtual space used by one entry to determine which
+	 * entry "controls" the given virtual address.
+	 */
+	if (is_64bit_mm(mm)) {
+		int bd_entry_size = 8; /* 64-bit pointer */
+		/*
+		 * Take the 64-bit addressing hole in to account.
+		 */
+		addr &= ((1UL << boot_cpu_data.x86_virt_bits) - 1);
+		return (addr / bd_entry_virt_space(mm)) * bd_entry_size;
+	} else {
+		int bd_entry_size = 4; /* 32-bit pointer */
+		/*
+		 * 32-bit has no hole so this case needs no mask
+		 */
+		return (addr / bd_entry_virt_space(mm)) * bd_entry_size;
+	}
+	/*
+	 * The two return calls above are exact copies.  If we
+	 * pull out a single copy and put it in here, gcc won't
+	 * realize that we're doing a power-of-2 divide and use
+	 * shifts.  It uses a real divide.  If we put them up
+	 * there, it manages to figure it out (gcc 4.8.3).
+	 */
 }
 
 /*
@@ -803,6 +927,7 @@ static int unmap_shared_bt(struct mm_str
 		unsigned long end, bool prev_shared, bool next_shared)
 {
 	unsigned long bt_addr;
+	unsigned long start_off, end_off;
 	int ret;
 
 	ret = get_bt_addr(mm, bd_entry, &bt_addr);
@@ -814,17 +939,20 @@ static int unmap_shared_bt(struct mm_str
 	if (ret)
 		return ret;
 
+	start_off = mpx_get_bt_entry_offset_bytes(mm, start);
+	end_off   = mpx_get_bt_entry_offset_bytes(mm, end);
+
 	if (prev_shared && next_shared)
 		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(end));
+				bt_addr + start_off,
+				bt_addr + end_off);
 	else if (prev_shared)
 		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
-				bt_addr+MPX_BT_SIZE_BYTES);
+				bt_addr + start_off,
+				bt_addr + mpx_bt_size_bytes(mm));
 	else if (next_shared)
 		ret = zap_bt_entries(mm, bt_addr, bt_addr,
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(end));
+				bt_addr + end_off);
 	else
 		ret = unmap_single_bt(mm, bd_entry, bt_addr);
 
@@ -845,8 +973,8 @@ static int unmap_edge_bts(struct mm_stru
 	struct vm_area_struct *prev, *next;
 	bool prev_shared = false, next_shared = false;
 
-	bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
-	bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
+	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
 
 	/*
 	 * Check whether bde_start and bde_end are shared with adjacent
@@ -858,10 +986,10 @@ static int unmap_edge_bts(struct mm_stru
 	 * in to 'next'.
 	 */
 	next = find_vma_prev(mm, start, &prev);
-	if (prev && (mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(prev->vm_end-1))
+	if (prev && (mm->bd_addr + mpx_get_bd_entry_offset(mm, prev->vm_end-1))
 			== bde_start)
 		prev_shared = true;
-	if (next && (mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(next->vm_start))
+	if (next && (mm->bd_addr + mpx_get_bd_entry_offset(mm, next->vm_start))
 			== bde_end)
 		next_shared = true;
 
@@ -927,8 +1055,8 @@ static int mpx_unmap_tables(struct mm_st
 	 *   1. fully covered
 	 *   2. not at the edges of the mapping, even if full aligned
 	 */
-	bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
-	bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
+	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
 	for (bd_entry = bde_start + 1; bd_entry < bde_end; bd_entry++) {
 		ret = get_bt_addr(mm, bd_entry, &bt_addr);
 		switch (ret) {
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 17/19] x86, mpx: rewrite unmap code
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (16 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Rewrite the " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 19/19] x86, mpx: allow mixed binaries again Dave Hansen
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code needs to clear out bounds tables for memory which
is no longer in use.  We do this when a userspace mapping is
torn down (unmapped).

There are two modes:
1. An entire bounds table becomes unused, and can be freed
   and its pointer removed from the bounds directory.  This
   happens either when a large mapping is torn down, or when
   a small mapping is torn down and it is the last mapping
   "covered" by a bounds table.
2. Only part of a bounds table becomes unused, in which case
   we free the backing memory as if MADV_DONTNEED was called.

The old code was a spaghetti mess of "edge" bounds tables
where the edges were handled specially, even if we were
unmapping an entire one.  Non-edge bounds tables are always
fully unmapped, but share a different code path from the edge
ones.  The old code had a bug where it was unmapping too much
memory.  I worked on fixing it for two days and gave up.

I didn't write the original code.  I didn't particularly like
it, but it worked, so I left it.  After my debug session, I
realized it was undebuggable *and* buggy, so out it went.

I also wrote a new unmapping test program which uncovers bugs
pretty nicely.
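
The shape of the rewritten code is a simple walk over the unmapped range
in bounds-table-sized chunks, roughly like this (a schematic of the new
mpx_unmap_tables() loop from the hunk below, with the clamping written
as min() for brevity):

  one_unmap_start = start;
  while (one_unmap_start < end) {
  	/* step to the start of the next bounds table's area */
  	next_unmap_start = ALIGN(one_unmap_start + 1, bd_entry_virt_space(mm));
  	one_unmap_end = min(end, next_unmap_start);

  	/* frees the whole table, or just zaps the covered entries */
  	ret = try_unmap_single_bt(mm, one_unmap_start, one_unmap_end);
  	if (ret)
  		return ret;

  	one_unmap_start = next_unmap_start;
  }
  return 0;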

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/mpx.c |  411 +++++++++++++++++++++-------------------------------
 1 file changed, 168 insertions(+), 243 deletions(-)

diff -puN arch/x86/mm/mpx.c~rewrite-unmap-code arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~rewrite-unmap-code	2015-06-01 10:24:09.726978941 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:09.729979076 -0700
@@ -704,110 +704,6 @@ static int get_bt_addr(struct mm_struct
 	return 0;
 }
 
-/*
- * Free the backing physical pages of bounds table 'bt_addr'.
- * Assume start...end is within that bounds table.
- */
-static int zap_bt_entries(struct mm_struct *mm,
-		unsigned long bt_addr,
-		unsigned long start, unsigned long end)
-{
-	struct vm_area_struct *vma;
-	unsigned long addr, len;
-
-	/*
-	 * Find the first overlapping vma. If vma->vm_start > start, there
-	 * will be a hole in the bounds table. This -EINVAL return will
-	 * cause a SIGSEGV.
-	 */
-	vma = find_vma(mm, start);
-	if (!vma || vma->vm_start > start)
-		return -EINVAL;
-
-	/*
-	 * A NUMA policy on a VM_MPX VMA could cause this bouds table to
-	 * be split. So we need to look across the entire 'start -> end'
-	 * range of this bounds table, find all of the VM_MPX VMAs, and
-	 * zap only those.
-	 */
-	addr = start;
-	while (vma && vma->vm_start < end) {
-		/*
-		 * We followed a bounds directory entry down
-		 * here.  If we find a non-MPX VMA, that's bad,
-		 * so stop immediately and return an error.  This
-		 * probably results in a SIGSEGV.
-		 */
-		if (!is_mpx_vma(vma))
-			return -EINVAL;
-
-		len = min(vma->vm_end, end) - addr;
-		zap_page_range(vma, addr, len, NULL);
-		trace_mpx_unmap_zap(addr, addr+len);
-
-		vma = vma->vm_next;
-		addr = vma->vm_start;
-	}
-
-	return 0;
-}
-
-static int unmap_single_bt(struct mm_struct *mm,
-		long __user *bd_entry, unsigned long bt_addr)
-{
-	unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
-	unsigned long uninitialized_var(actual_old_val);
-	int ret;
-
-	while (1) {
-		int need_write = 1;
-		unsigned long cleared_bd_entry = 0;
-
-		pagefault_disable();
-		ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,
-				bd_entry, expected_old_val, cleared_bd_entry);
-		pagefault_enable();
-		if (!ret)
-			break;
-		if (ret == -EFAULT)
-			ret = mpx_resolve_fault(bd_entry, need_write);
-		/*
-		 * If we could not resolve the fault, consider it
-		 * userspace's fault and error out.
-		 */
-		if (ret)
-			return ret;
-	}
-	/*
-	 * The cmpxchg was performed, check the results.
-	 */
-	if (actual_old_val != expected_old_val) {
-		/*
-		 * Someone else raced with us to unmap the table.
-		 * There was no bounds table pointed to by the
-		 * directory, so declare success.  Somebody freed
-		 * it.
-		 */
-		if (!actual_old_val)
-			return 0;
-		/*
-		 * Something messed with the bounds directory
-		 * entry.  We hold mmap_sem for read or write
-		 * here, so it could not be a _new_ bounds table
-		 * that someone just allocated.  Something is
-		 * wrong, so pass up the error and SIGSEGV.
-		 */
-		return -EINVAL;
-	}
-
-	/*
-	 * Note, we are likely being called under do_munmap() already. To
-	 * avoid recursion, do_munmap() will check whether it comes
-	 * from one bounds table through VM_MPX flag.
-	 */
-	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm));
-}
-
 static inline int bt_entry_size_bytes(struct mm_struct *mm)
 {
 	if (is_64bit_mm(mm))
@@ -872,13 +768,69 @@ static inline unsigned long bd_entry_vir
 }
 
 /*
- * Return an offset in terms of bytes in to the bounds
- * directory where the bounds directory entry for a given
- * virtual address resides.
- *
- * This has to be in bytes because the directory entries
- * are different sizes on 64/32 bit.
+ * Free the backing physical pages of bounds table 'bt_addr'.
+ * Assume start...end is within that bounds table.
  */
+static noinline int zap_bt_entries_mapping(struct mm_struct *mm,
+		unsigned long bt_addr,
+		unsigned long start_mapping, unsigned long end_mapping)
+{
+	struct vm_area_struct *vma;
+	unsigned long addr, len;
+	unsigned long start;
+	unsigned long end;
+
+	/*
+	 * if we 'end' on a boundary, the offset will be 0 which
+	 * is not what we want.  Back it up a byte to get the
+	 * last bt entry.  Then once we have the entry itself,
+	 * move 'end' back up by the table entry size.
+	 */
+	start = bt_addr + mpx_get_bt_entry_offset_bytes(mm, start_mapping);
+	end   = bt_addr + mpx_get_bt_entry_offset_bytes(mm, end_mapping - 1);
+	/*
+	 * Move end back up by one entry.  Among other things
+	 * this ensures that it remains page-aligned and does
+	 * not screw up zap_page_range()
+	 */
+	end += bt_entry_size_bytes(mm);
+
+	/*
+	 * Find the first overlapping vma. If vma->vm_start > start, there
+	 * will be a hole in the bounds table. This -EINVAL return will
+	 * cause a SIGSEGV.
+	 */
+	vma = find_vma(mm, start);
+	if (!vma || vma->vm_start > start)
+		return -EINVAL;
+
+	/*
+	 * A NUMA policy on a VM_MPX VMA could cause this bounds table to
+	 * be split. So we need to look across the entire 'start -> end'
+	 * range of this bounds table, find all of the VM_MPX VMAs, and
+	 * zap only those.
+	 */
+	addr = start;
+	while (vma && vma->vm_start < end) {
+		/*
+		 * We followed a bounds directory entry down
+		 * here.  If we find a non-MPX VMA, that's bad,
+		 * so stop immediately and return an error.  This
+		 * probably results in a SIGSEGV.
+		 */
+		if (!is_mpx_vma(vma))
+			return -EINVAL;
+
+		len = min(vma->vm_end, end) - addr;
+		zap_page_range(vma, addr, len, NULL);
+		trace_mpx_unmap_zap(addr, addr+len);
+
+		vma = vma->vm_next;
+		addr = vma->vm_start;
+	}
+	return 0;
+}
+
 static unsigned long mpx_get_bd_entry_offset(struct mm_struct *mm,
 		unsigned long addr)
 {
@@ -916,69 +868,80 @@ static unsigned long mpx_get_bd_entry_of
 	 */
 }
 
-/*
- * If the bounds table pointed by bounds directory 'bd_entry' is
- * not shared, unmap this whole bounds table. Otherwise, only free
- * those backing physical pages of bounds table entries covered
- * in this virtual address region start...end.
- */
-static int unmap_shared_bt(struct mm_struct *mm,
-		long __user *bd_entry, unsigned long start,
-		unsigned long end, bool prev_shared, bool next_shared)
+static int unmap_entire_bt(struct mm_struct *mm,
+		long __user *bd_entry, unsigned long bt_addr)
 {
-	unsigned long bt_addr;
-	unsigned long start_off, end_off;
+	unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+	unsigned long uninitialized_var(actual_old_val);
 	int ret;
 
-	ret = get_bt_addr(mm, bd_entry, &bt_addr);
+	while (1) {
+		int need_write = 1;
+		unsigned long cleared_bd_entry = 0;
+
+		pagefault_disable();
+		ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,
+				bd_entry, expected_old_val, cleared_bd_entry);
+		pagefault_enable();
+		if (!ret)
+			break;
+		if (ret == -EFAULT)
+			ret = mpx_resolve_fault(bd_entry, need_write);
+		/*
+		 * If we could not resolve the fault, consider it
+		 * userspace's fault and error out.
+		 */
+		if (ret)
+			return ret;
+	}
+	/*
+	 * The cmpxchg was performed, check the results.
+	 */
+	if (actual_old_val != expected_old_val) {
+		/*
+		 * Someone else raced with us to unmap the table.
+		 * That is OK, since we were both trying to do
+		 * the same thing.  Declare success.
+		 */
+		if (!actual_old_val)
+			return 0;
+		/*
+		 * Something messed with the bounds directory
+		 * entry.  We hold mmap_sem for read or write
+		 * here, so it could not be a _new_ bounds table
+		 * that someone just allocated.  Something is
+		 * wrong, so pass up the error and SIGSEGV.
+		 */
+		return -EINVAL;
+	}
 	/*
-	 * We could see an "error" ret for not-present bounds
-	 * tables (not really an error), or actual errors, but
-	 * stop unmapping either way.
+	 * Note, we are likely being called under do_munmap() already. To
+	 * avoid recursion, do_munmap() will check whether it comes
+	 * from one bounds table through VM_MPX flag.
 	 */
-	if (ret)
-		return ret;
-
-	start_off = mpx_get_bt_entry_offset_bytes(mm, start);
-	end_off   = mpx_get_bt_entry_offset_bytes(mm, end);
-
-	if (prev_shared && next_shared)
-		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr + start_off,
-				bt_addr + end_off);
-	else if (prev_shared)
-		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr + start_off,
-				bt_addr + mpx_bt_size_bytes(mm));
-	else if (next_shared)
-		ret = zap_bt_entries(mm, bt_addr, bt_addr,
-				bt_addr + end_off);
-	else
-		ret = unmap_single_bt(mm, bd_entry, bt_addr);
-
-	return ret;
+	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm));
 }
 
-/*
- * A virtual address region being munmap()ed might share bounds table
- * with adjacent VMAs. We only need to free the backing physical
- * memory of these shared bounds tables entries covered in this virtual
- * address region.
- */
-static int unmap_edge_bts(struct mm_struct *mm,
-		unsigned long start, unsigned long end)
+static int try_unmap_single_bt(struct mm_struct *mm,
+	       unsigned long start, unsigned long end)
 {
+	struct vm_area_struct *next;
+	struct vm_area_struct *prev;
+	/*
+	 * "bta" == Bounds Table Area: the area controlled by the
+	 * bounds table that we are unmapping.
+	 */
+	unsigned long bta_start_vaddr = start & ~(bd_entry_virt_space(mm)-1);
+	unsigned long bta_end_vaddr = bta_start_vaddr + bd_entry_virt_space(mm);
+	unsigned long uninitialized_var(bt_addr);
+	void __user *bde_vaddr;
 	int ret;
-	long __user *bde_start, *bde_end;
-	struct vm_area_struct *prev, *next;
-	bool prev_shared = false, next_shared = false;
-
-	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
-	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
-
 	/*
-	 * Check whether bde_start and bde_end are shared with adjacent
-	 * VMAs.
+	 * We know 'start' and 'end' lie within an area controlled
+	 * by a single bounds table.  See if there are any other
+	 * VMAs controlled by that bounds table.  If there are not
+	 * then we can "expand" the area we are unmapping to possibly
+	 * cover the entire table.
 	 *
 	 * We already unliked the VMAs from the mm's rbtree so 'start'
 	 * is guaranteed to be in a hole. This gets us the first VMA
@@ -986,102 +949,64 @@ static int unmap_edge_bts(struct mm_stru
 	 * in to 'next'.
 	 */
 	next = find_vma_prev(mm, start, &prev);
-	if (prev && (mm->bd_addr + mpx_get_bd_entry_offset(mm, prev->vm_end-1))
-			== bde_start)
-		prev_shared = true;
-	if (next && (mm->bd_addr + mpx_get_bd_entry_offset(mm, next->vm_start))
-			== bde_end)
-		next_shared = true;
-
-	/*
-	 * This virtual address region being munmap()ed is only
-	 * covered by one bounds table.
-	 *
-	 * In this case, if this table is also shared with adjacent
-	 * VMAs, only part of the backing physical memory of the bounds
-	 * table need be freeed. Otherwise the whole bounds table need
-	 * be unmapped.
-	 */
-	if (bde_start == bde_end) {
-		return unmap_shared_bt(mm, bde_start, start, end,
-				prev_shared, next_shared);
+	if ((!prev || prev->vm_end <= bta_start_vaddr) &&
+	    (!next || next->vm_start >= bta_end_vaddr)) {
+		/*
+		 * No neighbor VMAs controlled by same bounds
+		 * table.  Try to unmap the whole thing
+		 */
+		start = bta_start_vaddr;
+		end = bta_end_vaddr;
 	}
 
+	bde_vaddr = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
+	ret = get_bt_addr(mm, bde_vaddr, &bt_addr);
 	/*
-	 * If more than one bounds tables are covered in this virtual
-	 * address region being munmap()ed, we need to separately check
-	 * whether bde_start and bde_end are shared with adjacent VMAs.
+	 * No bounds table there, so nothing to unmap.
 	 */
-	ret = unmap_shared_bt(mm, bde_start, start, end, prev_shared, false);
-	if (ret)
-		return ret;
-	ret = unmap_shared_bt(mm, bde_end, start, end, false, next_shared);
+	if (ret == -ENOENT) {
+		ret = 0;
+		return 0;
+	}
 	if (ret)
 		return ret;
-
-	return 0;
+	/*
+	 * We are unmapping an entire table.  Either because the
+	 * unmap that started this whole process was large enough
+	 * to cover an entire table, or that the unmap was small
+	 * but was the area covered by a bounds table.
+	 */
+	if ((start == bta_start_vaddr) &&
+	    (end == bta_end_vaddr))
+		return unmap_entire_bt(mm, bde_vaddr, bt_addr);
+	return zap_bt_entries_mapping(mm, bt_addr, start, end);
 }
 
 static int mpx_unmap_tables(struct mm_struct *mm,
 		unsigned long start, unsigned long end)
 {
-	int ret;
-	long __user *bd_entry, *bde_start, *bde_end;
-	unsigned long bt_addr;
-
+	unsigned long one_unmap_start;
 	trace_mpx_unmap_search(start, end);
-	/*
-	 * "Edge" bounds tables are those which are being used by the region
-	 * (start -> end), but that may be shared with adjacent areas.  If they
-	 * turn out to be completely unshared, they will be freed.  If they are
-	 * shared, we will free the backing store (like an MADV_DONTNEED) for
-	 * areas used by this region.
-	 */
-	ret = unmap_edge_bts(mm, start, end);
-	switch (ret) {
-		/* non-present tables are OK */
-		case 0:
-		case -ENOENT:
-			/* Success, or no tables to unmap */
-			break;
-		case -EINVAL:
-		case -EFAULT:
-		default:
-			return ret;
-	}
-
-	/*
-	 * Only unmap the bounds table that are
-	 *   1. fully covered
-	 *   2. not at the edges of the mapping, even if full aligned
-	 */
-	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
-	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
-	for (bd_entry = bde_start + 1; bd_entry < bde_end; bd_entry++) {
-		ret = get_bt_addr(mm, bd_entry, &bt_addr);
-		switch (ret) {
-			case 0:
-				break;
-			case -ENOENT:
-				/* No table here, try the next one */
-				continue;
-			case -EINVAL:
-			case -EFAULT:
-			default:
-				/*
-				 * Note: we are being strict here.
-				 * Any time we run in to an issue
-				 * unmapping tables, we stop and
-				 * SIGSEGV.
-				 */
-				return ret;
-		}
 
-		ret = unmap_single_bt(mm, bd_entry, bt_addr);
+	one_unmap_start = start;
+	while (one_unmap_start < end) {
+		int ret;
+		unsigned long next_unmap_start = ALIGN(one_unmap_start+1,
+						       bd_entry_virt_space(mm));
+		unsigned long one_unmap_end = end;
+		/*
+		 * if the end is beyond the current bounds table,
+		 * move it back so we only deal with a single one
+		 * at a time
+		 */
+		if (one_unmap_end > next_unmap_start)
+			one_unmap_end = next_unmap_start;
+		ret = try_unmap_single_bt(mm, one_unmap_start, one_unmap_end);
 		if (ret)
 			return ret;
-	}
 
+		one_unmap_start = next_unmap_start;
+	}
 	return 0;
 }
 
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 19/19] x86, mpx: allow mixed binaries again
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (17 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 17/19] x86, mpx: rewrite unmap code Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 12:36   ` [tip:x86/fpu] x86/mpx: Allow 32-bit binaries on 64-bit kernels again tip-bot for Dave Hansen
  18 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We explicitly disable allowing 32-bit binaries to enable
MPX on 64-bit kernels.  Re-allow that.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/mpx.c |    6 ------
 1 file changed, 6 deletions(-)

diff -puN arch/x86/mm/mpx.c~x86-mpx-allow-mixed-binaries-again arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~x86-mpx-allow-mixed-binaries-again	2015-06-01 10:24:26.813749587 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:26.816749722 -0700
@@ -367,12 +367,6 @@ static __user void *mpx_get_bounds_dir(v
 		return MPX_INVALID_BOUNDS_DIR;
 
 	/*
-	 * 32-bit binaries on 64-bit kernels are currently
-	 * unsupported.
-	 */
-	if (IS_ENABLED(CONFIG_X86_64) && test_thread_flag(TIF_IA32))
-		return MPX_INVALID_BOUNDS_DIR;
-	/*
 	 * The bounds directory pointer is stored in a register
 	 * only accessible if we first do an xsave.
 	 */
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping
  2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
                   ` (15 preceding siblings ...)
  2015-06-07 18:37 ` [PATCH 16/19] x86, mpx: support 32-bit binaries on 64-bit kernel Dave Hansen
@ 2015-06-07 18:37 ` Dave Hansen
  2015-06-09 10:23   ` Ingo Molnar
  2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Do " tip-bot for Dave Hansen
  2015-06-07 18:37 ` [PATCH 17/19] x86, mpx: rewrite unmap code Dave Hansen
  2015-06-07 18:37 ` [PATCH 19/19] x86, mpx: allow mixed binaries again Dave Hansen
  18 siblings, 2 replies; 62+ messages in thread
From: Dave Hansen @ 2015-06-07 18:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, tglx, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The comment pretty much says it all.

I wrote a test program that does lots of random allocations
and forces bounds tables to be created.  It came up with a
layout like this:

  ....   | BOUNDS DIRECTORY ENTRY COVERS |  ....
         |    BOUNDS TABLE COVERS        |
|  BOUNDS TABLE |  REAL ALLOC | BOUNDS TABLE |

Unmapping "REAL ALLOC" should have been able to free the
bounds table "covering" the "REAL ALLOC" because it was the
last real user.  But, the neighboring VMA bounds tables were
found, considered as real neighbors, and we declined to free
the bounds table covering the area.

Doing this over and over left a small but significant number
of these orphans.  Handling them is fairly straightforward.
All we have to do is walk the VMAs and skip all of the MPX
ones when looking for neighbors.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/mpx.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff -puN arch/x86/mm/mpx.c~mpx-dont-count-mpx-vmas-as-neighbors arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-dont-count-mpx-vmas-as-neighbors	2015-06-01 10:24:10.037992968 -0700
+++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:10.040993104 -0700
@@ -937,16 +937,30 @@ static int try_unmap_single_bt(struct mm
 	void __user *bde_vaddr;
 	int ret;
 	/*
+	 * We already unliked the VMAs from the mm's rbtree so 'start'
+	 * is guaranteed to be in a hole. This gets us the first VMA
+	 * before the hole in to 'prev' and the next VMA after the hole
+	 * in to 'next'.
+	 */
+	next = find_vma_prev(mm, start, &prev);
+	/*
+	 * Do not count other MPX bounds table VMAs as neighbors.
+	 * Although theoretically possible, we do not allow bounds
+	 * tables for bounds tables so our heads do not explode.
+	 * If we count them as neighbors here, we may end up with
+	 * lots of tables even though we have no actual table
+	 * entries in use.
+	 */
+	while (next && is_mpx_vma(next))
+		next = next->vm_next;
+	while (prev && is_mpx_vma(prev))
+		prev = prev->vm_prev;
+	/*
 	 * We know 'start' and 'end' lie within an area controlled
 	 * by a single bounds table.  See if there are any other
 	 * VMAs controlled by that bounds table.  If there are not
 	 * then we can "expand" the area we are unmapping to possibly
 	 * cover the entire table.
-	 *
-	 * We already unliked the VMAs from the mm's rbtree so 'start'
-	 * is guaranteed to be in a hole. This gets us the first VMA
-	 * before the hole in to 'prev' and the next VMA after the hole
-	 * in to 'next'.
 	 */
 	next = find_vma_prev(mm, start, &prev);
 	if ((!prev || prev->vm_end <= bta_start_vaddr) &&
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping
  2015-06-07 18:37 ` [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping Dave Hansen
@ 2015-06-09 10:23   ` Ingo Molnar
  2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Do " tip-bot for Dave Hansen
  1 sibling, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-06-09 10:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, x86, tglx, dave.hansen


* Dave Hansen <dave@sr71.net> wrote:

> diff -puN arch/x86/mm/mpx.c~mpx-dont-count-mpx-vmas-as-neighbors arch/x86/mm/mpx.c
> --- a/arch/x86/mm/mpx.c~mpx-dont-count-mpx-vmas-as-neighbors	2015-06-01 10:24:10.037992968 -0700
> +++ b/arch/x86/mm/mpx.c	2015-06-01 10:24:10.040993104 -0700
> @@ -937,16 +937,30 @@ static int try_unmap_single_bt(struct mm
>  	void __user *bde_vaddr;
>  	int ret;
>  	/*
> +	 * We already unliked the VMAs from the mm's rbtree so 'start'
> +	 * is guaranteed to be in a hole. This gets us the first VMA
> +	 * before the hole in to 'prev' and the next VMA after the hole
> +	 * in to 'next'.

Hey, I didn't know VMAs were on Facebook ;-)

I fixed it up to 'unlinked'.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [tip:x86/fpu] x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions
  2015-06-07 18:37 ` [PATCH 01/19] x86, mpx, xsave: Fix up bad get_xsave_addr() assumptions Dave Hansen
@ 2015-06-09 12:30   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, linux-kernel, dave, tglx, akpm, dave.hansen, torvalds,
	mingo, hpa

Commit-ID:  0c4109bec0a6cde471bef3a21cd6f8384a614469
Gitweb:     http://git.kernel.org/tip/0c4109bec0a6cde471bef3a21cd6f8384a614469
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:00 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:29 +0200

x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions

get_xsave_addr() assumes that if an xsave bit is present in the
hardware (pcntxt_mask), it is present in a given xsave
buffer.  Due to a bug in the xsave code on all of the systems
that have MPX (and thus all the users of this code), that has
been a true assumption.

But, the bug is getting fixed, so our assumption is not going
to hold any more.

It's quite possible (and normal) for an enabled state to be
present on 'pcntxt_mask', but *not* in 'xstate_bv'.  We need
to consult 'xstate_bv'.
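
Conceptually, the patch below is the difference between the two masks
(a schematic, not the exact code):

  /*
   * xfeatures_mask (pcntxt_mask): states the kernel enabled in XCR0.
   * xsave->header.xfeatures (xstate_bv): states the last xsave*
   * instruction actually wrote into this buffer.
   */
  WARN_ONCE(!(xfeatures_mask & xstate_feature), "get of unsupported state");

  if (!(xsave->header.xfeatures & xstate_feature))
  	return NULL;	/* enabled in hardware, but not present in this buffer */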

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183700.1E739B34@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/fpu/xstate.c | 45 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a580eb5..af3700e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -382,19 +382,48 @@ void fpu__resume_cpu(void)
  * This is the API that is called to get xstate address in either
  * standard format or compacted format of xsave area.
  *
+ * Note that if there is no data for the field in the xsave buffer
+ * this will return NULL.
+ *
  * Inputs:
- *	xsave: base address of the xsave area;
- *	xstate: state which is defined in xsave.h (e.g. XSTATE_FP, XSTATE_SSE,
- *	etc.)
+ *	xstate: the thread's storage area for all FPU data
+ *	xstate_feature: state which is defined in xsave.h (e.g.
+ *	XSTATE_FP, XSTATE_SSE, etc...)
  * Output:
- *	address of the state in the xsave area.
+ *	address of the state in the xsave area, or NULL if the
+ *	field is not present in the xsave buffer.
  */
-void *get_xsave_addr(struct xregs_state *xsave, int xstate)
+void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 {
-	int feature = fls64(xstate) - 1;
-	if (!test_bit(feature, (unsigned long *)&xfeatures_mask))
+	int feature_nr = fls64(xstate_feature) - 1;
+	/*
+	 * Do we even *have* xsave state?
+	 */
+	if (!boot_cpu_has(X86_FEATURE_XSAVE))
+		return NULL;
+
+	xsave = &current->thread.fpu.state.xsave;
+	/*
+	 * We should not ever be requesting features that we
+	 * have not enabled.  Remember that pcntxt_mask is
+	 * what we write to the XCR0 register.
+	 */
+	WARN_ONCE(!(xfeatures_mask & xstate_feature),
+		  "get of unsupported state");
+	/*
+	 * This assumes the last 'xsave*' instruction to
+	 * have requested that 'xstate_feature' be saved.
+	 * If it did not, we might be seeing an old value
+	 * of the field in the buffer.
+	 *
+	 * This can happen because the last 'xsave' did not
+	 * request that this feature be saved (unlikely)
+	 * or because the "init optimization" caused it
+	 * to not be saved.
+	 */
+	if (!(xsave->header.xfeatures & xstate_feature))
 		return NULL;
 
-	return (void *)xsave + xstate_comp_offsets[feature];
+	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [tip:x86/fpu] x86/fpu/xstate: Wrap get_xsave_addr() to make it safer
  2015-06-07 18:37 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
@ 2015-06-09 12:31   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: sbsiddha, peterz, torvalds, luto, linux-kernel, dave, oleg, hpa,
	fenghua.yu, riel, mingo, dave.hansen, tglx, akpm

Commit-ID:  04cd027bcba1ead7bfe39e7f1c6f4d993c4c3323
Gitweb:     http://git.kernel.org/tip/04cd027bcba1ead7bfe39e7f1c6f4d993c4c3323
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:00 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:29 +0200

x86/fpu/xstate: Wrap get_xsave_addr() to make it safer

The MPX code is calling a low-level FPU function
(copy_fpregs_to_fpstate()).  This function cannot be
called in all contexts, although it is safe to call
directly in some cases.

Although probably correct, the current code is ugly and
potentially error-prone.  So, add a wrapper that calls
the (slightly) higher-level fpu__save() (which is preempt-
safe) and also ensures that we even *have* an FPU context
(in the case that this was called when in lazy FPU mode).

Ingo had this to say about the details about when we need
preemption disabled:

> it's indeed generally unsafe to access/copy FPU registers with preemption enabled,
> for two reasons:
>
>   - on older systems that use FSAVE the instruction destroys FPU register
>     contents, which has to be handled carefully
>
>   - even on newer systems if we copy to FPU registers (which this code doesn't)
>     then we don't want a context switch to occur in the middle of it, because a
>     context switch will write to the fpstate, potentially overwriting our new data
>     with old FPU state.
>
> But it's safe to access FPU registers with preemption enabled in a couple of
> special cases:
>
>   - potentially destructively saving FPU registers: the signal handling code does
>     this in copy_fpstate_to_sigframe(), because it can rely on the signal restore
>     side to restore the original FPU state.
>
>   - reading FPU registers on modern systems: we don't do this anywhere at the
>     moment, mostly to keep symmetry with older systems where FSAVE is
>     destructive.
>
>   - initializing FPU registers on modern systems: fpu__clear() does this. Here
>     it's safe because we don't copy from the fpstate.
>
>   - directly writing FPU registers from user-space memory (!). We do this in
>     fpu__restore_sig(), and it's safe because neither context switches nor
>     irq-handler FPU use can corrupt the source context of the copy (which is
>     user-space memory).
>
> Note that the MPX code's current use of copy_fpregs_to_fpstate() was safe I think,
> because:
>
>  - MPX is predicated on eagerfpu, so the destructive F[N]SAVE instruction won't be
>    used.
>
>  - the code was only reading FPU registers, and was doing it only in places that
>    guaranteed that an FPU state was already active (i.e. didn't do it in
>    kthreads)
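
The resulting calling convention on the MPX side is short (this is how
the series uses it later, for instance in do_mpx_bt_fault(); the error
value shown here is illustrative):

  const struct bndcsr *bndcsr;

  bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
  if (!bndcsr)
  	return -EINVAL;	/* no MPX register state saved for this task */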

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183700.AA881696@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fpu/xstate.h |  1 +
 arch/x86/kernel/fpu/xstate.c      | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 3398946..4656b25 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -41,5 +41,6 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
 
 void *get_xsave_addr(struct xregs_state *xsave, int xstate);
+const void *get_xsave_field_ptr(int xstate_field);
 
 #endif
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index af3700e..49d0d9b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -427,3 +427,35 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+/*
+ * This wraps up the common operations that need to occur when retrieving
+ * data from xsave state.  It first ensures that the current task was
+ * using the FPU and retrieves the data in to a buffer.  It then calculates
+ * the offset of the requested field in the buffer.
+ *
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xsave_state: state which is defined in xsave.h (e.g. XSTATE_FP,
+ *	XSTATE_SSE, etc...)
+ * Output:
+ *	address of the state in the xsave area or NULL if the state
+ *	is not present or is in its 'init state'.
+ */
+const void *get_xsave_field_ptr(int xsave_state)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	if (!fpu->fpstate_active)
+		return NULL;
+	/*
+	 * fpu__save() takes the CPU's xstate registers
+	 * and saves them off to the 'fpu memory buffer.
+	 */
+	fpu__save(fpu);
+
+	return get_xsave_addr(&fpu->state.xsave, xsave_state);
+}

* [tip:x86/fpu] x86/mpx: Use the new get_xsave_field_ptr() API
  2015-06-07 18:37 ` [PATCH 03/19] x86, mpx: Use new get_xsave_field_ptr() Dave Hansen
@ 2015-06-09 12:31   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, oleg, linux-kernel, tglx, dave, riel, dave.hansen,
	peterz, hpa, sbsiddha, akpm, mingo, fenghua.yu, luto

Commit-ID:  a84eeaa96b36a03188e1423349669c108d3a4bd7
Gitweb:     http://git.kernel.org/tip/a84eeaa96b36a03188e1423349669c108d3a4bd7
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:01 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:30 +0200

x86/mpx: Use the new get_xsave_field_ptr() API

The MPX registers (bndcsr/bndcfgu/bndstatus) are not directly
accessible via normal instructions.  They essentially act as
if they were floating point registers and are saved/restored
along with those registers.

There are two main paths in the MPX code where we care about
the contents of these registers:

	1. #BR (bounds) faults
	2. the prctl() code where we are setting MPX up

Both of those paths _might_ be called without the FPU having
been used.  That means that 'tsk->thread.fpu.state' might
never be allocated.

Also, fpu_save_init() is not preempt-safe.  It was a bug to
call it without disabling preemption.  The new
get_xsave_field_ptr() calls fpu__save() instead and properly
disables preemption.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183701.BC0D37CF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mpx.h |  8 ++++----
 arch/x86/kernel/traps.c    | 17 ++++++++---------
 arch/x86/mm/mpx.c          | 30 +++++++++++++++---------------
 3 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index f3c1b71..39f2d0f 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -60,8 +60,8 @@
 
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct xregs_state *xsave_buf);
-int mpx_handle_bd_fault(struct xregs_state *xsave_buf);
+				struct task_struct *tsk);
+int mpx_handle_bd_fault(struct task_struct *tsk);
 static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
 {
 	return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
@@ -78,11 +78,11 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-					      struct xregs_state *xsave_buf)
+					      struct task_struct *tsk)
 {
 	return NULL;
 }
-static inline int mpx_handle_bd_fault(struct xregs_state *xsave_buf)
+static inline int mpx_handle_bd_fault(struct task_struct *tsk)
 {
 	return -EINVAL;
 }
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a2510f2..42f1531 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -59,6 +59,7 @@
 #include <asm/fixmap.h>
 #include <asm/mach_traps.h>
 #include <asm/alternative.h>
+#include <asm/fpu/xstate.h>
 #include <asm/mpx.h>
 
 #ifdef CONFIG_X86_64
@@ -371,9 +372,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 {
 	struct task_struct *tsk = current;
-	struct xregs_state *xsave_buf;
 	enum ctx_state prev_state;
-	struct bndcsr *bndcsr;
+	const struct bndcsr *bndcsr;
 	siginfo_t *info;
 
 	prev_state = exception_enter();
@@ -392,12 +392,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 
 	/*
 	 * We need to look at BNDSTATUS to resolve this exception.
-	 * It is not directly accessible, though, so we need to
-	 * do an xsave and then pull it out of the xsave buffer.
+	 * A NULL here might mean that it is in its 'init state',
+	 * which is all zeros which indicates MPX was not
+	 * responsible for the exception.
 	 */
-	copy_fpregs_to_fpstate(&tsk->thread.fpu);
-	xsave_buf = &(tsk->thread.fpu.state.xsave);
-	bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);
+	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
 		goto exit_trap;
 
@@ -408,11 +407,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	 */
 	switch (bndcsr->bndstatus & MPX_BNDSTA_ERROR_CODE) {
 	case 2:	/* Bound directory has invalid entry. */
-		if (mpx_handle_bd_fault(xsave_buf))
+		if (mpx_handle_bd_fault(tsk))
 			goto exit_trap;
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
-		info = mpx_generate_siginfo(regs, xsave_buf);
+		info = mpx_generate_siginfo(regs, tsk);
 		if (IS_ERR(info)) {
 			/*
 			 * We failed to decode the MPX instruction.  Act as if
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 2e0dfd3..9d67e23 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -272,9 +272,9 @@ bad_opcode:
  * The caller is expected to kfree() the returned siginfo_t.
  */
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct xregs_state *xsave_buf)
+				struct task_struct *tsk)
 {
-	struct bndreg *bndregs, *bndreg;
+	const struct bndreg *bndregs, *bndreg;
 	siginfo_t *info = NULL;
 	struct insn insn;
 	uint8_t bndregno;
@@ -294,8 +294,8 @@ siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
 		err = -EINVAL;
 		goto err_out;
 	}
-	/* get the bndregs _area_ of the xsave structure */
-	bndregs = get_xsave_addr(xsave_buf, XSTATE_BNDREGS);
+	/* get bndregs field from current task's xsave area */
+	bndregs = get_xsave_field_ptr(XSTATE_BNDREGS);
 	if (!bndregs) {
 		err = -EINVAL;
 		goto err_out;
@@ -342,7 +342,7 @@ err_out:
 
 static __user void *task_get_bounds_dir(struct task_struct *tsk)
 {
-	struct bndcsr *bndcsr;
+	const struct bndcsr *bndcsr;
 
 	if (!cpu_feature_enabled(X86_FEATURE_MPX))
 		return MPX_INVALID_BOUNDS_DIR;
@@ -357,8 +357,7 @@ static __user void *task_get_bounds_dir(struct task_struct *tsk)
 	 * The bounds directory pointer is stored in a register
 	 * only accessible if we first do an xsave.
 	 */
-	copy_fpregs_to_fpstate(&tsk->thread.fpu);
-	bndcsr = get_xsave_addr(&tsk->thread.fpu.state.xsave, XSTATE_BNDCSR);
+	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
 		return MPX_INVALID_BOUNDS_DIR;
 
@@ -389,9 +388,10 @@ int mpx_enable_management(struct task_struct *tsk)
 	 * directory into XSAVE/XRSTOR Save Area and enable MPX through
 	 * XRSTOR instruction.
 	 *
-	 * copy_xregs_to_kernel() is expected to be very expensive. Storing the bounds
-	 * directory here means that we do not have to do xsave in the unmap
-	 * path; we can just use mm->bd_addr instead.
+	 * The copy_xregs_to_kernel() beneath get_xsave_field_ptr() is
+	 * expected to be relatively expensive. Storing the bounds
+	 * directory here means that we do not have to do xsave in the
+	 * unmap path; we can just use mm->bd_addr instead.
 	 */
 	bd_base = task_get_bounds_dir(tsk);
 	down_write(&mm->mmap_sem);
@@ -497,12 +497,12 @@ out_unmap:
  * bound table is 16KB. With 64-bit mode, the size of BD is 2GB,
  * and the size of each bound table is 4MB.
  */
-static int do_mpx_bt_fault(struct xregs_state *xsave_buf)
+static int do_mpx_bt_fault(struct task_struct *tsk)
 {
 	unsigned long bd_entry, bd_base;
-	struct bndcsr *bndcsr;
+	const struct bndcsr *bndcsr;
 
-	bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);
+	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
 		return -EINVAL;
 	/*
@@ -525,7 +525,7 @@ static int do_mpx_bt_fault(struct xregs_state *xsave_buf)
 	return allocate_bt((long __user *)bd_entry);
 }
 
-int mpx_handle_bd_fault(struct xregs_state *xsave_buf)
+int mpx_handle_bd_fault(struct task_struct *tsk)
 {
 	/*
 	 * Userspace never asked us to manage the bounds tables,
@@ -534,7 +534,7 @@ int mpx_handle_bd_fault(struct xregs_state *xsave_buf)
 	if (!kernel_managing_mpx_tables(current->mm))
 		return -EINVAL;
 
-	if (do_mpx_bt_fault(xsave_buf)) {
+	if (do_mpx_bt_fault(tsk)) {
 		force_sig(SIGSEGV, current);
 		/*
 		 * The force_sig() is essentially "handling" this

* [tip:x86/fpu] x86/mpx: Clean up the code by not passing a task pointer around when unnecessary
  2015-06-07 18:37 ` [PATCH 04/19] x86, mpx: Cleanup: Do not pass task around when unnecessary Dave Hansen
@ 2015-06-09 12:31   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, dave.hansen, linux-kernel, hpa, akpm, oleg, peterz,
	dave, torvalds

Commit-ID:  46a6e0cf1c6665a8e867d8f7798d7a3538633f03
Gitweb:     http://git.kernel.org/tip/46a6e0cf1c6665a8e867d8f7798d7a3538633f03
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:02 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:30 +0200

x86/mpx: Clean up the code by not passing a task pointer around when unnecessary

The MPX code can only work on the current task.  You cannot,
for instance, enable MPX management in another process or
thread, nor can you handle a fault for another process or
thread.

Despite this, we pass a task_struct around prolifically.  This
patch removes all of the task struct passing for code paths
where the code cannot deal with another task (which turns out
to be all of them).

This has no functional changes.  It's just a cleanup.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mpx.h       | 10 ++++------
 arch/x86/include/asm/processor.h | 12 ++++++------
 arch/x86/kernel/traps.c          |  5 ++---
 arch/x86/mm/mpx.c                | 19 +++++++++----------
 kernel/sys.c                     |  8 ++++----
 5 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 39f2d0f..0cdd16a 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -59,9 +59,8 @@
 		MPX_BT_ENTRY_MASK) << MPX_BT_ENTRY_SHIFT)
 
 #ifdef CONFIG_X86_INTEL_MPX
-siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct task_struct *tsk);
-int mpx_handle_bd_fault(struct task_struct *tsk);
+siginfo_t *mpx_generate_siginfo(struct pt_regs *regs);
+int mpx_handle_bd_fault(void);
 static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
 {
 	return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
@@ -77,12 +76,11 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
 #else
-static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-					      struct task_struct *tsk)
+static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
 	return NULL;
 }
-static inline int mpx_handle_bd_fault(struct task_struct *tsk)
+static inline int mpx_handle_bd_fault(void)
 {
 	return -EINVAL;
 }
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 8e04f51..53dbd2b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -802,18 +802,18 @@ extern int get_tsc_mode(unsigned long adr);
 extern int set_tsc_mode(unsigned int val);
 
 /* Register/unregister a process' MPX related resource */
-#define MPX_ENABLE_MANAGEMENT(tsk)	mpx_enable_management((tsk))
-#define MPX_DISABLE_MANAGEMENT(tsk)	mpx_disable_management((tsk))
+#define MPX_ENABLE_MANAGEMENT()	mpx_enable_management()
+#define MPX_DISABLE_MANAGEMENT()	mpx_disable_management()
 
 #ifdef CONFIG_X86_INTEL_MPX
-extern int mpx_enable_management(struct task_struct *tsk);
-extern int mpx_disable_management(struct task_struct *tsk);
+extern int mpx_enable_management(void);
+extern int mpx_disable_management(void);
 #else
-static inline int mpx_enable_management(struct task_struct *tsk)
+static inline int mpx_enable_management(void)
 {
 	return -EINVAL;
 }
-static inline int mpx_disable_management(struct task_struct *tsk)
+static inline int mpx_disable_management(void)
 {
 	return -EINVAL;
 }
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 42f1531..cffff66 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -371,7 +371,6 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 
 dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 {
-	struct task_struct *tsk = current;
 	enum ctx_state prev_state;
 	const struct bndcsr *bndcsr;
 	siginfo_t *info;
@@ -407,11 +406,11 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	 */
 	switch (bndcsr->bndstatus & MPX_BNDSTA_ERROR_CODE) {
 	case 2:	/* Bound directory has invalid entry. */
-		if (mpx_handle_bd_fault(tsk))
+		if (mpx_handle_bd_fault())
 			goto exit_trap;
 		break; /* Success, it was handled */
 	case 1: /* Bound violation. */
-		info = mpx_generate_siginfo(regs, tsk);
+		info = mpx_generate_siginfo(regs);
 		if (IS_ERR(info)) {
 			/*
 			 * We failed to decode the MPX instruction.  Act as if
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9d67e23..47e4a856 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -271,8 +271,7 @@ bad_opcode:
  *
  * The caller is expected to kfree() the returned siginfo_t.
  */
-siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
-				struct task_struct *tsk)
+siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
 	const struct bndreg *bndregs, *bndreg;
 	siginfo_t *info = NULL;
@@ -340,7 +339,7 @@ err_out:
 	return ERR_PTR(err);
 }
 
-static __user void *task_get_bounds_dir(struct task_struct *tsk)
+static __user void *mpx_get_bounds_dir(void)
 {
 	const struct bndcsr *bndcsr;
 
@@ -376,10 +375,10 @@ static __user void *task_get_bounds_dir(struct task_struct *tsk)
 		(bndcsr->bndcfgu & MPX_BNDCFG_ADDR_MASK);
 }
 
-int mpx_enable_management(struct task_struct *tsk)
+int mpx_enable_management(void)
 {
 	void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
-	struct mm_struct *mm = tsk->mm;
+	struct mm_struct *mm = current->mm;
 	int ret = 0;
 
 	/*
@@ -393,7 +392,7 @@ int mpx_enable_management(struct task_struct *tsk)
 	 * directory here means that we do not have to do xsave in the
 	 * unmap path; we can just use mm->bd_addr instead.
 	 */
-	bd_base = task_get_bounds_dir(tsk);
+	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
 	mm->bd_addr = bd_base;
 	if (mm->bd_addr == MPX_INVALID_BOUNDS_DIR)
@@ -403,7 +402,7 @@ int mpx_enable_management(struct task_struct *tsk)
 	return ret;
 }
 
-int mpx_disable_management(struct task_struct *tsk)
+int mpx_disable_management(void)
 {
 	struct mm_struct *mm = current->mm;
 
@@ -497,7 +496,7 @@ out_unmap:
  * bound table is 16KB. With 64-bit mode, the size of BD is 2GB,
  * and the size of each bound table is 4MB.
  */
-static int do_mpx_bt_fault(struct task_struct *tsk)
+static int do_mpx_bt_fault(void)
 {
 	unsigned long bd_entry, bd_base;
 	const struct bndcsr *bndcsr;
@@ -525,7 +524,7 @@ static int do_mpx_bt_fault(struct task_struct *tsk)
 	return allocate_bt((long __user *)bd_entry);
 }
 
-int mpx_handle_bd_fault(struct task_struct *tsk)
+int mpx_handle_bd_fault(void)
 {
 	/*
 	 * Userspace never asked us to manage the bounds tables,
@@ -534,7 +533,7 @@ int mpx_handle_bd_fault(struct task_struct *tsk)
 	if (!kernel_managing_mpx_tables(current->mm))
 		return -EINVAL;
 
-	if (do_mpx_bt_fault(tsk)) {
+	if (do_mpx_bt_fault()) {
 		force_sig(SIGSEGV, current);
 		/*
 		 * The force_sig() is essentially "handling" this
diff --git a/kernel/sys.c b/kernel/sys.c
index a4e372b..8571296 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -92,10 +92,10 @@
 # define SET_TSC_CTL(a)		(-EINVAL)
 #endif
 #ifndef MPX_ENABLE_MANAGEMENT
-# define MPX_ENABLE_MANAGEMENT(a)	(-EINVAL)
+# define MPX_ENABLE_MANAGEMENT()	(-EINVAL)
 #endif
 #ifndef MPX_DISABLE_MANAGEMENT
-# define MPX_DISABLE_MANAGEMENT(a)	(-EINVAL)
+# define MPX_DISABLE_MANAGEMENT()	(-EINVAL)
 #endif
 #ifndef GET_FP_MODE
 # define GET_FP_MODE(a)		(-EINVAL)
@@ -2230,12 +2230,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_MPX_ENABLE_MANAGEMENT:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = MPX_ENABLE_MANAGEMENT(me);
+		error = MPX_ENABLE_MANAGEMENT();
 		break;
 	case PR_MPX_DISABLE_MANAGEMENT:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = MPX_DISABLE_MANAGEMENT(me);
+		error = MPX_DISABLE_MANAGEMENT();
 		break;
 	case PR_SET_FP_MODE:
 		error = SET_FP_MODE(me, arg2);

* [tip:x86/fpu] x86/mpx: Remove redundant MPX_BNDCFG_ADDR_MASK
  2015-06-07 18:37 ` [PATCH 05/19] x86, mpx: remove redundant MPX_BNDCFG_ADDR_MASK Dave Hansen
@ 2015-06-09 12:32   ` tip-bot for Qiaowei Ren
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Qiaowei Ren @ 2015-06-09 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: qiaowei.ren, mingo, tglx, hpa, linux-kernel, dave, dave.hansen,
	peterz, akpm, torvalds

Commit-ID:  3c1d32300920a446c67d697cd6b80f012ad06028
Gitweb:     http://git.kernel.org/tip/3c1d32300920a446c67d697cd6b80f012ad06028
Author:     Qiaowei Ren <qiaowei.ren@intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:02 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:30 +0200

x86/mpx: Remove redundant MPX_BNDCFG_ADDR_MASK

MPX_BNDCFG_ADDR_MASK is defined twice, so this patch removes the
redundant definition.

Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183702.5F129376@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mpx.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 0cdd16a..871e5e5 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -45,7 +45,6 @@
 #define MPX_BNDSTA_TAIL		2
 #define MPX_BNDCFG_TAIL		12
 #define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
-#define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BT_ADDR_MASK	(~((1UL<<MPX_BD_ENTRY_TAIL)-1))
 
 #define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))

* [tip:x86/fpu] x86/mpx: Restrict the mmap() size check to bounds tables
  2015-06-07 18:37 ` [PATCH 06/19] x86, mpx: Restrict mmap size check to bounds tables Dave Hansen
@ 2015-06-09 12:32   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave, linux-kernel, dave.hansen, tglx, mingo, akpm, torvalds,
	hpa, peterz

Commit-ID:  eb099e5bc5457c42043e57cf0f4ce574669b9697
Gitweb:     http://git.kernel.org/tip/eb099e5bc5457c42043e57cf0f4ce574669b9697
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:02 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:31 +0200

x86/mpx: Restrict the mmap() size check to bounds tables

The comment and code here are confusing.  We do not currently
allocate the bounds directory in the kernel.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183702.222CEC2A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/mpx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 47e4a856..d6e02f3 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -46,8 +46,8 @@ static unsigned long mpx_mmap(unsigned long len)
 	vm_flags_t vm_flags;
 	struct vm_area_struct *vma;
 
-	/* Only bounds table and bounds directory can be allocated here */
-	if (len != MPX_BD_SIZE_BYTES && len != MPX_BT_SIZE_BYTES)
+	/* Only bounds table can be allocated here */
+	if (len != MPX_BT_SIZE_BYTES)
 		return -EINVAL;
 
 	down_write(&mm->mmap_sem);

* [tip:x86/fpu] x86/mpx: Introduce a boot-time disable flag
  2015-06-07 18:37 ` [PATCH 07/19] x86, mpx: boot-time disable Dave Hansen
@ 2015-06-09 12:32   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, akpm, tglx, dave, dave.hansen, peterz, hpa,
	torvalds, mingo

Commit-ID:  8c3641e957a948f41f0174290096ed7a3b95e703
Gitweb:     http://git.kernel.org/tip/8c3641e957a948f41f0174290096ed7a3b95e703
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:02 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:31 +0200

x86/mpx: Introduce a boot-time disable flag

MPX has the _potential_ to cause some issues.  Say part of your
init system tried to protect one of its components from buffer
overflows with MPX.  If there were a false positive, it's
possible that MPX could keep a system from booting.

MPX could also potentially cause performance issues since it is
present in hot paths like the unmap path.

Allow it to be disabled at boot time.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150607183702.2E8B77AB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/kernel-parameters.txt |  4 ++++
 arch/x86/kernel/cpu/common.c        | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 61ab162..8b7e5c3 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -937,6 +937,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			Enable debug messages at boot time.  See
 			Documentation/dynamic-debug-howto.txt for details.
 
+	nompx		[X86] Disables Intel Memory Protection Extensions.
+			See Documentation/x86/intel_mpx.txt for more
+			information about the feature.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 401ccb03..3956858 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -144,6 +144,22 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
 } };
 EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
 
+static int __init x86_mpx_setup(char *s)
+{
+	/* require an exact match without trailing characters */
+	if (strlen(s))
+		return 0;
+
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_MPX))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_MPX);
+	pr_info("nompx: Intel Memory Protection Extensions (MPX) disabled\n");
+	return 1;
+}
+__setup("nompx", x86_mpx_setup);
+
 #ifdef CONFIG_X86_32
 static int cachesize_override = -1;
 static int disable_x86_serial_nr = 1;

* [tip:x86/fpu] x86/mpx: Trace #BR exceptions
  2015-06-07 18:37 ` [PATCH 08/19] x86, mpx: trace #BR exceptions Dave Hansen
@ 2015-06-09 12:33   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave.hansen, peterz, hpa, akpm, mingo, dave, linux-kernel, tglx,
	torvalds

Commit-ID:  e7126cf5f10aef1555cb99eddb7efff41bdf9566
Gitweb:     http://git.kernel.org/tip/e7126cf5f10aef1555cb99eddb7efff41bdf9566
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:03 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:31 +0200

x86/mpx: Trace #BR exceptions

This is the first in a series of MPX tracing patches.
I've found these extremely useful in the process of
debugging applications and the kernel code itself.

This tracepoint hooks into the bounds (#BR) exception
very early and allows capturing the key registers which
would influence how the exception is handled.

Note that bndcfgu/bndstatus are technically still
64-bit registers even in 32-bit mode.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183703.5FE2619A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/trace/mpx.h | 50 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c          |  2 ++
 arch/x86/mm/mpx.c                |  3 +++
 3 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/trace/mpx.h b/arch/x86/include/asm/trace/mpx.h
new file mode 100644
index 0000000..5c03ec8
--- /dev/null
+++ b/arch/x86/include/asm/trace/mpx.h
@@ -0,0 +1,50 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mpx
+
+#if !defined(_TRACE_MPX_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MPX_H
+
+#include <linux/tracepoint.h>
+
+#ifdef CONFIG_X86_INTEL_MPX
+
+TRACE_EVENT(bounds_exception_mpx,
+
+	TP_PROTO(const struct bndcsr *bndcsr),
+	TP_ARGS(bndcsr),
+
+	TP_STRUCT__entry(
+		__field(u64, bndcfgu)
+		__field(u64, bndstatus)
+	),
+
+	TP_fast_assign(
+		/* need to get rid of the 'const' on bndcsr */
+		__entry->bndcfgu   = (u64)bndcsr->bndcfgu;
+		__entry->bndstatus = (u64)bndcsr->bndstatus;
+	),
+
+	TP_printk("bndcfgu:0x%llx bndstatus:0x%llx",
+		__entry->bndcfgu,
+		__entry->bndstatus)
+);
+
+#else
+
+/*
+ * This gets used outside of MPX-specific code, so we need a stub.
+ */
+static inline void trace_bounds_exception_mpx(const struct bndcsr *bndcsr)
+{
+}
+
+#endif /* CONFIG_X86_INTEL_MPX */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE mpx
+#endif /* _TRACE_MPX_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cffff66..36cb15b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -60,6 +60,7 @@
 #include <asm/mach_traps.h>
 #include <asm/alternative.h>
 #include <asm/fpu/xstate.h>
+#include <asm/trace/mpx.h>
 #include <asm/mpx.h>
 
 #ifdef CONFIG_X86_64
@@ -399,6 +400,7 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
 	if (!bndcsr)
 		goto exit_trap;
 
+	trace_bounds_exception_mpx(bndcsr);
 	/*
 	 * The error code field of the BNDSTATUS register communicates status
 	 * information of a bound range exception #BR or operation involving
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index d6e02f3..1fef52c 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -17,6 +17,9 @@
 #include <asm/processor.h>
 #include <asm/fpu/internal.h>
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/mpx.h>
+
 static const char *mpx_mapping_name(struct vm_area_struct *vma)
 {
 	return "[mpx]";

* [tip:x86/fpu] x86/mpx: Trace entry to bounds exception paths
  2015-06-07 18:37 ` [PATCH 09/19] x86, mpx: trace entry to bounds exception paths Dave Hansen
@ 2015-06-09 12:33   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, hpa, peterz, linux-kernel, akpm, dave.hansen,
	torvalds, dave

Commit-ID:  97efebf1bc30a80122af3295ebdb726dbc040ca6
Gitweb:     http://git.kernel.org/tip/97efebf1bc30a80122af3295ebdb726dbc040ca6
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:03 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:32 +0200

x86/mpx: Trace entry to bounds exception paths

There are two basic things that can happen as the result of
a bounds exception (#BR):

	1. We allocate a new bounds table
	2. We pass up a bounds exception to userspace.

This patch adds a trace point for the case where we are
passing the exception up to userspace with a signal.

We are also explicit that we're printing out the inverse of
the 'upper' that we encounter.  If you want to filter, for
instance, you need to ~ the value first.  The reason we do
this is because of how 'upper' is stored in the bounds table.

If a pointer's range is:

	0x1000 -> 0x2000

it is stored in the bounds table as (32-bits here for brevity):

	lower: 0x00001000
	upper: 0xffffdfff

That is so that an all 0's entry:

	lower: 0x00000000
	upper: 0x00000000

corresponds to the "init" bounds which store a *range* of:

	0x00000000 -> 0xffffffff

That is, by far, the common case, and that lets us use the
zero page, or deduplicate the memory, etc... The 'upper'
stored in the table is gibberish to print by itself, so we
print ~upper to get the *actual*, logical, human-readable
value.
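
As a purely illustrative sketch of the encoding described above (the
hardware performs the encoding; these helpers are hypothetical and only
restate the 32-bit worked example):

	/*
	 * Hypothetical helpers, illustration only (32-bit values to match
	 * the example above; the real registers are 64-bit).  For the
	 * 0x1000 -> 0x2000 range: ~0x2000 == 0xffffdfff is what gets
	 * stored, and an all-zero stored 'upper' decodes back to
	 * 0xffffffff -- the full "init" range.
	 */
	static inline u32 mpx_store_upper(u32 upper)   { return ~upper; }
	static inline u32 mpx_decode_upper(u32 stored) { return ~stored; }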

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183703.027BB9B0@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/trace/mpx.h | 34 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/mpx.c                |  1 +
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/trace/mpx.h b/arch/x86/include/asm/trace/mpx.h
index 5c03ec8..5c3af06 100644
--- a/arch/x86/include/asm/trace/mpx.h
+++ b/arch/x86/include/asm/trace/mpx.h
@@ -8,6 +8,40 @@
 
 #ifdef CONFIG_X86_INTEL_MPX
 
+TRACE_EVENT(mpx_bounds_register_exception,
+
+	TP_PROTO(void *addr_referenced,
+		 const struct bndreg *bndreg),
+	TP_ARGS(addr_referenced, bndreg),
+
+	TP_STRUCT__entry(
+		__field(void *, addr_referenced)
+		__field(u64, lower_bound)
+		__field(u64, upper_bound)
+	),
+
+	TP_fast_assign(
+		__entry->addr_referenced = addr_referenced;
+		__entry->lower_bound = bndreg->lower_bound;
+		__entry->upper_bound = bndreg->upper_bound;
+	),
+	/*
+	 * Note that we are printing out the '~' of the upper
+	 * bounds register here.  It is actually stored in its
+	 * one's complement form so that its 'init' state
+	 * corresponds to all 0's.  But, that looks like
+	 * gibberish when printed out, so print out the 1's
+	 * complement instead of the actual value here.  Note
+	 * though that you still need to specify filters for the
+	 * actual value, not the displayed one.
+	 */
+	TP_printk("address referenced: 0x%p bounds: lower: 0x%llx ~upper: 0x%llx",
+		__entry->addr_referenced,
+		__entry->lower_bound,
+		~__entry->upper_bound
+	)
+);
+
 TRACE_EVENT(bounds_exception_mpx,
 
 	TP_PROTO(const struct bndcsr *bndcsr),
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1fef52c..75e5d70 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -335,6 +335,7 @@ siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 		err = -EINVAL;
 		goto err_out;
 	}
+	trace_mpx_bounds_register_exception(info->si_addr, bndreg);
 	return info;
 err_out:
 	/* info might be NULL, but kfree() handles that */

* [tip:x86/fpu] x86/mpx: Trace the attempts to find bounds tables
  2015-06-07 18:37 ` [PATCH 10/19] x86, mpx: Trace the attempts to find bounds tables Dave Hansen
@ 2015-06-09 12:33   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, dave.hansen, linux-kernel, dave, tglx, akpm,
	torvalds, hpa

Commit-ID:  2a1dcb1f796ad37028df37d96fc7c5b6b1705a43
Gitweb:     http://git.kernel.org/tip/2a1dcb1f796ad37028df37d96fc7c5b6b1705a43
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:03 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:32 +0200

x86/mpx: Trace the attempts to find bounds tables

There are two different events being traced here.  They do
similar things, so they share a trace "EVENT_CLASS" and are
presented together.

1. Trace when MPX is zapping pages "mpx_unmap_zap":

	When MPX can not free an entire bounds table, it will
	instead try to zap unused parts of a bounds table to free
	the backing memory.  This decreases RSS (resident set
	size) without decreasing the virtual space allocated
	for bounds tables.

2. Trace attempts to find bounds tables "mpx_unmap_search":

	This event traces any time we go looking to unmap a
	bounds table for a given virtual address range.  This is
	useful to ensure that the kernel actually "tried" to free
	a bounds table versus times it succeeded in finding one.

	It might try and fail if it realized that a table was
	shared with an adjacent VMA which is not being unmapped.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183703.B9D2468B@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/trace/mpx.h | 32 ++++++++++++++++++++++++++++++++
 arch/x86/mm/mpx.c                |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/trace/mpx.h b/arch/x86/include/asm/trace/mpx.h
index 5c3af06..c13c6fa 100644
--- a/arch/x86/include/asm/trace/mpx.h
+++ b/arch/x86/include/asm/trace/mpx.h
@@ -63,6 +63,38 @@ TRACE_EVENT(bounds_exception_mpx,
 		__entry->bndstatus)
 );
 
+DECLARE_EVENT_CLASS(mpx_range_trace,
+
+	TP_PROTO(unsigned long start,
+		 unsigned long end),
+	TP_ARGS(start, end),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->end   = end;
+	),
+
+	TP_printk("[0x%p:0x%p]",
+		(void *)__entry->start,
+		(void *)__entry->end
+	)
+);
+
+DEFINE_EVENT(mpx_range_trace, mpx_unmap_zap,
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end)
+);
+
+DEFINE_EVENT(mpx_range_trace, mpx_unmap_search,
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end)
+);
+
 #else
 
 /*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 75e5d70..55729ee 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -668,6 +668,7 @@ static int zap_bt_entries(struct mm_struct *mm,
 
 		len = min(vma->vm_end, end) - addr;
 		zap_page_range(vma, addr, len, NULL);
+		trace_mpx_unmap_zap(addr, addr+len);
 
 		vma = vma->vm_next;
 		addr = vma->vm_start;
@@ -840,6 +841,7 @@ static int mpx_unmap_tables(struct mm_struct *mm,
 	long __user *bd_entry, *bde_start, *bde_end;
 	unsigned long bt_addr;
 
+	trace_mpx_unmap_search(start, end);
 	/*
 	 * "Edge" bounds tables are those which are being used by the region
 	 * (start -> end), but that may be shared with adjacent areas.  If they

* [tip:x86/fpu] x86/mpx: Trace allocation of new bounds tables
  2015-06-07 18:37 ` [PATCH 11/19] x86, mpx: trace allocation of new bounds tables Dave Hansen
@ 2015-06-09 12:33   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, akpm, mingo, hpa, peterz, linux-kernel, dave.hansen,
	torvalds, dave

Commit-ID:  cd4996dce18b619bd7b3acf75c91f49c77f05a97
Gitweb:     http://git.kernel.org/tip/cd4996dce18b619bd7b3acf75c91f49c77f05a97
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:04 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:32 +0200

x86/mpx: Trace allocation of new bounds tables

Bounds tables are a significant consumer of memory.  It is
important to know when they are being allocated.  Add a trace
point to trace whenever an allocation occurs and also its
virtual address.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.EC23A93E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/trace/mpx.h | 16 ++++++++++++++++
 arch/x86/mm/mpx.c                |  1 +
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/trace/mpx.h b/arch/x86/include/asm/trace/mpx.h
index c13c6fa..173dd3b 100644
--- a/arch/x86/include/asm/trace/mpx.h
+++ b/arch/x86/include/asm/trace/mpx.h
@@ -95,6 +95,22 @@ DEFINE_EVENT(mpx_range_trace, mpx_unmap_search,
 	TP_ARGS(start, end)
 );
 
+TRACE_EVENT(mpx_new_bounds_table,
+
+	TP_PROTO(unsigned long table_vaddr),
+	TP_ARGS(table_vaddr),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, table_vaddr)
+	),
+
+	TP_fast_assign(
+		__entry->table_vaddr = table_vaddr;
+	),
+
+	TP_printk("table vaddr:%p", (void *)__entry->table_vaddr)
+);
+
 #else
 
 /*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 55729ee..c17fd27 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -483,6 +483,7 @@ static int allocate_bt(long __user *bd_entry)
 		ret = -EINVAL;
 		goto out_unmap;
 	}
+	trace_mpx_new_bounds_table(bt_addr);
 	return 0;
 out_unmap:
 	vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);

* [tip:x86/fpu] x86: Make is_64bit_mm() widely available
  2015-06-07 18:37 ` [PATCH 12/19] x86: make is_64bit_mm() widely available Dave Hansen
@ 2015-06-09 12:34   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, dave, mingo, akpm, dave.hansen, oleg, linux-kernel, hpa,
	torvalds, peterz

Commit-ID:  b0e9b09b3bd64e67bba862e238d3757b2482b6de
Gitweb:     http://git.kernel.org/tip/b0e9b09b3bd64e67bba862e238d3757b2482b6de
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:04 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:32 +0200

x86: Make is_64bit_mm() widely available

The uprobes code has a nice helper, is_64bit_mm(), that consults
both the runtime and compile-time flags for 32-bit support.
Instead of reinventing the wheel, pull it into an x86 header so
we can use it for MPX.

I prefer passing the 'mm' around rather than using
test_thread_flag(TIF_IA32) because it makes it explicit where
the context is coming from.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.F0209999@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 13 +++++++++++++
 arch/x86/kernel/uprobes.c          | 10 +---------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 883f6b9..5e8daee 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -142,6 +142,19 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
 	paravirt_arch_exit_mmap(mm);
 }
 
+#ifdef CONFIG_X86_64
+static inline bool is_64bit_mm(struct mm_struct *mm)
+{
+	return	!config_enabled(CONFIG_IA32_EMULATION) ||
+		!(mm->context.ia32_compat == TIF_IA32);
+}
+#else
+static inline bool is_64bit_mm(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
 static inline void arch_bprm_mm_init(struct mm_struct *mm,
 		struct vm_area_struct *vma)
 {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 0b81ad6..6647624 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -29,6 +29,7 @@
 #include <linux/kdebug.h>
 #include <asm/processor.h>
 #include <asm/insn.h>
+#include <asm/mmu_context.h>
 
 /* Post-execution fixups. */
 
@@ -312,11 +313,6 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
-static inline bool is_64bit_mm(struct mm_struct *mm)
-{
-	return	!config_enabled(CONFIG_IA32_EMULATION) ||
-		!(mm->context.ia32_compat == TIF_IA32);
-}
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
@@ -497,10 +493,6 @@ static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
 	}
 }
 #else /* 32-bit: */
-static inline bool is_64bit_mm(struct mm_struct *mm)
-{
-	return false;
-}
 /*
  * No RIP-relative addressing on 32-bit
  */

* [tip:x86/fpu] x86/mpx: Add temporary variable to reduce masking
  2015-06-07 18:37 ` [PATCH 13/19] x86, mpx: Add temporary variable to reduce masking Dave Hansen
@ 2015-06-09 12:34   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, tglx, dave.hansen, dave, torvalds, akpm,
	hpa, peterz

Commit-ID:  a1149fc83a1f97612e72ec24a0bdbabff7b85e77
Gitweb:     http://git.kernel.org/tip/a1149fc83a1f97612e72ec24a0bdbabff7b85e77
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:04 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:33 +0200

x86/mpx: Add temporary variable to reduce masking

When we allocate a bounds table, we call mmap(), then add a
"valid" bit to the value before storing it in to the bounds
directory.

If we fail along the way, we go and mask that valid bit
_back_ out.  That seems a little silly, and this patch makes it
much more clear when we have a plain address versus an
actual table _entry_.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.3D69D5F4@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/mpx.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index c17fd27..4f7fb7c 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -429,6 +429,7 @@ static int allocate_bt(long __user *bd_entry)
 	unsigned long expected_old_val = 0;
 	unsigned long actual_old_val = 0;
 	unsigned long bt_addr;
+	unsigned long bd_new_entry;
 	int ret = 0;
 
 	/*
@@ -441,7 +442,7 @@ static int allocate_bt(long __user *bd_entry)
 	/*
 	 * Set the valid flag (kinda like _PAGE_PRESENT in a pte)
 	 */
-	bt_addr = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+	bd_new_entry = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
 
 	/*
 	 * Go poke the address of the new bounds table in to the
@@ -455,7 +456,7 @@ static int allocate_bt(long __user *bd_entry)
 	 * of the MPX code that have to pagefault_disable().
 	 */
 	ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
-					   expected_old_val, bt_addr);
+					   expected_old_val, bd_new_entry);
 	if (ret)
 		goto out_unmap;
 
@@ -486,7 +487,7 @@ static int allocate_bt(long __user *bd_entry)
 	trace_mpx_new_bounds_table(bt_addr);
 	return 0;
 out_unmap:
-	vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
+	vm_munmap(bt_addr, MPX_BT_SIZE_BYTES);
 	return ret;
 }
 

* [tip:x86/fpu] x86/mpx: Introduce new 'directory entry' to 'addr' helper function
  2015-06-07 18:37 ` [PATCH 14/19] x86, mpx: new directory entry to addr helper Dave Hansen
@ 2015-06-09 12:34   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, tglx, dave.hansen, mingo, akpm, peterz, linux-kernel, dave,
	torvalds

Commit-ID:  54587653904c552c56b9dec153d7a89063394b09
Gitweb:     http://git.kernel.org/tip/54587653904c552c56b9dec153d7a89063394b09
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:04 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:33 +0200

x86/mpx: Introduce new 'directory entry' to 'addr' helper function

Currently, to get from a bounds directory entry to the virtual
address of a bounds table, we simply mask off a few low bits.
However, the set of bits we mask off is different for 32-bit and
64-bit binaries.

This breaks the operation out into a helper function and also
adds a temporary variable to store the result until we are
sure we are returning one.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.007686CE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mpx.h |  1 -
 arch/x86/mm/mpx.c          | 41 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 871e5e5..99d374e 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -45,7 +45,6 @@
 #define MPX_BNDSTA_TAIL		2
 #define MPX_BNDCFG_TAIL		12
 #define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
-#define MPX_BT_ADDR_MASK	(~((1UL<<MPX_BD_ENTRY_TAIL)-1))
 
 #define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BNDSTA_ERROR_CODE	0x3
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 4f7fb7c..8cc7934 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -576,29 +576,55 @@ static int mpx_resolve_fault(long __user *addr, int write)
 	return 0;
 }
 
+static unsigned long mpx_bd_entry_to_bt_addr(struct mm_struct *mm,
+					     unsigned long bd_entry)
+{
+	unsigned long bt_addr = bd_entry;
+	int align_to_bytes;
+	/*
+	 * Bit 0 in a bt_entry is always the valid bit.
+	 */
+	bt_addr &= ~MPX_BD_ENTRY_VALID_FLAG;
+	/*
+	 * Tables are naturally aligned at 8-byte boundaries
+	 * on 64-bit and 4-byte boundaries on 32-bit.  The
+	 * documentation makes it appear that the low bits
+	 * are ignored by the hardware, so we do the same.
+	 */
+	if (is_64bit_mm(mm))
+		align_to_bytes = 8;
+	else
+		align_to_bytes = 4;
+	bt_addr &= ~(align_to_bytes-1);
+	return bt_addr;
+}
+
 /*
  * Get the base of bounds tables pointed by specific bounds
  * directory entry.
  */
 static int get_bt_addr(struct mm_struct *mm,
-			long __user *bd_entry, unsigned long *bt_addr)
+			long __user *bd_entry_ptr,
+			unsigned long *bt_addr_result)
 {
 	int ret;
 	int valid_bit;
+	unsigned long bd_entry;
+	unsigned long bt_addr;
 
-	if (!access_ok(VERIFY_READ, (bd_entry), sizeof(*bd_entry)))
+	if (!access_ok(VERIFY_READ, (bd_entry_ptr), sizeof(*bd_entry_ptr)))
 		return -EFAULT;
 
 	while (1) {
 		int need_write = 0;
 
 		pagefault_disable();
-		ret = get_user(*bt_addr, bd_entry);
+		ret = get_user(bd_entry, bd_entry_ptr);
 		pagefault_enable();
 		if (!ret)
 			break;
 		if (ret == -EFAULT)
-			ret = mpx_resolve_fault(bd_entry, need_write);
+			ret = mpx_resolve_fault(bd_entry_ptr, need_write);
 		/*
 		 * If we could not resolve the fault, consider it
 		 * userspace's fault and error out.
@@ -607,8 +633,8 @@ static int get_bt_addr(struct mm_struct *mm,
 			return ret;
 	}
 
-	valid_bit = *bt_addr & MPX_BD_ENTRY_VALID_FLAG;
-	*bt_addr &= MPX_BT_ADDR_MASK;
+	valid_bit = bd_entry & MPX_BD_ENTRY_VALID_FLAG;
+	bt_addr = mpx_bd_entry_to_bt_addr(mm, bd_entry);
 
 	/*
 	 * When the kernel is managing bounds tables, a bounds directory
@@ -617,7 +643,7 @@ static int get_bt_addr(struct mm_struct *mm,
 	 * data in the address field, we know something is wrong. This
 	 * -EINVAL return will cause a SIGSEGV.
 	 */
-	if (!valid_bit && *bt_addr)
+	if (!valid_bit && bt_addr)
 		return -EINVAL;
 	/*
 	 * Do we have an completely zeroed bt entry?  That is OK.  It
@@ -628,6 +654,7 @@ static int get_bt_addr(struct mm_struct *mm,
 	if (!valid_bit)
 		return -ENOENT;
 
+	*bt_addr_result = bt_addr;
 	return 0;
 }
 

* [tip:x86/fpu] x86/mpx: Use 32-bit-only cmpxchg() for 32-bit apps
  2015-06-07 18:37 ` [PATCH 15/19] x86, mpx: do 32-bit-only cmpxchg for 32-bit apps Dave Hansen
@ 2015-06-09 12:35   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, peterz, dave, mingo, dave.hansen, hpa, akpm,
	linux-kernel, tglx

Commit-ID:  6ac52bb4913eadfa327138b91aab5d37234a2c3b
Gitweb:     http://git.kernel.org/tip/6ac52bb4913eadfa327138b91aab5d37234a2c3b
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:05 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:33 +0200

x86/mpx: Use 32-bit-only cmpxchg() for 32-bit apps

user_atomic_cmpxchg_inatomic() actually looks at sizeof(*ptr) to
figure out how many bytes to copy.  If we run it on a 64-bit
kernel with a 64-bit pointer, it will copy a 64-bit bounds
directory entry.  That's fine, except when we have 32-bit
programs with 32-bit bounds directory entries and we only *want*
32-bits.

This patch breaks the cmpxchg() operation out into its own
function and performs the 32-bit type swizzling in there.

Note, the "64-bit" version of this code _would_ work on a
32-bit-only kernel.  The issue this patch addresses is only for
when the kernel's 'long' is mismatched from the size of the
bounds directory entry of the process we are working on.

The new helper modifies 'actual_old_val' or returns an error.
But gcc doesn't know this, so it warns about 'actual_old_val'
being unused.  Shut it up with an uninitialized_var().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183705.672B115E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/mpx.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 8cc7934..294ea20 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -419,6 +419,35 @@ int mpx_disable_management(void)
 	return 0;
 }
 
+static int mpx_cmpxchg_bd_entry(struct mm_struct *mm,
+		unsigned long *curval,
+		unsigned long __user *addr,
+		unsigned long old_val, unsigned long new_val)
+{
+	int ret;
+	/*
+	 * user_atomic_cmpxchg_inatomic() actually uses sizeof()
+	 * the pointer that we pass to it to figure out how much
+	 * data to cmpxchg.  We have to be careful here not to
+	 * pass a pointer to a 64-bit data type when we only want
+	 * a 32-bit copy.
+	 */
+	if (is_64bit_mm(mm)) {
+		ret = user_atomic_cmpxchg_inatomic(curval,
+				addr, old_val, new_val);
+	} else {
+		u32 uninitialized_var(curval_32);
+		u32 old_val_32 = old_val;
+		u32 new_val_32 = new_val;
+		u32 __user *addr_32 = (u32 __user *)addr;
+
+		ret = user_atomic_cmpxchg_inatomic(&curval_32,
+				addr_32, old_val_32, new_val_32);
+		*curval = curval_32;
+	}
+	return ret;
+}
+
 /*
  * With 32-bit mode, MPX_BT_SIZE_BYTES is 4MB, and the size of each
  * bounds table is 16KB. With 64-bit mode, MPX_BT_SIZE_BYTES is 2GB,
@@ -426,6 +455,7 @@ int mpx_disable_management(void)
  */
 static int allocate_bt(long __user *bd_entry)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned long expected_old_val = 0;
 	unsigned long actual_old_val = 0;
 	unsigned long bt_addr;
@@ -455,8 +485,8 @@ static int allocate_bt(long __user *bd_entry)
 	 * mmap_sem at this point, unlike some of the other part
 	 * of the MPX code that have to pagefault_disable().
 	 */
-	ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
-					   expected_old_val, bd_new_entry);
+	ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,	bd_entry,
+				   expected_old_val, bd_new_entry);
 	if (ret)
 		goto out_unmap;
 
@@ -710,15 +740,16 @@ static int unmap_single_bt(struct mm_struct *mm,
 		long __user *bd_entry, unsigned long bt_addr)
 {
 	unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
-	unsigned long actual_old_val = 0;
+	unsigned long uninitialized_var(actual_old_val);
 	int ret;
 
 	while (1) {
 		int need_write = 1;
+		unsigned long cleared_bd_entry = 0;
 
 		pagefault_disable();
-		ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
-						   expected_old_val, 0);
+		ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,
+				bd_entry, expected_old_val, cleared_bd_entry);
 		pagefault_enable();
 		if (!ret)
 			break;

* [tip:x86/fpu] x86/mpx: Support 32-bit binaries on 64-bit kernels
  2015-06-07 18:37 ` [PATCH 16/19] x86, mpx: support 32-bit binaries on 64-bit kernel Dave Hansen
@ 2015-06-09 12:35   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, dave.hansen, akpm, linux-kernel, dave, tglx, mingo,
	qiaowei.ren, peterz, torvalds

Commit-ID:  613fcb7d3c79ec25b5913a6aa974c9047c31e68c
Gitweb:     http://git.kernel.org/tip/613fcb7d3c79ec25b5913a6aa974c9047c31e68c
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:05 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:34 +0200

x86/mpx: Support 32-bit binaries on 64-bit kernels

Right now, the kernel chooses between the 64-bit and 32-bit MPX
table formats at compile time.  This patch adds support for 32-bit
binaries on 64-bit kernels when ia32 emulation is enabled.

We essentially choose which set of table sizes to use when doing
arithmetic for the bounds table calculations.

This also uses a different approach for calculating the table
indexes than before.  I think the new one makes it much more
clear what is going on, and allows us to share more code between
the 32-bit and 64-bit cases.
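
As an illustration only (not part of the patch), the directory-entry
arithmetic that the new helpers perform can be modelled in a few lines
of standalone C.  The 48-bit virtual-address-space figure is an
assumption matching boot_cpu_data.x86_virt_bits on current 64-bit
parts; the constants mirror the 64-bit ones added below:

  #include <stdio.h>

  /* 64-bit constants from the patch: 2GB directory, 8-byte entries. */
  #define BD_NR_ENTRIES_64	(1UL << 28)
  #define BD_ENTRY_BYTES_64	8UL
  #define VIRT_BITS		48	/* assumed x86_virt_bits */

  /* Byte offset into the bounds directory for a 64-bit address. */
  static unsigned long bd_entry_offset_64(unsigned long addr)
  {
  	/* Each entry covers 2^48 / 2^28 = 1MB of virtual space. */
  	unsigned long entry_virt_space = (1UL << VIRT_BITS) / BD_NR_ENTRIES_64;

  	addr &= (1UL << VIRT_BITS) - 1;	/* fold out the canonical hole */
  	return (addr / entry_virt_space) * BD_ENTRY_BYTES_64;
  }

  int main(void)
  {
  	/* Addresses 1MB apart land in adjacent 8-byte directory slots. */
  	printf("%#lx\n", bd_entry_offset_64(0x400000UL));	/* 0x20 */
  	printf("%#lx\n", bd_entry_offset_64(0x500000UL));	/* 0x28 */
  	return 0;
  }

The 32-bit path in the patch follows the same pattern with the *_32
constants, which is what lets the two modes share the lookup code.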

Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183705.E01F21E2@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mpx.h |  62 ++++++++---------
 arch/x86/mm/mpx.c          | 170 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 179 insertions(+), 53 deletions(-)

diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 99d374e..7a35495 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -13,49 +13,47 @@
 #define MPX_BNDCFG_ENABLE_FLAG	0x1
 #define MPX_BD_ENTRY_VALID_FLAG	0x1
 
-#ifdef CONFIG_X86_64
-
-/* upper 28 bits [47:20] of the virtual address in 64-bit used to
- * index into bounds directory (BD).
- */
-#define MPX_BD_ENTRY_OFFSET	28
-#define MPX_BD_ENTRY_SHIFT	3
-/* bits [19:3] of the virtual address in 64-bit used to index into
- * bounds table (BT).
+/*
+ * The upper 28 bits [47:20] of the virtual address in 64-bit
+ * are used to index into bounds directory (BD).
+ *
+ * The directory is 2G (2^31) in size, and with 8-byte entries
+ * it has 2^28 entries.
  */
-#define MPX_BT_ENTRY_OFFSET	17
-#define MPX_BT_ENTRY_SHIFT	5
-#define MPX_IGN_BITS		3
-#define MPX_BD_ENTRY_TAIL	3
+#define MPX_BD_SIZE_BYTES_64	(1UL<<31)
+#define MPX_BD_ENTRY_BYTES_64	8
+#define MPX_BD_NR_ENTRIES_64	(MPX_BD_SIZE_BYTES_64/MPX_BD_ENTRY_BYTES_64)
 
-#else
-
-#define MPX_BD_ENTRY_OFFSET	20
-#define MPX_BD_ENTRY_SHIFT	2
-#define MPX_BT_ENTRY_OFFSET	10
-#define MPX_BT_ENTRY_SHIFT	4
-#define MPX_IGN_BITS		2
-#define MPX_BD_ENTRY_TAIL	2
+/*
+ * The 32-bit directory is 4MB (2^22) in size, and with 4-byte
+ * entries it has 2^20 entries.
+ */
+#define MPX_BD_SIZE_BYTES_32	(1UL<<22)
+#define MPX_BD_ENTRY_BYTES_32	4
+#define MPX_BD_NR_ENTRIES_32	(MPX_BD_SIZE_BYTES_32/MPX_BD_ENTRY_BYTES_32)
 
-#endif
+/*
+ * A 64-bit table is 4MB total in size, and an entry is
+ * 4 64-bit pointers in size.
+ */
+#define MPX_BT_SIZE_BYTES_64	(1UL<<22)
+#define MPX_BT_ENTRY_BYTES_64	32
+#define MPX_BT_NR_ENTRIES_64	(MPX_BT_SIZE_BYTES_64/MPX_BT_ENTRY_BYTES_64)
 
-#define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
-#define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
+/*
+ * A 32-bit table is 16kB total in size, and an entry is
+ * 4 32-bit pointers in size.
+ */
+#define MPX_BT_SIZE_BYTES_32	(1UL<<14)
+#define MPX_BT_ENTRY_BYTES_32	16
+#define MPX_BT_NR_ENTRIES_32	(MPX_BT_SIZE_BYTES_32/MPX_BT_ENTRY_BYTES_32)
 
 #define MPX_BNDSTA_TAIL		2
 #define MPX_BNDCFG_TAIL		12
 #define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
-
 #define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BNDSTA_ERROR_CODE	0x3
 
-#define MPX_BD_ENTRY_MASK	((1<<MPX_BD_ENTRY_OFFSET)-1)
-#define MPX_BT_ENTRY_MASK	((1<<MPX_BT_ENTRY_OFFSET)-1)
-#define MPX_GET_BD_ENTRY_OFFSET(addr)	((((addr)>>(MPX_BT_ENTRY_OFFSET+ \
-		MPX_IGN_BITS)) & MPX_BD_ENTRY_MASK) << MPX_BD_ENTRY_SHIFT)
-#define MPX_GET_BT_ENTRY_OFFSET(addr)	((((addr)>>MPX_IGN_BITS) & \
-		MPX_BT_ENTRY_MASK) << MPX_BT_ENTRY_SHIFT)
-
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs);
 int mpx_handle_bd_fault(void);
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 294ea20..e323ef6 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -34,6 +34,22 @@ static int is_mpx_vma(struct vm_area_struct *vma)
 	return (vma->vm_ops == &mpx_vma_ops);
 }
 
+static inline unsigned long mpx_bd_size_bytes(struct mm_struct *mm)
+{
+	if (is_64bit_mm(mm))
+		return MPX_BD_SIZE_BYTES_64;
+	else
+		return MPX_BD_SIZE_BYTES_32;
+}
+
+static inline unsigned long mpx_bt_size_bytes(struct mm_struct *mm)
+{
+	if (is_64bit_mm(mm))
+		return MPX_BT_SIZE_BYTES_64;
+	else
+		return MPX_BT_SIZE_BYTES_32;
+}
+
 /*
  * This is really a simplified "vm_mmap". it only handles MPX
  * bounds tables (the bounds directory is user-allocated).
@@ -50,7 +66,7 @@ static unsigned long mpx_mmap(unsigned long len)
 	struct vm_area_struct *vma;
 
 	/* Only bounds table can be allocated here */
-	if (len != MPX_BT_SIZE_BYTES)
+	if (len != mpx_bt_size_bytes(mm))
 		return -EINVAL;
 
 	down_write(&mm->mmap_sem);
@@ -449,13 +465,12 @@ static int mpx_cmpxchg_bd_entry(struct mm_struct *mm,
 }
 
 /*
- * With 32-bit mode, MPX_BT_SIZE_BYTES is 4MB, and the size of each
- * bounds table is 16KB. With 64-bit mode, MPX_BT_SIZE_BYTES is 2GB,
+ * With 32-bit mode, a bounds directory is 4MB, and the size of each
+ * bounds table is 16KB. With 64-bit mode, a bounds directory is 2GB,
  * and the size of each bounds table is 4MB.
  */
-static int allocate_bt(long __user *bd_entry)
+static int allocate_bt(struct mm_struct *mm, long __user *bd_entry)
 {
-	struct mm_struct *mm = current->mm;
 	unsigned long expected_old_val = 0;
 	unsigned long actual_old_val = 0;
 	unsigned long bt_addr;
@@ -466,7 +481,7 @@ static int allocate_bt(long __user *bd_entry)
 	 * Carve the virtual space out of userspace for the new
 	 * bounds table:
 	 */
-	bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
+	bt_addr = mpx_mmap(mpx_bt_size_bytes(mm));
 	if (IS_ERR((void *)bt_addr))
 		return PTR_ERR((void *)bt_addr);
 	/*
@@ -517,7 +532,7 @@ static int allocate_bt(long __user *bd_entry)
 	trace_mpx_new_bounds_table(bt_addr);
 	return 0;
 out_unmap:
-	vm_munmap(bt_addr, MPX_BT_SIZE_BYTES);
+	vm_munmap(bt_addr, mpx_bt_size_bytes(mm));
 	return ret;
 }
 
@@ -536,6 +551,7 @@ static int do_mpx_bt_fault(void)
 {
 	unsigned long bd_entry, bd_base;
 	const struct bndcsr *bndcsr;
+	struct mm_struct *mm = current->mm;
 
 	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
 	if (!bndcsr)
@@ -554,10 +570,10 @@ static int do_mpx_bt_fault(void)
 	 * the directory is.
 	 */
 	if ((bd_entry < bd_base) ||
-	    (bd_entry >= bd_base + MPX_BD_SIZE_BYTES))
+	    (bd_entry >= bd_base + mpx_bd_size_bytes(mm)))
 		return -EINVAL;
 
-	return allocate_bt((long __user *)bd_entry);
+	return allocate_bt(mm, (long __user *)bd_entry);
 }
 
 int mpx_handle_bd_fault(void)
@@ -789,7 +805,115 @@ static int unmap_single_bt(struct mm_struct *mm,
 	 * avoid recursion, do_munmap() will check whether it comes
 	 * from one bounds table through VM_MPX flag.
 	 */
-	return do_munmap(mm, bt_addr, MPX_BT_SIZE_BYTES);
+	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm));
+}
+
+static inline int bt_entry_size_bytes(struct mm_struct *mm)
+{
+	if (is_64bit_mm(mm))
+		return MPX_BT_ENTRY_BYTES_64;
+	else
+		return MPX_BT_ENTRY_BYTES_32;
+}
+
+/*
+ * Take a virtual address and turns it in to the offset in bytes
+ * inside of the bounds table where the bounds table entry
+ * controlling 'addr' can be found.
+ */
+static unsigned long mpx_get_bt_entry_offset_bytes(struct mm_struct *mm,
+		unsigned long addr)
+{
+	unsigned long bt_table_nr_entries;
+	unsigned long offset = addr;
+
+	if (is_64bit_mm(mm)) {
+		/* Bottom 3 bits are ignored on 64-bit */
+		offset >>= 3;
+		bt_table_nr_entries = MPX_BT_NR_ENTRIES_64;
+	} else {
+		/* Bottom 2 bits are ignored on 32-bit */
+		offset >>= 2;
+		bt_table_nr_entries = MPX_BT_NR_ENTRIES_32;
+	}
+	/*
+	 * We know the size of the table in to which we are
+	 * indexing, and we have eliminated all the low bits
+	 * which are ignored for indexing.
+	 *
+	 * Mask out all the high bits which we do not need
+	 * to index in to the table.  Note that the tables
+	 * are always powers of two so this gives us a proper
+	 * mask.
+	 */
+	offset &= (bt_table_nr_entries-1);
+	/*
+	 * We now have an entry offset in terms of *entries* in
+	 * the table.  We need to scale it back up to bytes.
+	 */
+	offset *= bt_entry_size_bytes(mm);
+	return offset;
+}
+
+/*
+ * How much virtual address space does a single bounds
+ * directory entry cover?
+ *
+ * Note, we need a long long because 4GB doesn't fit in
+ * to a long on 32-bit.
+ */
+static inline unsigned long bd_entry_virt_space(struct mm_struct *mm)
+{
+	unsigned long long virt_space = (1ULL << boot_cpu_data.x86_virt_bits);
+	if (is_64bit_mm(mm))
+		return virt_space / MPX_BD_NR_ENTRIES_64;
+	else
+		return virt_space / MPX_BD_NR_ENTRIES_32;
+}
+
+/*
+ * Return an offset in terms of bytes in to the bounds
+ * directory where the bounds directory entry for a given
+ * virtual address resides.
+ *
+ * This has to be in bytes because the directory entries
+ * are different sizes on 64/32 bit.
+ */
+static unsigned long mpx_get_bd_entry_offset(struct mm_struct *mm,
+		unsigned long addr)
+{
+	/*
+	 * There are several ways to derive the bd offsets.  We
+	 * use the following approach here:
+	 * 1. We know the size of the virtual address space
+	 * 2. We know the number of entries in a bounds table
+	 * 3. We know that each entry covers a fixed amount of
+	 *    virtual address space.
+	 * So, we can just divide the virtual address by the
+	 * virtual space used by one entry to determine which
+	 * entry "controls" the given virtual address.
+	 */
+	if (is_64bit_mm(mm)) {
+		int bd_entry_size = 8; /* 64-bit pointer */
+		/*
+		 * Take the 64-bit addressing hole in to account.
+		 */
+		addr &= ((1UL << boot_cpu_data.x86_virt_bits) - 1);
+		return (addr / bd_entry_virt_space(mm)) * bd_entry_size;
+	} else {
+		int bd_entry_size = 4; /* 32-bit pointer */
+		/*
+		 * 32-bit has no hole so this case needs no mask
+		 */
+		return (addr / bd_entry_virt_space(mm)) * bd_entry_size;
+	}
+	/*
+	 * The two return calls above are exact copies.  If we
+	 * pull out a single copy and put it in here, gcc won't
+	 * realize that we're doing a power-of-2 divide and use
+	 * shifts.  It uses a real divide.  If we put them up
+	 * there, it manages to figure it out (gcc 4.8.3).
+	 */
 }
 
 /*
@@ -803,6 +927,7 @@ static int unmap_shared_bt(struct mm_struct *mm,
 		unsigned long end, bool prev_shared, bool next_shared)
 {
 	unsigned long bt_addr;
+	unsigned long start_off, end_off;
 	int ret;
 
 	ret = get_bt_addr(mm, bd_entry, &bt_addr);
@@ -814,17 +939,20 @@ static int unmap_shared_bt(struct mm_struct *mm,
 	if (ret)
 		return ret;
 
+	start_off = mpx_get_bt_entry_offset_bytes(mm, start);
+	end_off   = mpx_get_bt_entry_offset_bytes(mm, end);
+
 	if (prev_shared && next_shared)
 		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(end));
+				bt_addr + start_off,
+				bt_addr + end_off);
 	else if (prev_shared)
 		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(start),
-				bt_addr+MPX_BT_SIZE_BYTES);
+				bt_addr + start_off,
+				bt_addr + mpx_bt_size_bytes(mm));
 	else if (next_shared)
 		ret = zap_bt_entries(mm, bt_addr, bt_addr,
-				bt_addr+MPX_GET_BT_ENTRY_OFFSET(end));
+				bt_addr + end_off);
 	else
 		ret = unmap_single_bt(mm, bd_entry, bt_addr);
 
@@ -845,8 +973,8 @@ static int unmap_edge_bts(struct mm_struct *mm,
 	struct vm_area_struct *prev, *next;
 	bool prev_shared = false, next_shared = false;
 
-	bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
-	bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
+	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
 
 	/*
 	 * Check whether bde_start and bde_end are shared with adjacent
@@ -858,10 +986,10 @@ static int unmap_edge_bts(struct mm_struct *mm,
 	 * in to 'next'.
 	 */
 	next = find_vma_prev(mm, start, &prev);
-	if (prev && (mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(prev->vm_end-1))
+	if (prev && (mm->bd_addr + mpx_get_bd_entry_offset(mm, prev->vm_end-1))
 			== bde_start)
 		prev_shared = true;
-	if (next && (mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(next->vm_start))
+	if (next && (mm->bd_addr + mpx_get_bd_entry_offset(mm, next->vm_start))
 			== bde_end)
 		next_shared = true;
 
@@ -927,8 +1055,8 @@ static int mpx_unmap_tables(struct mm_struct *mm,
 	 *   1. fully covered
 	 *   2. not at the edges of the mapping, even if full aligned
 	 */
-	bde_start = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(start);
-	bde_end = mm->bd_addr + MPX_GET_BD_ENTRY_OFFSET(end-1);
+	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
+	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
 	for (bd_entry = bde_start + 1; bd_entry < bde_end; bd_entry++) {
 		ret = get_bt_addr(mm, bd_entry, &bt_addr);
 		switch (ret) {

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [tip:x86/fpu] x86/mpx: Rewrite the unmap code
  2015-06-07 18:37 ` [PATCH 17/19] x86, mpx: rewrite unmap code Dave Hansen
@ 2015-06-09 12:35   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, mingo, akpm, dave.hansen, hpa, peterz,
	torvalds, dave

Commit-ID:  3ceaccdf92073d193f0bfbe24280dd736e3fed86
Gitweb:     http://git.kernel.org/tip/3ceaccdf92073d193f0bfbe24280dd736e3fed86
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:06 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:34 +0200

x86/mpx: Rewrite the unmap code

The MPX code needs to clear out bounds tables for memory which
is no longer in use.  We do this when a userspace mapping is
torn down (unmapped).

There are two modes:

  1. An entire bounds table becomes unused, and can be freed
     and its pointer removed from the bounds directory.  This
     happens either when a large mapping is torn down, or when
     a small mapping is torn down and it is the last mapping
     "covered" by a bounds table.

  2. Only part of a bounds table becomes unused, in which case
     we free the backing memory as if MADV_DONTNEED was called.

The old code was a spaghetti mess of "edge" bounds tables
where the edges were handled specially, even if we were
unmapping an entire one.  Non-edge bounds tables are always
fully unmapped, but use a different code path from the edge
ones.  The old code had a bug where it was unmapping too much
memory.  I worked on fixing it for two days and gave up.

I didn't write the original code.  I didn't particularly like
it, but it worked, so I left it.  After my debug session, I
realized it was undebuggable *and* buggy, so out it went.

I also wrote a new unmapping test program which uncovers bugs
pretty nicely.
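
To make the new structure easier to follow before reading the diff,
here is a small standalone sketch (an editor's illustration, not
kernel code) of the chunking idea: the unmapped range is walked one
bounds-table-sized piece at a time, and each piece is then either
freed entirely or only partially zapped.  The 1MB chunk size assumes
the 64-bit case with 48 virtual address bits, and handle_one() is a
simplified stand-in for try_unmap_single_bt() (the real decision also
looks at neighboring VMAs):

  #include <stdio.h>

  #define CHUNK		(1UL << 20)	/* 1MB per bounds table on 64-bit */
  #define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

  static void handle_one(unsigned long start, unsigned long end)
  {
  	/* Stand-in for try_unmap_single_bt(): whole table vs. partial zap. */
  	printf("one table: %#lx - %#lx%s\n", start, end,
  	       (end - start == CHUNK) ? " (entire)" : " (partial)");
  }

  int main(void)
  {
  	unsigned long start = 0x1ff000, end = 0x501000;
  	unsigned long one_start = start;

  	while (one_start < end) {
  		unsigned long next_start = ALIGN_UP(one_start + 1, CHUNK);
  		unsigned long one_end = (end < next_start) ? end : next_start;

  		handle_one(one_start, one_end);
  		one_start = next_start;
  	}
  	return 0;
  }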

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183706.DCAEC67D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/mpx.c | 411 ++++++++++++++++++++++--------------------------------
 1 file changed, 168 insertions(+), 243 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index e323ef6..18fcf73 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -704,110 +704,6 @@ static int get_bt_addr(struct mm_struct *mm,
 	return 0;
 }
 
-/*
- * Free the backing physical pages of bounds table 'bt_addr'.
- * Assume start...end is within that bounds table.
- */
-static int zap_bt_entries(struct mm_struct *mm,
-		unsigned long bt_addr,
-		unsigned long start, unsigned long end)
-{
-	struct vm_area_struct *vma;
-	unsigned long addr, len;
-
-	/*
-	 * Find the first overlapping vma. If vma->vm_start > start, there
-	 * will be a hole in the bounds table. This -EINVAL return will
-	 * cause a SIGSEGV.
-	 */
-	vma = find_vma(mm, start);
-	if (!vma || vma->vm_start > start)
-		return -EINVAL;
-
-	/*
-	 * A NUMA policy on a VM_MPX VMA could cause this bouds table to
-	 * be split. So we need to look across the entire 'start -> end'
-	 * range of this bounds table, find all of the VM_MPX VMAs, and
-	 * zap only those.
-	 */
-	addr = start;
-	while (vma && vma->vm_start < end) {
-		/*
-		 * We followed a bounds directory entry down
-		 * here.  If we find a non-MPX VMA, that's bad,
-		 * so stop immediately and return an error.  This
-		 * probably results in a SIGSEGV.
-		 */
-		if (!is_mpx_vma(vma))
-			return -EINVAL;
-
-		len = min(vma->vm_end, end) - addr;
-		zap_page_range(vma, addr, len, NULL);
-		trace_mpx_unmap_zap(addr, addr+len);
-
-		vma = vma->vm_next;
-		addr = vma->vm_start;
-	}
-
-	return 0;
-}
-
-static int unmap_single_bt(struct mm_struct *mm,
-		long __user *bd_entry, unsigned long bt_addr)
-{
-	unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
-	unsigned long uninitialized_var(actual_old_val);
-	int ret;
-
-	while (1) {
-		int need_write = 1;
-		unsigned long cleared_bd_entry = 0;
-
-		pagefault_disable();
-		ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,
-				bd_entry, expected_old_val, cleared_bd_entry);
-		pagefault_enable();
-		if (!ret)
-			break;
-		if (ret == -EFAULT)
-			ret = mpx_resolve_fault(bd_entry, need_write);
-		/*
-		 * If we could not resolve the fault, consider it
-		 * userspace's fault and error out.
-		 */
-		if (ret)
-			return ret;
-	}
-	/*
-	 * The cmpxchg was performed, check the results.
-	 */
-	if (actual_old_val != expected_old_val) {
-		/*
-		 * Someone else raced with us to unmap the table.
-		 * There was no bounds table pointed to by the
-		 * directory, so declare success.  Somebody freed
-		 * it.
-		 */
-		if (!actual_old_val)
-			return 0;
-		/*
-		 * Something messed with the bounds directory
-		 * entry.  We hold mmap_sem for read or write
-		 * here, so it could not be a _new_ bounds table
-		 * that someone just allocated.  Something is
-		 * wrong, so pass up the error and SIGSEGV.
-		 */
-		return -EINVAL;
-	}
-
-	/*
-	 * Note, we are likely being called under do_munmap() already. To
-	 * avoid recursion, do_munmap() will check whether it comes
-	 * from one bounds table through VM_MPX flag.
-	 */
-	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm));
-}
-
 static inline int bt_entry_size_bytes(struct mm_struct *mm)
 {
 	if (is_64bit_mm(mm))
@@ -872,13 +768,69 @@ static inline unsigned long bd_entry_virt_space(struct mm_struct *mm)
 }
 
 /*
- * Return an offset in terms of bytes in to the bounds
- * directory where the bounds directory entry for a given
- * virtual address resides.
- *
- * This has to be in bytes because the directory entries
- * are different sizes on 64/32 bit.
+ * Free the backing physical pages of bounds table 'bt_addr'.
+ * Assume start...end is within that bounds table.
  */
+static noinline int zap_bt_entries_mapping(struct mm_struct *mm,
+		unsigned long bt_addr,
+		unsigned long start_mapping, unsigned long end_mapping)
+{
+	struct vm_area_struct *vma;
+	unsigned long addr, len;
+	unsigned long start;
+	unsigned long end;
+
+	/*
+	 * if we 'end' on a boundary, the offset will be 0 which
+	 * is not what we want.  Back it up a byte to get the
+	 * last bt entry.  Then once we have the entry itself,
+	 * move 'end' back up by the table entry size.
+	 */
+	start = bt_addr + mpx_get_bt_entry_offset_bytes(mm, start_mapping);
+	end   = bt_addr + mpx_get_bt_entry_offset_bytes(mm, end_mapping - 1);
+	/*
+	 * Move end back up by one entry.  Among other things
+	 * this ensures that it remains page-aligned and does
+	 * not screw up zap_page_range()
+	 */
+	end += bt_entry_size_bytes(mm);
+
+	/*
+	 * Find the first overlapping vma. If vma->vm_start > start, there
+	 * will be a hole in the bounds table. This -EINVAL return will
+	 * cause a SIGSEGV.
+	 */
+	vma = find_vma(mm, start);
+	if (!vma || vma->vm_start > start)
+		return -EINVAL;
+
+	/*
+	 * A NUMA policy on a VM_MPX VMA could cause this bounds table to
+	 * be split. So we need to look across the entire 'start -> end'
+	 * range of this bounds table, find all of the VM_MPX VMAs, and
+	 * zap only those.
+	 */
+	addr = start;
+	while (vma && vma->vm_start < end) {
+		/*
+		 * We followed a bounds directory entry down
+		 * here.  If we find a non-MPX VMA, that's bad,
+		 * so stop immediately and return an error.  This
+		 * probably results in a SIGSEGV.
+		 */
+		if (!is_mpx_vma(vma))
+			return -EINVAL;
+
+		len = min(vma->vm_end, end) - addr;
+		zap_page_range(vma, addr, len, NULL);
+		trace_mpx_unmap_zap(addr, addr+len);
+
+		vma = vma->vm_next;
+		addr = vma->vm_start;
+	}
+	return 0;
+}
+
 static unsigned long mpx_get_bd_entry_offset(struct mm_struct *mm,
 		unsigned long addr)
 {
@@ -916,69 +868,80 @@ static unsigned long mpx_get_bd_entry_offset(struct mm_struct *mm,
 	 */
 }
 
-/*
- * If the bounds table pointed by bounds directory 'bd_entry' is
- * not shared, unmap this whole bounds table. Otherwise, only free
- * those backing physical pages of bounds table entries covered
- * in this virtual address region start...end.
- */
-static int unmap_shared_bt(struct mm_struct *mm,
-		long __user *bd_entry, unsigned long start,
-		unsigned long end, bool prev_shared, bool next_shared)
+static int unmap_entire_bt(struct mm_struct *mm,
+		long __user *bd_entry, unsigned long bt_addr)
 {
-	unsigned long bt_addr;
-	unsigned long start_off, end_off;
+	unsigned long expected_old_val = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+	unsigned long uninitialized_var(actual_old_val);
 	int ret;
 
-	ret = get_bt_addr(mm, bd_entry, &bt_addr);
+	while (1) {
+		int need_write = 1;
+		unsigned long cleared_bd_entry = 0;
+
+		pagefault_disable();
+		ret = mpx_cmpxchg_bd_entry(mm, &actual_old_val,
+				bd_entry, expected_old_val, cleared_bd_entry);
+		pagefault_enable();
+		if (!ret)
+			break;
+		if (ret == -EFAULT)
+			ret = mpx_resolve_fault(bd_entry, need_write);
+		/*
+		 * If we could not resolve the fault, consider it
+		 * userspace's fault and error out.
+		 */
+		if (ret)
+			return ret;
+	}
 	/*
-	 * We could see an "error" ret for not-present bounds
-	 * tables (not really an error), or actual errors, but
-	 * stop unmapping either way.
+	 * The cmpxchg was performed, check the results.
 	 */
-	if (ret)
-		return ret;
-
-	start_off = mpx_get_bt_entry_offset_bytes(mm, start);
-	end_off   = mpx_get_bt_entry_offset_bytes(mm, end);
-
-	if (prev_shared && next_shared)
-		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr + start_off,
-				bt_addr + end_off);
-	else if (prev_shared)
-		ret = zap_bt_entries(mm, bt_addr,
-				bt_addr + start_off,
-				bt_addr + mpx_bt_size_bytes(mm));
-	else if (next_shared)
-		ret = zap_bt_entries(mm, bt_addr, bt_addr,
-				bt_addr + end_off);
-	else
-		ret = unmap_single_bt(mm, bd_entry, bt_addr);
-
-	return ret;
+	if (actual_old_val != expected_old_val) {
+		/*
+		 * Someone else raced with us to unmap the table.
+		 * That is OK, since we were both trying to do
+		 * the same thing.  Declare success.
+		 */
+		if (!actual_old_val)
+			return 0;
+		/*
+		 * Something messed with the bounds directory
+		 * entry.  We hold mmap_sem for read or write
+		 * here, so it could not be a _new_ bounds table
+		 * that someone just allocated.  Something is
+		 * wrong, so pass up the error and SIGSEGV.
+		 */
+		return -EINVAL;
+	}
+	/*
+	 * Note, we are likely being called under do_munmap() already. To
+	 * avoid recursion, do_munmap() will check whether it comes
+	 * from one bounds table through VM_MPX flag.
+	 */
+	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm));
 }
 
-/*
- * A virtual address region being munmap()ed might share bounds table
- * with adjacent VMAs. We only need to free the backing physical
- * memory of these shared bounds tables entries covered in this virtual
- * address region.
- */
-static int unmap_edge_bts(struct mm_struct *mm,
-		unsigned long start, unsigned long end)
+static int try_unmap_single_bt(struct mm_struct *mm,
+	       unsigned long start, unsigned long end)
 {
+	struct vm_area_struct *next;
+	struct vm_area_struct *prev;
+	/*
+	 * "bta" == Bounds Table Area: the area controlled by the
+	 * bounds table that we are unmapping.
+	 */
+	unsigned long bta_start_vaddr = start & ~(bd_entry_virt_space(mm)-1);
+	unsigned long bta_end_vaddr = bta_start_vaddr + bd_entry_virt_space(mm);
+	unsigned long uninitialized_var(bt_addr);
+	void __user *bde_vaddr;
 	int ret;
-	long __user *bde_start, *bde_end;
-	struct vm_area_struct *prev, *next;
-	bool prev_shared = false, next_shared = false;
-
-	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
-	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
-
 	/*
-	 * Check whether bde_start and bde_end are shared with adjacent
-	 * VMAs.
+	 * We know 'start' and 'end' lie within an area controlled
+	 * by a single bounds table.  See if there are any other
+	 * VMAs controlled by that bounds table.  If there are not
+	 * then we can "expand" the are we are unmapping to possibly
+	 * cover the entire table.
 	 *
 	 * We already unliked the VMAs from the mm's rbtree so 'start'
 	 * is guaranteed to be in a hole. This gets us the first VMA
@@ -986,102 +949,64 @@ static int unmap_edge_bts(struct mm_struct *mm,
 	 * in to 'next'.
 	 */
 	next = find_vma_prev(mm, start, &prev);
-	if (prev && (mm->bd_addr + mpx_get_bd_entry_offset(mm, prev->vm_end-1))
-			== bde_start)
-		prev_shared = true;
-	if (next && (mm->bd_addr + mpx_get_bd_entry_offset(mm, next->vm_start))
-			== bde_end)
-		next_shared = true;
-
-	/*
-	 * This virtual address region being munmap()ed is only
-	 * covered by one bounds table.
-	 *
-	 * In this case, if this table is also shared with adjacent
-	 * VMAs, only part of the backing physical memory of the bounds
-	 * table need be freeed. Otherwise the whole bounds table need
-	 * be unmapped.
-	 */
-	if (bde_start == bde_end) {
-		return unmap_shared_bt(mm, bde_start, start, end,
-				prev_shared, next_shared);
+	if ((!prev || prev->vm_end <= bta_start_vaddr) &&
+	    (!next || next->vm_start >= bta_end_vaddr)) {
+		/*
+		 * No neighbor VMAs controlled by same bounds
+		 * table.  Try to unmap the whole thing
+		 */
+		start = bta_start_vaddr;
+		end = bta_end_vaddr;
 	}
 
+	bde_vaddr = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
+	ret = get_bt_addr(mm, bde_vaddr, &bt_addr);
 	/*
-	 * If more than one bounds tables are covered in this virtual
-	 * address region being munmap()ed, we need to separately check
-	 * whether bde_start and bde_end are shared with adjacent VMAs.
+	 * No bounds table there, so nothing to unmap.
 	 */
-	ret = unmap_shared_bt(mm, bde_start, start, end, prev_shared, false);
-	if (ret)
-		return ret;
-	ret = unmap_shared_bt(mm, bde_end, start, end, false, next_shared);
+	if (ret == -ENOENT) {
+		ret = 0;
+		return 0;
+	}
 	if (ret)
 		return ret;
-
-	return 0;
+	/*
+	 * We are unmapping an entire table.  Either because the
+	 * unmap that started this whole process was large enough
+	 * to cover an entire table, or that the unmap was small
+	 * but was the area covered by a bounds table.
+	 */
+	if ((start == bta_start_vaddr) &&
+	    (end == bta_end_vaddr))
+		return unmap_entire_bt(mm, bde_vaddr, bt_addr);
+	return zap_bt_entries_mapping(mm, bt_addr, start, end);
 }
 
 static int mpx_unmap_tables(struct mm_struct *mm,
 		unsigned long start, unsigned long end)
 {
-	int ret;
-	long __user *bd_entry, *bde_start, *bde_end;
-	unsigned long bt_addr;
-
+	unsigned long one_unmap_start;
 	trace_mpx_unmap_search(start, end);
-	/*
-	 * "Edge" bounds tables are those which are being used by the region
-	 * (start -> end), but that may be shared with adjacent areas.  If they
-	 * turn out to be completely unshared, they will be freed.  If they are
-	 * shared, we will free the backing store (like an MADV_DONTNEED) for
-	 * areas used by this region.
-	 */
-	ret = unmap_edge_bts(mm, start, end);
-	switch (ret) {
-		/* non-present tables are OK */
-		case 0:
-		case -ENOENT:
-			/* Success, or no tables to unmap */
-			break;
-		case -EINVAL:
-		case -EFAULT:
-		default:
-			return ret;
-	}
-
-	/*
-	 * Only unmap the bounds table that are
-	 *   1. fully covered
-	 *   2. not at the edges of the mapping, even if full aligned
-	 */
-	bde_start = mm->bd_addr + mpx_get_bd_entry_offset(mm, start);
-	bde_end   = mm->bd_addr + mpx_get_bd_entry_offset(mm, end-1);
-	for (bd_entry = bde_start + 1; bd_entry < bde_end; bd_entry++) {
-		ret = get_bt_addr(mm, bd_entry, &bt_addr);
-		switch (ret) {
-			case 0:
-				break;
-			case -ENOENT:
-				/* No table here, try the next one */
-				continue;
-			case -EINVAL:
-			case -EFAULT:
-			default:
-				/*
-				 * Note: we are being strict here.
-				 * Any time we run in to an issue
-				 * unmapping tables, we stop and
-				 * SIGSEGV.
-				 */
-				return ret;
-		}
 
-		ret = unmap_single_bt(mm, bd_entry, bt_addr);
+	one_unmap_start = start;
+	while (one_unmap_start < end) {
+		int ret;
+		unsigned long next_unmap_start = ALIGN(one_unmap_start+1,
+						       bd_entry_virt_space(mm));
+		unsigned long one_unmap_end = end;
+		/*
+		 * if the end is beyond the current bounds table,
+		 * move it back so we only deal with a single one
+		 * at a time
+		 */
+		if (one_unmap_end > next_unmap_start)
+			one_unmap_end = next_unmap_start;
+		ret = try_unmap_single_bt(mm, one_unmap_start, one_unmap_end);
 		if (ret)
 			return ret;
-	}
 
+		one_unmap_start = next_unmap_start;
+	}
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [tip:x86/fpu] x86/mpx: Do not count MPX VMAs as neighbors when unmapping
  2015-06-07 18:37 ` [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping Dave Hansen
  2015-06-09 10:23   ` Ingo Molnar
@ 2015-06-09 12:35   ` tip-bot for Dave Hansen
  1 sibling, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, hpa, torvalds, peterz, mingo, linux-kernel, tglx, dave,
	dave.hansen

Commit-ID:  bea03c50b871a2fa922f31ad7c9993bb4fc7b192
Gitweb:     http://git.kernel.org/tip/bea03c50b871a2fa922f31ad7c9993bb4fc7b192
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:06 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:34 +0200

x86/mpx: Do not count MPX VMAs as neighbors when unmapping

The comment pretty much says it all.

I wrote a test program that does lots of random allocations
and forces bounds tables to be created.  It came up with a
layout like this:

  ....   | BOUNDS DIRECTORY ENTRY COVERS |  ....
         |    BOUNDS TABLE COVERS        |
|  BOUNDS TABLE |  REAL ALLOC | BOUNDS TABLE |

Unmapping "REAL ALLOC" should have been able to free the
bounds table "covering" the "REAL ALLOC" because it was the
last real user.  But, the neighboring VMA bounds tables were
found, considered as real neighbors, and we declined to free
the bounds table covering the area.

Doing this over and over left a small but significant number
of these orphans.  Handling them is fairly straightforward.
All we have to do is walk the VMAs and skip all of the MPX
ones when looking for neighbors.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183706.A6BD90BF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/mpx.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 18fcf73..6233d51 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -937,16 +937,30 @@ static int try_unmap_single_bt(struct mm_struct *mm,
 	void __user *bde_vaddr;
 	int ret;
 	/*
+	 * We already unlinked the VMAs from the mm's rbtree so 'start'
+	 * is guaranteed to be in a hole. This gets us the first VMA
+	 * before the hole in to 'prev' and the next VMA after the hole
+	 * in to 'next'.
+	 */
+	next = find_vma_prev(mm, start, &prev);
+	/*
+	 * Do not count other MPX bounds table VMAs as neighbors.
+	 * Although theoretically possible, we do not allow bounds
+	 * tables for bounds tables so our heads do not explode.
+	 * If we count them as neighbors here, we may end up with
+	 * lots of tables even though we have no actual table
+	 * entries in use.
+	 */
+	while (next && is_mpx_vma(next))
+		next = next->vm_next;
+	while (prev && is_mpx_vma(prev))
+		prev = prev->vm_prev;
+	/*
 	 * We know 'start' and 'end' lie within an area controlled
 	 * by a single bounds table.  See if there are any other
 	 * VMAs controlled by that bounds table.  If there are not
 	 * then we can "expand" the are we are unmapping to possibly
 	 * cover the entire table.
-	 *
-	 * We already unliked the VMAs from the mm's rbtree so 'start'
-	 * is guaranteed to be in a hole. This gets us the first VMA
-	 * before the hole in to 'prev' and the next VMA after the hole
-	 * in to 'next'.
 	 */
 	next = find_vma_prev(mm, start, &prev);
 	if ((!prev || prev->vm_end <= bta_start_vaddr) &&

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [tip:x86/fpu] x86/mpx: Allow 32-bit binaries on 64-bit kernels again
  2015-06-07 18:37 ` [PATCH 19/19] x86, mpx: allow mixed binaries again Dave Hansen
@ 2015-06-09 12:36   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Dave Hansen @ 2015-06-09 12:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, dave, akpm, torvalds, hpa, linux-kernel,
	dave.hansen, tglx

Commit-ID:  97ac46a5087eaf87fd76ff7bb31ec9c896010442
Gitweb:     http://git.kernel.org/tip/97ac46a5087eaf87fd76ff7bb31ec9c896010442
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Sun, 7 Jun 2015 11:37:06 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Jun 2015 12:24:34 +0200

x86/mpx: Allow 32-bit binaries on 64-bit kernels again

Now that the bugs in mixed-mode MPX handling are fixed, allow
32-bit binaries on 64-bit kernels again.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183706.70277DAD@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/mpx.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 6233d51..7a657f5 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -367,12 +367,6 @@ static __user void *mpx_get_bounds_dir(void)
 		return MPX_INVALID_BOUNDS_DIR;
 
 	/*
-	 * 32-bit binaries on 64-bit kernels are currently
-	 * unsupported.
-	 */
-	if (IS_ENABLED(CONFIG_X86_64) && test_thread_flag(TIF_IA32))
-		return MPX_INVALID_BOUNDS_DIR;
-	/*
 	 * The bounds directory pointer is stored in a register
 	 * only accessible if we first do an xsave.
 	 */

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29 22:34 [PATCH 00/19] x86, mpx updates for 4.2 (take 8) Dave Hansen
@ 2015-05-29 22:34 ` Dave Hansen
  0 siblings, 0 replies; 62+ messages in thread
From: Dave Hansen @ 2015-05-29 22:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code appears to be saving off the FPU in a potentially
unsafe way (if eagerfpu=off).  It does not disable preemption or
ensure that the FPU state has been allocated.  All of the
preemption safety comes from the unfortunately-named
'unlazy_fpu()'.

This patch introduces a new helper which will do both of those
things internally.
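
For context, the call pattern that the MPX code adopts later in this
series looks roughly like this (a sketch of the usage, not a complete
function):

	const struct bndcsr *bndcsr;

	bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
	if (!bndcsr)
		return -EINVAL;	/* no MPX state for 'current' */

The NULL check covers both the "FPU never used" case and the
"BNDCSR still in its init state" case, so callers do not have to
care which of the two happened.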

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: the arch/x86 maintainers <x86@kernel.org>

---

Changes from v21:
 * add comments about preemption
 * rename helper to get_xsave_field_ptr()

Changes from "v19":
 * remove 'tsk' argument to get_xsave_addr() since the code
   can only realistically work on 'current', and fix up the
   comment a bit to match.

Changes from "v17":
 * fix s/xstate/xsave_field/ in the function comment
 * remove EXPORT_SYMBOL_GPL()

---

 b/arch/x86/include/asm/fpu/xstate.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c      |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff -puN arch/x86/include/asm/fpu/xstate.h~tsk_get_xsave_addr arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~tsk_get_xsave_addr	2015-05-28 08:49:45.191271502 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h	2015-05-29 13:43:34.291184369 -0700
@@ -41,5 +41,6 @@ extern u64 xstate_fx_sw_bytes[USER_XSTAT
 extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
 
 void *get_xsave_addr(struct xregs_state *xsave, int xstate);
+const void *get_xsave_field_ptr(int xstate_field);
 
 #endif
diff -puN arch/x86/kernel/fpu/xstate.c~tsk_get_xsave_addr arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~tsk_get_xsave_addr	2015-05-28 08:49:45.192271546 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-05-29 12:32:47.869662576 -0700
@@ -427,3 +427,35 @@ void *get_xsave_addr(struct xregs_state
 	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+/*
+ * This wraps up the common operations that need to occur when retrieving
+ * data from xsave state.  It first ensures that the current task was
+ * using the FPU and retrieves the data in to a buffer.  It then calculates
+ * the offset of the requested field in the buffer.
+ *
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xsave_state: state which is defined in xsave.h (e.g. XSTATE_FP,
+ *	XSTATE_SSE, etc...)
+ * Output:
+ *	address of the state in the xsave area or NULL if the state
+ *	is not present or is in its 'init state'.
+ */
+const void *get_xsave_field_ptr(int xsave_state)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	if (!fpu->fpstate_active)
+		return NULL;
+	/*
+	 * fpu__save() takes the CPU's xstate registers
+	 * and saves them off to the 'fpu memory buffer.
+	 */
+	fpu__save(fpu);
+
+	return get_xsave_addr(&fpu->xstate->xsave, xsave_state);
+}
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29 16:10             ` Borislav Petkov
@ 2015-05-29 18:51               ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-05-29 18:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, X86 ML,
	Thomas Gleixner, Dave Hansen, Oleg Nesterov, Rik van Riel,
	Suresh Siddha, Ingo Molnar, H. Peter Anvin, Fenghua Yu,
	Linus Torvalds, Peter Zijlstra


* Borislav Petkov <bp@alien8.de> wrote:

> On Thu, May 28, 2015 at 06:05:33PM -0700, Andy Lutomirski wrote:
> > I would propose that we take the opposite approach and just ban
> > eagerfpu=off when MPX is enabled.  We could then take the next step
> > and default eagerfpu=on for everyone and, if nothing breaks, then just
> > delete lazy mode entirely.
> > 
> > I suspect we'd have to go back to Pentium 3 or earlier to find a CPU
> > on which lazy mode is actually a good idea.
> 
> Last time I checked (and ran some benchmarks) it was only a minute
> slowdown so I say we kill lazy mode if it means significant code
> complexity drop.
> 
> Can I also emulate Greg here and suggest that Pentium 3 people should
> buy newer hw? They should think about the environment, if nothing else.
> 
> :-P

I went back as far as Athlon64 and the CR0 manipulation and CR0 faults are overly 
expensive there too.

Ok, you guys convinced me, I'll do a patch for this in tip:x86/fpu, and then 
people can benchmark it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28 16:02         ` Dave Hansen
@ 2015-05-29 18:49           ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-05-29 18:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu, Linus Torvalds, Peter Zijlstra


* Dave Hansen <dave@sr71.net> wrote:

> On 05/28/2015 08:01 AM, Ingo Molnar wrote:
> > fpu__activate_fpstate_read() will only activate the fpstate for reads (as the name 
> > suggests it).
> 
> I've got no problem doing it this way.  But are you planning to push this 
> function in to 4.2?  Is there a tree you want me to merge this on top of?

Yes and yes, please use tip:x86/fpu (or tip:master) for all FPU related work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29 16:47     ` Dave Hansen
@ 2015-05-29 18:48       ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-05-29 18:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


* Dave Hansen <dave@sr71.net> wrote:

> On 05/28/2015 01:41 AM, Ingo Molnar wrote:
> >> > +	union fpregs_state *xstate;
> >> > +
> >> > +	if (!current->thread.fpu.fpstate_active)
> >> > +		return NULL;
> >> > +	/*
> >> > +	 * fpu__save() takes the CPU's xstate registers
> >> > +	 * and saves them off to the 'fpu memory buffer.
> >> > +	 */
> >> > +	fpu__save(&current->thread.fpu);
> >> > +	xstate = &current->thread.fpu.state;
> >> > +
> >> > +	return get_xsave_addr(&xstate->xsave, xsave_state);
> > Small nit, this would become a lot shorter if you introduced a helper local 
> > variable:
> > 
> > 	struct fpu *fpu = &current->thread.fpu;
> > 
> > But more importantly, for a generic get_xsave_field_ptr() API, fpu__save() is 
> > not enough: fpu__save() will only save FPU registers into memory if necessary 
> > (i.e. if the FPU is already in use), and if you call it on a task with no FPU 
> > state then it will still have an !fpu->fpstate_active FPU state after the 
> > call, with random, invalid data in the xsave area.
> 
> But why does this matter?  We just did a !fpu.fpstate_active check, so we can't 
> have a !fpu.fpstate_active before or after the call.

Ah yes, you are right, I missed this:

> >> > +	if (!current->thread.fpu.fpstate_active)
> >> > +		return NULL;

because the usual pattern is:

		if (!fpu->fpstate_active)
			return NULL;

:-)

So your variant is fine too.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29 18:29               ` Andy Lutomirski
@ 2015-05-29 18:44                 ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-05-29 18:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, X86 ML, Thomas Gleixner, Dave Hansen,
	Oleg Nesterov, Borislav Petkov, Rik van Riel, Suresh Siddha,
	Ingo Molnar, H. Peter Anvin, Fenghua Yu, Linus Torvalds,
	Peter Zijlstra


* Andy Lutomirski <luto@amacapital.net> wrote:

> > It's not that simple, because the decision is not 'lazy versus eager', but 
> > 'mixed lazy/eager versus eager-only':
> >
> > Even on modern machines, if a task is not using the FPU (it's doing integer 
> > only work, with short sleeps just shuffling around requests, etc.) then 
> > context switches get up to 5-10% faster with lazy FPU restores.
> 
> That's only sort of true.  I'd believe that a context switch between two lazy 
> tasks is 5-10% faster than a context switch between two eager tasks.  I bet that 
> a context switch between a lazy task and an eager task is a whole lot slower 
> than a context switch between two eager tasks because manipulating CR0.TS is 
> incredibly slow on all modern CPUs AFAICT.  It's even worse in a VM guest.
> 
> In other words, with lazy restore, we save the XRSTOR(S) and possibly a 
> subsequent XSAVEOPT/XSAVES, but the cost is a MOV to CR0 and possibly a CLTS, 
> and the MOV to CR0 is much, much slower than even a worst-case XRSTOR(S).  In 
> the worst lazy-restore case, we also pay a full exception roundtrip, and 
> everything pales in comparison.  If we're a guest, then there's probably a 
> handful of exits thrown in for good measure.
> 
> For true integer-only tasks, I think we should instead convince glibc to add 
> things like vzeroall in convenient places to force as much xstate as possible to 
> the init state, thus speeding up the optimized save/restore variants.
> 
> I think the fundamental issue here is that CPU designers care about xstate 
> save/restore/optimize performance, but they don't care at all about TS 
> performance, so TS manipulations are probably microcoded and serializing.

That's definitely true.

Btw., potentially being able to get rid of lazy restores was why I wrote the 
FPU-benchmarking code, and it gives these results on reasonably recent Intel CPUs:

CR0 reads are reasonably fast:

 [    0.519287] x86/fpu: Cost of: CR0                         read          :     4 cycles

but we can cache that so it doesn't help us.

writes are bad:

 [    0.528643] x86/fpu: Cost of: CR0                         write         :   208 cycles

and we cannot cache it, so that hurts us.

and a CR0::TS fault cost is horrible:

 [    0.538042] x86/fpu: Cost of: CR0::TS                     fault         :  1156 cycles

and this is hurting us too.

Since the first version I have extended the benchmark with a cache-cold column as 
well - in the cache cold case the difference is even more striking, and in many 
cases context switches are cache cold.

Interestingly, this kind of high cost of CR0 related accesses is true even on 
pretty old, 10+ years old x86 CPUs, per my measurements, so it's not limited to 
modern x86 microarchitectures.

So yes, it would be nice to standardize on synchronous context switching of all 
CPU state.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29 18:17             ` Ingo Molnar
@ 2015-05-29 18:29               ` Andy Lutomirski
  2015-05-29 18:44                 ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Andy Lutomirski @ 2015-05-29 18:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, linux-kernel, X86 ML, Thomas Gleixner, Dave Hansen,
	Oleg Nesterov, Borislav Petkov, Rik van Riel, Suresh Siddha,
	Ingo Molnar, H. Peter Anvin, Fenghua Yu, Linus Torvalds,
	Peter Zijlstra

On Fri, May 29, 2015 at 11:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Thu, May 28, 2015 at 9:24 AM, Dave Hansen <dave@sr71.net> wrote:
>> > On 05/28/2015 08:01 AM, Ingo Molnar wrote:
>> >> But the real question is: can we support in-use MPX with asynchronous lazy
>> >> restore, while it's still semantically correct? I don't think so, unless you add
>> >> MPX specific synchronous restore to the context switch path, which isn't such a
>> >> good idea IMHO.
>> >
>> > Right now, we assume that the first use of the FPU gets an #ND exception to
>> > tell us that someone is using the FPU.  MPX doesn't generate #ND, thus the
>> > need to do it eagerly.
>
> Basically MPX is not really a vector operation, it just uses the xstate (as in
> 'extended CPU state') context area to do easy saves/restores on context switches.
> MPX is an MMU-ish feature.
>
> That's an entirely sensible design approach, which reduces the support code needed
> for MPX, and it's not surprising that MPX accesses were not made conditional on
> CR0::TS.
>
>> > On CPUs that support it we could, instead, do an xgetbv during the context
>> > switch to ensure that all things having an xstate/xfeature but that do not
>> > generate #ND exceptions are in their init state.  If they are not in their
>> > init state, we exit lazy mode.
>
> Yeah, no, we don't need to do anything complex here.
>
> This property is something we know when MPX gets enabled, so for MPX tasks we
> should either simply set _TIF_WORK_CTXSW and let __switch_to_xtra() handle it, or
> should slightly modify the eagerfpu choice code to always do eager restores when
> switching to an MPX task.
>

Do we actually know which tasks use MPX, or do we merely know which
tasks use kernel-assisted MPX?

> Nothing complex is needed to support the mixed lazy/eager model, the current FPU
> code handles it just fine, because it's already a mixed lazy/eager model :-)
>
>> > We could theoretically use the same kind of thing with the compacted xsave
>> > format to ensure that we only allocate enough space for what we *need* in the
>> > xsave buffer and not allocate for the worst-case.  AVX512 has 32x512-bit
>> > registers (2kbytes) and it would be a bit of a shame to need to allocate ~3k
>> > of space.
>>
>> I understand the point of this type of optimization (except that I really don't
>> like the idea of sending SIGBUS or whatever if we fail an allocation at context
>> switch time), but why are we even considering trying to support MPX and lazy fpu
>> at the same time?  Judging from all the bug reports, it seems like it's a giant
>> mess, and the code to support lazy restore is not exactly pretty.
>>
>> I would propose that we take the opposite approach and just ban eagerfpu=off
>> when MPX is enabled.  We could then take the next step and default eagerfpu=on
>> for everyone and, if nothing breaks, then just delete lazy mode entirely.
>>
>> I suspect we'd have to go back to Pentium 3 or earlier to find a CPU on which
>> lazy mode is actually a good idea.  Fiddling with CR0 and handling exceptions is
>> really slow, and I think we should trust CPUs with XSAVEOPT support to do their
>> job and let the older CPUs take the small performance hit, if it even is a
>> performance hit.
>
> It's not that simple, because the decision is not 'lazy versus eager', but 'mixed
> lazy/eager versus eager-only':
>
> Even on modern machines, if a task is not using the FPU (it's doing integer only
> work, with short sleeps just shuffling around requests, etc.) then context
> switches get up to 5-10% faster with lazy FPU restores.

That's only sort of true.  I'd believe that a context switch between
two lazy tasks is 5-10% faster than a context switch between two eager
tasks.  I bet that a context switch between a lazy task and an eager
task is a whole lot slower than a context switch between two eager
tasks because manipulating CR0.TS is incredibly slow on all modern
CPUs AFAICT.  It's even worse in a VM guest.

In other words, with lazy restore, we save the XRSTOR(S) and possibly
a subsequent XSAVEOPT/XSAVES, but the cost is a MOV to CR0 and
possibly a CLTS, and the MOV to CR0 is much, much slower than even a
worst-case XRSTOR(S).  In the worst lazy-restore case, we also pay a
full exception roundtrip, and everything pales in comparison.  If
we're a guest, then there's probably a handful of exits thrown in for
good measure.

For true integer-only tasks, I think we should instead convince glibc
to add things like vzeroall in convenient places to force as much
xstate as possible to the init state, thus speeding up the optimized
save/restore variants.

I think the fundamental issue here is that CPU designers care about
xstate save/restore/optimize performance, but they don't care at all
about TS performance, so TS manipulations are probably microcoded and
serializing.

>
> So we have this dynamic measurement code in place in the lazy case that
> opportunistically enables eagerfpu handling on a per task basis, and that method
> works pretty efficiently and has a good hit rate in isolating FPU-users from
> integer-users.
>
> So it's not 'lazy restores versus eager restores', but:
>
>   - optimized, mixed lazy and eager use
>   vs.
>   - eager-only use
>
> Which is a lot less clear-cut choice.
>
> It's true that right now we forcibly use eagerfpu on all modern CPUs (XSAVE
> supporting ones - in essence modern Intel CPUs) which hides all this - but if you
> re-enable it it's measurable even on Intel systems. On AMD systems it's the
> current state of affairs right now.
>
> Also, I'd like to point out that the FPU code is a lot less of a mess in the
> latest x86/fpu tree! ;-)

That part's certainly true.

>
> I'd not give up on lazy restores just yet - or at least not without much better
> measurements backing it all up...

Fair enough.  I suspect that the only workloads on which it will win
are old 32-bit distros, though -- even integer-only 64-bit workloads
are likely to use SSE2 for things like memcpy.

--Andy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29  1:05           ` Andy Lutomirski
  2015-05-29 15:31             ` Dave Hansen
  2015-05-29 16:10             ` Borislav Petkov
@ 2015-05-29 18:17             ` Ingo Molnar
  2015-05-29 18:29               ` Andy Lutomirski
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2015-05-29 18:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, X86 ML, Thomas Gleixner, Dave Hansen,
	Oleg Nesterov, Borislav Petkov, Rik van Riel, Suresh Siddha,
	Ingo Molnar, H. Peter Anvin, Fenghua Yu, Linus Torvalds,
	Peter Zijlstra


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, May 28, 2015 at 9:24 AM, Dave Hansen <dave@sr71.net> wrote:
> > On 05/28/2015 08:01 AM, Ingo Molnar wrote:
> >> But the real question is: can we support in-use MPX with asynchronous lazy
> >> restore, while it's still semantically correct? I don't think so, unless you add
> >> MPX specific synchronous restore to the context switch path, which isn't such a
> >> good idea IMHO.
> >
> > Right now, we assume that the first use of the FPU gets an #ND exception to 
> > tell us that someone is using the FPU.  MPX doesn't generate #ND, thus the 
> > need to do it eagerly.

Basically MPX is not really a vector operation, it just uses the xstate (as in 
'extended CPU state') context area to do easy saves/restores on context switches. 
MPX is an MMU-ish feature.

That's an entirely sensible design approach, which reduces the support code needed 
for MPX, and it's not surprising that MPX accesses were not made conditional on 
CR0::TS.

> > On CPUs that support it we could, instead, do an xgetbv during the context 
> > switch to ensure that all things having an xstate/xfeature but that do not 
> > generate #ND exceptions are in their init state.  If they are not in their 
> > init state, we exit lazy mode.

Yeah, no, we don't need to do anything complex here.

This property is something we know when MPX gets enabled, so for MPX tasks we 
should either simply set _TIF_WORK_CTXSW and let __switch_to_xtra() handle it, or 
should slightly modify the eagerfpu choice code to always do eager restores when 
switching to an MPX task.

Nothing complex is needed to support the mixed lazy/eager model, the current FPU 
code handles it just fine, because it's already a mixed lazy/eager model :-)

> > We could theoretically use the same kind of thing with the compacted xsave 
> > format to ensure that we only allocate enough space for what we *need* in the 
> > xsave buffer and not allocate for the worst-case.  AVX512 has 32x512-bit 
> > registers (2kbytes) and it would be a bit of a shame to need to allocate ~3k 
> > of space.
> 
> I understand the point of this type of optimization (except that I really don't 
> like the idea of sending SIGBUS or whatever if we fail an allocation at context 
> switch time), but why are we even considering trying to support MPX and lazy fpu 
> at the same time?  Judging from all the bug reports, it seems like it's a giant 
> mess, and the code to support lazy restore is not exactly pretty.
> 
> I would propose that we take the opposite approach and just ban eagerfpu=off 
> when MPX is enabled.  We could then take the next step and default eagerfpu=on 
> for everyone and, if nothing breaks, then just delete lazy mode entirely.
> 
> I suspect we'd have to go back to Pentium 3 or earlier to find a CPU on which 
> lazy mode is actually a good idea.  Fiddling with CR0 and handling exceptions is 
> really slow, and I think we should trust CPUs with XSAVEOPT support to do their 
> job and let the older CPUs take the small performance hit, if it even is a 
> performance hit.

It's not that simple, because the decision is not 'lazy versus eager', but 'mixed 
lazy/eager versus eager-only':

Even on modern machines, if a task is not using the FPU (it's doing integer only 
work, with short sleeps just shuffling around requests, etc.) then context 
switches get up to 5-10% faster with lazy FPU restores.

So we have this dynamic measurement code in place in the lazy case that 
opportunistically enables eagerfpu handling on a per task basis, and that method 
works pretty efficiently and has a good hit rate in isolating FPU-users from 
integer-users.
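
For reference, that dynamic measurement is roughly the per-task fpu.counter
heuristic in the context-switch code; a simplified sketch (field name and
the threshold of 5 are from the mainline code of this era, details
approximate):

/*
 * Sketch: fpu.counter tracks how often the task has recently used the
 * FPU across context switches; once it is "hot" enough, the switch
 * code preloads the FPU state eagerly instead of waiting for a fault.
 */
static inline bool fpu_want_preload(struct fpu *new_fpu)
{
	return new_fpu->fpstate_active &&
	       (use_eager_fpu() || new_fpu->counter > 5);
}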

So it's not 'lazy restores versus eager restores', but:

  - optimized, mixed lazy and eager use
  vs.
  - eager-only use

Which is a much less clear-cut choice.

It's true that right now we forcibly use eagerfpu on all modern CPUs (XSAVE 
supporting ones - in essence modern Intel CPUs) which hides all this - but if you 
re-enable it it's measurable even on Intel systems. On AMD systems it's the 
current state of affairs right now.

Also, I'd like to point out that the FPU code is a lot less of a mess in the 
latest x86/fpu tree! ;-)

I'd not give up on lazy restores just yet - or at least not without much better 
measurements backing it all up...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28  8:41   ` Ingo Molnar
  2015-05-28 14:45     ` Dave Hansen
@ 2015-05-29 16:47     ` Dave Hansen
  2015-05-29 18:48       ` Ingo Molnar
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-29 16:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu

On 05/28/2015 01:41 AM, Ingo Molnar wrote:
>> > +	union fpregs_state *xstate;
>> > +
>> > +	if (!current->thread.fpu.fpstate_active)
>> > +		return NULL;
>> > +	/*
>> > +	 * fpu__save() takes the CPU's xstate registers
>> > +	 * and saves them off to the 'fpu memory buffer.
>> > +	 */
>> > +	fpu__save(&current->thread.fpu);
>> > +	xstate = &current->thread.fpu.state;
>> > +
>> > +	return get_xsave_addr(&xstate->xsave, xsave_state);
> Small nit, this would become a lot shorter if you introduced a helper local 
> variable:
> 
> 	struct fpu *fpu = &current->thread.fpu;
> 
> But more importantly, for a generic get_xsave_field_ptr() API, fpu__save() is not 
> enough: fpu__save() will only save FPU registers into memory if necessary (i.e. if 
> the FPU is already in use), and if you call it on a task with no FPU state then it 
> will still have an !fpu->fpstate_active FPU state after the call, with random, 
> invalid data in the xsave area.

But why does this matter?  We just did a !fpu.fpstate_active check, so
we can't have !fpu.fpstate_active before or after the call.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29  1:05           ` Andy Lutomirski
  2015-05-29 15:31             ` Dave Hansen
@ 2015-05-29 16:10             ` Borislav Petkov
  2015-05-29 18:51               ` Ingo Molnar
  2015-05-29 18:17             ` Ingo Molnar
  2 siblings, 1 reply; 62+ messages in thread
From: Borislav Petkov @ 2015-05-29 16:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Ingo Molnar, linux-kernel, X86 ML, Thomas Gleixner,
	Dave Hansen, Oleg Nesterov, Rik van Riel, Suresh Siddha,
	Ingo Molnar, H. Peter Anvin, Fenghua Yu, Linus Torvalds,
	Peter Zijlstra

On Thu, May 28, 2015 at 06:05:33PM -0700, Andy Lutomirski wrote:
> I would propose that we take the opposite approach and just ban
> eagerfpu=off when MPX is enabled.  We could then take the next step
> and default eagerfpu=on for everyone and, if nothing breaks, then just
> delete lazy mode entirely.
> 
> I suspect we'd have to go back to Pentium 3 or earlier to find a CPU
> on which lazy mode is actually a good idea.

Last time I checked (and ran some benchmarks) it was only a minute
slowdown, so I say we kill lazy mode if it means a significant code
complexity drop.

Can I also emulate Greg here and suggest that Pentium 3 people should
buy newer hw? They should think about the environment, if nothing else.

:-P

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-29  1:05           ` Andy Lutomirski
@ 2015-05-29 15:31             ` Dave Hansen
  2015-05-29 16:10             ` Borislav Petkov
  2015-05-29 18:17             ` Ingo Molnar
  2 siblings, 0 replies; 62+ messages in thread
From: Dave Hansen @ 2015-05-29 15:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, X86 ML, Thomas Gleixner, Dave Hansen,
	Oleg Nesterov, Borislav Petkov, Rik van Riel, Suresh Siddha,
	Ingo Molnar, H. Peter Anvin, Fenghua Yu, Linus Torvalds,
	Peter Zijlstra

On 05/28/2015 06:05 PM, Andy Lutomirski wrote:
> I would propose that we take the opposite approach and just ban
> eagerfpu=off when MPX is enabled.  We could then take the next step
> and default eagerfpu=on for everyone and, if nothing breaks, then just
> delete lazy mode entirely.

No objections from me on this.

It's definitely the simplest thing to do, and it's one less potential
delta that enabling MPX could impose on someone, so it makes me happy on
that front.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28 16:24         ` Dave Hansen
@ 2015-05-29  1:05           ` Andy Lutomirski
  2015-05-29 15:31             ` Dave Hansen
                               ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andy Lutomirski @ 2015-05-29  1:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, linux-kernel, X86 ML, Thomas Gleixner, Dave Hansen,
	Oleg Nesterov, Borislav Petkov, Rik van Riel, Suresh Siddha,
	Ingo Molnar, H. Peter Anvin, Fenghua Yu, Linus Torvalds,
	Peter Zijlstra

On Thu, May 28, 2015 at 9:24 AM, Dave Hansen <dave@sr71.net> wrote:
> On 05/28/2015 08:01 AM, Ingo Molnar wrote:
>> But the real question is: can we support in-use MPX with asynchronous lazy
>> restore, while it's still semantically correct? I don't think so, unless you add
>> MPX specific synchronous restore to the context switch path, which isn't such a
>> good idea IMHO.
>
> Right now, we assume that the first use of the FPU gets an #NM exception
> to tell us that someone is using the FPU.  MPX doesn't generate #NM,
> thus the need to do it eagerly.
>
> On CPUs that support it we could, instead, do an xgetbv during the
> context switch to ensure that all things having an xstate/xfeature but
> that do not generate #NM exceptions are in their init state.  If they
> are not in their init state, we exit lazy mode.
>
> We could theoretically use the same kind of thing with the compacted
> xsave format to ensure that we only allocate enough space for what we
> *need* in the xsave buffer and not allocate for the worst-case.  AVX512
> has 32x512-bit registers (2kbytes) and it would be a bit of a shame to
> need to allocate ~3k of space.

I understand the point of this type of optimization (except that I
really don't like the idea of sending SIGBUS or whatever if we fail an
allocation at context switch time), but why are we even considering
trying to support MPX and lazy fpu at the same time?  Judging from all
the bug reports, it seems like it's a giant mess, and the code to
support lazy restore is not exactly pretty.

I would propose that we take the opposite approach and just ban
eagerfpu=off when MPX is enabled.  We could then take the next step
and default eagerfpu=on for everyone and, if nothing breaks, then just
delete lazy mode entirely.

I suspect we'd have to go back to Pentium 3 or earlier to find a CPU
on which lazy mode is actually a good idea.  Fiddling with CR0 and
handling exceptions is really slow, and I think we should trust CPUs
with XSAVEOPT support to do their job and let the older CPUs take the
small performance hit, if it even is a performance hit.
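
For context, the CR0 fiddling being referred to is roughly the following
dance (heavily simplified sketch, not actual kernel code; the helper
names used are real interfaces of this era, but treat the details as
approximate):

/* Switching away from an FPU-using task under the lazy model: */
static void lazy_switch_away(struct fpu *prev)
{
	if (prev->fpregs_active)
		copy_fpregs_to_fpstate(prev);	/* spill the registers to memory */
	stts();					/* set CR0.TS: the next FPU insn will fault */
}

/* ... and the #NM (device-not-available) fault path, taken when the
 * incoming task eventually touches the FPU: */
static void lazy_fault_in_fpu(void)
{
	clts();					/* clear CR0.TS so FPU insns work again */
	copy_kernel_to_fpregs(&current->thread.fpu.state);	/* reload this task's state */
}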

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28 15:01       ` Ingo Molnar
  2015-05-28 16:02         ` Dave Hansen
@ 2015-05-28 16:24         ` Dave Hansen
  2015-05-29  1:05           ` Andy Lutomirski
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-28 16:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu, Linus Torvalds, Peter Zijlstra

On 05/28/2015 08:01 AM, Ingo Molnar wrote:
> But the real question is: can we support in-use MPX with asynchronous lazy 
> restore, while it's still semantically correct? I don't think so, unless you add 
> MPX specific synchronous restore to the context switch path, which isn't such a 
> good idea IMHO.

Right now, we assume that the first use of the FPU gets an #NM exception
to tell us that someone is using the FPU.  MPX doesn't generate #NM,
thus the need to do it eagerly.

On CPUs that support it we could, instead, do an xgetbv during the
context switch to ensure that all things having an xstate/xfeature but
that do not generate #NM exceptions are in their init state.  If they
are not in their init state, we exit lazy mode.
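
A sketch of what such a check could look like (illustrative only, not a
posted patch; the helper names are invented, and XGETBV with ECX=1 needs
CPUID.(EAX=0Dh,ECX=1):EAX bit 2):

/*
 * Read the XINUSE bitmap via XGETBV with ECX=1 and bail out of lazy
 * mode if any component that never raises #NM (e.g. the MPX
 * BNDREGS/BNDCSR state) has left its init state.
 */
static inline u64 xstate_inuse_mask(void)
{
	u32 eax, edx;

	asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (1));
	return eax | ((u64)edx << 32);
}

static inline bool must_exit_lazy_mode(void)
{
	return xstate_inuse_mask() & (XSTATE_BNDREGS | XSTATE_BNDCSR);
}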

We could theoretically use the same kind of thing with the compacted
xsave format to ensure that we only allocate enough space for what we
*need* in the xsave buffer and not allocate for the worst-case.  AVX512
has 32x512-bit registers (2kbytes) and it would be a bit of a shame to
need to allocate ~3k of space.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28 15:01       ` Ingo Molnar
@ 2015-05-28 16:02         ` Dave Hansen
  2015-05-29 18:49           ` Ingo Molnar
  2015-05-28 16:24         ` Dave Hansen
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-28 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu, Linus Torvalds, Peter Zijlstra

On 05/28/2015 08:01 AM, Ingo Molnar wrote:
> fpu__activate_fpstate_read() will only activate the fpstate for reads (as the name 
> suggests it).

I've got no problem doing it this way.  But are you planning to push
this function into 4.2?  Is there a tree you want me to merge this on
top of?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28 14:45     ` Dave Hansen
@ 2015-05-28 15:01       ` Ingo Molnar
  2015-05-28 16:02         ` Dave Hansen
  2015-05-28 16:24         ` Dave Hansen
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-05-28 15:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu, Linus Torvalds, Peter Zijlstra


* Dave Hansen <dave@sr71.net> wrote:

> On 05/28/2015 01:41 AM, Ingo Molnar wrote:
>
> > What you want here is to make the (in-memory) FPU state valid and current, 
> > before reading it, and the function to use for that is 
> > fpu__activate_fpstate_read() (available in the latest tip:x86/fpu tree).
> 
> Do we really want to unconditionally activate the FPU?
>
> Let's say supporting MPX didn't require eager mode and someone called 
> get_xsave_addr().  We would ideally want to keep the FPU inactive and just 
> return NULL.  Right?

So there's two distinct types of 'active' here:

  - active fpstate (in-kernel memory context buffer)
  - active fpregs  (in-FPU hardware registers)

fpu__activate_fpstate_read() will only activate the fpstate for reads (as the name 
suggests it).

In your hypothetical case, if it's called with lazy FPU state then the fpstate is 
active already, and the fpstate represents the 'real' FPU state of the current 
task - while the FPU's contents are still some previous task's FPU state. So we 
can return the contents of this task's fpstate just fine even if the registers 
themselves are not (yet) loaded with them.

But the real question is: can we support in-use MPX with asynchronous lazy 
restore, while it's still semantically correct? I don't think so, unless you add 
MPX specific synchronous restore to the context switch path, which isn't such a 
good idea IMHO.

Furthermore, I don't think we want to extend lazy FPU use, in fact I'm considering 
getting rid of it altogether, even on old CPUs: the CR0 fault costs are horrible 
all across the CPU spectrum (even for legacy CPUs), and modern user-space makes 
use of the FPU all the time.

Yes, on older CPUs, if user-space does not use the FPU but context switches 
frequently, then the cost of always doing FPU save/restore is measurable, but the 
worst-case I've measured was something like a 10% increase in context switching 
cost.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-28  8:41   ` Ingo Molnar
@ 2015-05-28 14:45     ` Dave Hansen
  2015-05-28 15:01       ` Ingo Molnar
  2015-05-29 16:47     ` Dave Hansen
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-28 14:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu

On 05/28/2015 01:41 AM, Ingo Molnar wrote:
> What you want here is to make the (in-memory) FPU state valid and current, before 
> reading it, and the function to use for that is fpu__activate_fpstate_read() 
> (available in the latest tip:x86/fpu tree).

Do we really want to unconditionally activate the FPU?

Let's say supporting MPX didn't require eager mode and someone called
get_xsave_addr().  We would ideally want to keep the FPU inactive and
just return NULL.  Right?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-27 18:36 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
@ 2015-05-28  8:41   ` Ingo Molnar
  2015-05-28 14:45     ` Dave Hansen
  2015-05-29 16:47     ` Dave Hansen
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2015-05-28  8:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, tglx, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


* Dave Hansen <dave@sr71.net> wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The MPX code appears to be saving off the FPU in an unsafe
> way.  It does not disable preemption or ensure that the
> FPU state has been allocated.  All of the preemption safety
> comes from the unfortunately-named 'unlazy_fpu()'.

Btw., with the new FPU code these functions are named differently, and the bug in 
the MPX code became a lot more obvious:

     copy_fpregs_to_fpstate(&tsk->thread.fpu);
     xsave_buf = &(tsk->thread.fpu.state.xsave);
     bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);

it's indeed generally unsafe to access/copy FPU registers with preemption enabled, 
for two reasons:

  - on older systems that use FSAVE the instruction destroys FPU register
    contents, which has to be handled carefully

  - even on newer systems if we copy to FPU registers (which this code doesn't) 
    then we don't want a context switch to occur in the middle of it, because a 
    context switch will write to the fpstate, potentially overwriting our new data 
    with old FPU state.
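
As an illustration, the preemption-safe pattern this implies, in sketch
form (roughly what fpu__save() wraps internally; simplified, not the
actual implementation):

static void snapshot_current_fpstate(void)
{
	struct fpu *fpu = &current->thread.fpu;

	preempt_disable();
	if (fpu->fpregs_active)
		copy_fpregs_to_fpstate(fpu);	/* registers -> fpu->state */
	preempt_enable();
	/* only the memory copy (fpu->state) is read after this point */
}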

But it's safe to access FPU registers with preemption enabled in a couple of 
special cases:

  - potentially destructively saving FPU registers: the signal handling code does
    this in copy_fpstate_to_sigframe(), because it can rely on the signal restore
    side to restore the original FPU state.

  - reading FPU registers on modern systems: we don't do this anywhere at the
    moment, mostly to keep symmetry with older systems where FSAVE is
    destructive.

  - initializing FPU registers on modern systems: fpu__clear() does this. Here
    it's safe because we don't copy from the fpstate.

  - directly writing FPU registers from user-space memory (!). We do this in
    fpu__restore_sig(), and it's safe because neither context switches nor
    irq-handler FPU use can corrupt the source context of the copy (which is
    user-space memory).

Note that the MPX code's current use of copy_fpregs_to_fpstate() was safe I think, 
because:

 - MPX is predicated on eagerfpu, so the destructive F[N]SAVE instruction won't be 
   used.

 - the code was only reading FPU registers, and was doing it only in places that
   guaranteed that an FPU state was already active (i.e. didn't do it in
   kthreads)

But ... I agree that a more robust API should be used to access FPU registers:

> @@ -427,3 +427,36 @@ void *get_xsave_addr(struct xregs_state
>  	return (void *)xsave + xstate_comp_offsets[feature_nr];
>  }
>  EXPORT_SYMBOL_GPL(get_xsave_addr);
>
> +/*
> + * This wraps up the common operations that need to occur when retrieving
> + * data from xsave state.  It first ensures that the current task was
> + * using the FPU and retrieves the data in to a buffer.  It then calculates
> + * the offset of the requested field in the buffer.
> + *
> + * This function is safe to call whether the FPU is in use or not.
> + *
> + * Note that this only works on the current task.
> + *
> + * Inputs:
> + *	@xsave_state: state which is defined in xsave.h (e.g. XSTATE_FP,
> + *	XSTATE_SSE, etc...)
> + * Output:
> + *	address of the state in the xsave area or NULL if the state
> + *	is not present or is in its 'init state'.
> + */
> +void *get_xsave_field_ptr(int xsave_state)

So this is retrieving (reading) data from FPU registers, but returns a writable 
'void *'. So the return pointer from this interface should be constified, to make 
sure no modifications may occur through it (any such modification would be unsafe).

> +	union fpregs_state *xstate;
> +
> +	if (!current->thread.fpu.fpstate_active)
> +		return NULL;
> +	/*
> +	 * fpu__save() takes the CPU's xstate registers
> +	 * and saves them off to the 'fpu memory buffer.
> +	 */
> +	fpu__save(&current->thread.fpu);
> +	xstate = &current->thread.fpu.state;
> +
> +	return get_xsave_addr(&xstate->xsave, xsave_state);

Small nit, this would become a lot shorter if you introduced a helper local 
variable:

	struct fpu *fpu = &current->thread.fpu;

But more importantly, for a generic get_xsave_field_ptr() API, fpu__save() is not 
enough: fpu__save() will only save FPU registers into memory if necessary (i.e. if 
the FPU is already in use), and if you call it on a task with no FPU state then it 
will still have an !fpu->fpstate_active FPU state after the call, with random, 
invalid data in the xsave area.

What you want here is to make the (in-memory) FPU state valid and current, before 
reading it, and the function to use for that is fpu__activate_fpstate_read() 
(available in the latest tip:x86/fpu tree).
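
Put together, a sketch of the helper with these suggestions folded in
(const return type, a local 'fpu' variable, and the read-activation call,
assuming the tip:x86/fpu prototype takes a struct fpu pointer) might look
like this; it illustrates the review comments, it is not the patch as
posted:

const void *get_xsave_field_ptr(int xsave_state)
{
	struct fpu *fpu = &current->thread.fpu;

	/* keep the documented "NULL when the task has no FPU state" behaviour: */
	if (!fpu->fpstate_active)
		return NULL;

	/* make the in-memory fpstate valid and current before reading it: */
	fpu__activate_fpstate_read(fpu);

	return get_xsave_addr(&fpu->state.xsave, xsave_state);
}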

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-27 18:36 [PATCH 00/19] x86, mpx updates for 4.2 (take 8) Dave Hansen
@ 2015-05-27 18:36 ` Dave Hansen
  2015-05-28  8:41   ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-27 18:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code appears to be saving off the FPU in an unsafe
way.  It does not disable preemption or ensure that the
FPU state has been allocated.  All of the preemption safety
comes from the unfortunately-named 'unlazy_fpu()'.

This patch introduces a new helper which will do both of
those things internally.

Note that this requires a patch from Oleg in order to work
properly.  It is currently in tip/x86/fpu.

> commit f893959b0898bd876673adbeb6798bdf25c034d7
> Author: Oleg Nesterov <oleg@redhat.com>
> Date:   Fri Mar 13 18:30:30 2015 +0100
>
>    x86/fpu: Don't abuse drop_init_fpu() in flush_thread()

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: the arch/x86 maintainers <x86@kernel.org>

---

Changes from v21:
 * add comments about preemption
 * rename helper to get_xsave_field_ptr()

Changes from "v19":
 * remove 'tsk' argument to get_xsave_addr() since the code
   can only realistically work on 'current', and fix up the
   comment a bit to match.

Changes from "v17":
 * fix s/xstate/xsave_field/ in the function comment
 * remove EXPORT_SYMBOL_GPL()

---

 b/arch/x86/include/asm/fpu/xstate.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c      |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff -puN arch/x86/include/asm/fpu/xstate.h~tsk_get_xsave_addr arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~tsk_get_xsave_addr	2015-05-27 09:32:14.928463071 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h	2015-05-27 09:32:14.934463342 -0700
@@ -41,5 +41,6 @@ extern u64 xstate_fx_sw_bytes[USER_XSTAT
 extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
 
 void *get_xsave_addr(struct xregs_state *xsave, int xstate);
+void *get_xsave_field_ptr(int xstate_field);
 
 #endif
diff -puN arch/x86/kernel/fpu/xstate.c~tsk_get_xsave_addr arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~tsk_get_xsave_addr	2015-05-27 09:32:14.930463161 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-05-27 09:32:14.934463342 -0700
@@ -427,3 +427,36 @@ void *get_xsave_addr(struct xregs_state
 	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+/*
+ * This wraps up the common operations that need to occur when retrieving
+ * data from xsave state.  It first ensures that the current task was
+ * using the FPU and retrieves the data in to a buffer.  It then calculates
+ * the offset of the requested field in the buffer.
+ *
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xsave_state: state which is defined in xsave.h (e.g. XSTATE_FP,
+ *	XSTATE_SSE, etc...)
+ * Output:
+ *	address of the state in the xsave area or NULL if the state
+ *	is not present or is in its 'init state'.
+ */
+void *get_xsave_field_ptr(int xsave_state)
+{
+	union fpregs_state *xstate;
+
+	if (!current->thread.fpu.fpstate_active)
+		return NULL;
+	/*
+	 * fpu__save() takes the CPU's xstate registers
+	 * and saves them off to the 'fpu memory buffer.
+	 */
+	fpu__save(&current->thread.fpu);
+	xstate = &current->thread.fpu.state;
+
+	return get_xsave_addr(&xstate->xsave, xsave_state);
+}
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-19  6:25 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
@ 2015-05-19  8:15   ` Thomas Gleixner
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2015-05-19  8:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, dave.hansen, oleg, bp, riel, sbsiddha, luto,
	mingo, hpa, fenghua.yu

On Mon, 18 May 2015, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The MPX code appears to be saving off the FPU in an unsafe
> way.  It does not disable preemption or ensure that the
> FPU state has been allocated.  All of the preemption safety
> comes from the unfortunately-named 'unlazy_fpu()'.
> 
> This patch introduces a new helper which will do both of
> those things internally.
> 
> Note that this requires a patch from Oleg in order to work
> properly.  It is currently in tip/x86/fpu.
> 
> > commit f893959b0898bd876673adbeb6798bdf25c034d7
> > Author: Oleg Nesterov <oleg@redhat.com>
> > Date:   Fri Mar 13 18:30:30 2015 +0100
> >
> >    x86/fpu: Don't abuse drop_init_fpu() in flush_thread()
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: bp@alien8.de
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Suresh Siddha <sbsiddha@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Fenghua Yu <fenghua.yu@intel.com>
> Cc: the arch/x86 maintainers <x86@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer
  2015-05-19  6:25 [PATCH 00/19] x86, mpx updates for 4.2 (take 7) Dave Hansen
@ 2015-05-19  6:25 ` Dave Hansen
  2015-05-19  8:15   ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-19  6:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code appears to be saving off the FPU in an unsafe
way.  It does not disable preemption or ensure that the
FPU state has been allocated.  All of the preemption safety
comes from the unfortunately-named 'unlazy_fpu()'.

This patch introduces a new helper which will do both of
those things internally.

Note that this requires a patch from Oleg in order to work
properly.  It is currently in tip/x86/fpu.

> commit f893959b0898bd876673adbeb6798bdf25c034d7
> Author: Oleg Nesterov <oleg@redhat.com>
> Date:   Fri Mar 13 18:30:30 2015 +0100
>
>    x86/fpu: Don't abuse drop_init_fpu() in flush_thread()

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: the arch/x86 maintainers <x86@kernel.org>

---

Changes from v21:
 * add comments about preemption
 * rename helper to get_xsave_field_ptr()

Changes from "v19":
 * remove 'tsk' argument to get_xsave_addr() since the code
   can only realistically work on 'current', and fix up the
   comment a bit to match.

Changes from "v17":
 * fix s/xstate/xsave_field/ in the function comment
 * remove EXPORT_SYMBOL_GPL()

---

 b/arch/x86/include/asm/xsave.h |    1 +
 b/arch/x86/kernel/xsave.c      |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff -puN arch/x86/include/asm/xsave.h~tsk_get_xsave_addr arch/x86/include/asm/xsave.h
--- a/arch/x86/include/asm/xsave.h~tsk_get_xsave_addr	2015-05-18 17:48:58.222390639 -0700
+++ b/arch/x86/include/asm/xsave.h	2015-05-18 17:48:58.227390864 -0700
@@ -252,6 +252,7 @@ static inline int xrestore_user(struct x
 }
 
 void *get_xsave_addr(struct xsave_struct *xsave, int xstate);
+void *get_xsave_field_ptr(int xstate_field);
 void setup_xstate_comp(void);
 
 #endif
diff -puN arch/x86/kernel/xsave.c~tsk_get_xsave_addr arch/x86/kernel/xsave.c
--- a/arch/x86/kernel/xsave.c~tsk_get_xsave_addr	2015-05-18 17:48:58.224390729 -0700
+++ b/arch/x86/kernel/xsave.c	2015-05-18 17:48:58.227390864 -0700
@@ -750,3 +750,36 @@ void *get_xsave_addr(struct xsave_struct
 	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+/*
+ * This wraps up the common operations that need to occur when retrieving
+ * data from xsave state.  It first ensures that the current task was
+ * using the FPU and retrieves the data in to a buffer.  It then calculates
+ * the offset of the requested field in the buffer.
+ *
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xsave_field: state which is defined in xsave.h (e.g. XSTATE_FP,
+ *	XSTATE_SSE, etc...)
+ * Output:
+ *	address of the state in the xsave area or NULL if the field
+ *	is not present or is in its 'init state'.
+ */
+void *get_xsave_field_ptr(int xsave_field)
+{
+	union thread_xstate *xstate;
+
+	if (!tsk_used_math(current))
+		return NULL;
+	/*
+	 * unlazy_fpu() is poorly named and will actually
+	 * save the xstate off in to the memory buffer.
+	 */
+	unlazy_fpu(current);
+	xstate = current->thread.fpu.state;
+
+	return get_xsave_addr(&xstate->xsave, xsave_field);
+}
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: wrap get_xsave_addr() to make it safer
  2015-05-18 19:38   ` Thomas Gleixner
@ 2015-05-18 19:42     ` Thomas Gleixner
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Gleixner @ 2015-05-18 19:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, dave.hansen, oleg, bp, riel, sbsiddha, luto,
	mingo, hpa, fenghua.yu



On Mon, 18 May 2015, Thomas Gleixner wrote:

> On Fri, 8 May 2015, Dave Hansen wrote:
> > The MPX code appears to be saving off the FPU in an unsafe
> > way.   It does not disable preemption or ensure that the
> > FPU state has been allocated.
> > 
> > This patch introduces a new helper which will do both of
> > those things internally.
> 
> This changelog does not really match the implementation. Unless I'm
> missing something, I can't find anything preemption-related.

Gah. Hit send before finishing the mail.

It's unlazy_fpu (which I agree is a horrible name) which does the
right thing.
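
For reference, a rough sketch of what unlazy_fpu() did in kernels of this
era (simplified from memory, details differ between versions); the
preemption safety the changelog refers to comes from the
preempt_disable()/preempt_enable() pair around the register save:

void unlazy_fpu(struct task_struct *tsk)
{
	preempt_disable();
	if (__thread_has_fpu(tsk))	/* are this task's regs live in the CPU? */
		__save_init_fpu(tsk);	/* spill them into tsk->thread.fpu.state */
	preempt_enable();
}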

> > +
> > +/*
> > + * This wraps up the common operations that need to occur when retrieving
> > + * data from xsave state.  It first ensures that the current task was
> > + * using the FPU and retrieves the data in to a buffer.  It then calculates
> > + * the offset of the requested field in the buffer.
> > + *
> > + * This function is safe to call whether the FPU is in use or not.
> > + *
> > + * Note that this only works on the current task.
> > + *
> > + * Inputs:
> > + *	@xsave_field: state which is defined in xsave.h (e.g. XSTATE_FP,
> > + *	XSTATE_SSE, etc...)
> > + * Output:
> > + *	address of the state in the xsave area.
> 
>   or NULL in case of .....
> 
> > + */
> > +void *get_xsave_field(int xsave_field)
> > +{
> > +	union thread_xstate *xstate;
> > +
> > +	if (!tsk_used_math(current))
> > +		return NULL;
> > +	/*
> > +	 * unlazy_fpu() is poorly named and will actually
> > +	 * save the xstate off in to the memory buffer.
> > +	 */
> > +	unlazy_fpu(current);
> > +	xstate = current->thread.fpu.state;
> > +
> > +	return get_xsave_addr(&xstate->xsave, xsave_field);
> > +}
> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/19] x86, fpu: wrap get_xsave_addr() to make it safer
  2015-05-08 18:59 ` [PATCH 02/19] x86, fpu: wrap get_xsave_addr() to make it safer Dave Hansen
@ 2015-05-18 19:38   ` Thomas Gleixner
  2015-05-18 19:42     ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2015-05-18 19:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, dave.hansen, oleg, bp, riel, sbsiddha, luto,
	mingo, hpa, fenghua.yu

On Fri, 8 May 2015, Dave Hansen wrote:
> The MPX code appears to be saving off the FPU in an unsafe
> way.   It does not disable preemption or ensure that the
> FPU state has been allocated.
> 
> This patch introduces a new helper which will do both of
> those things internally.

This changelog does not really match the implementation. Unless I'm
missing something, I can't find anything preemption-related.

> +
> +/*
> + * This wraps up the common operations that need to occur when retrieving
> + * data from xsave state.  It first ensures that the current task was
> + * using the FPU and retrieves the data in to a buffer.  It then calculates
> + * the offset of the requested field in the buffer.
> + *
> + * This function is safe to call whether the FPU is in use or not.
> + *
> + * Note that this only works on the current task.
> + *
> + * Inputs:
> + *	@xsave_field: state which is defined in xsave.h (e.g. XSTATE_FP,
> + *	XSTATE_SSE, etc...)
> + * Output:
> + *	address of the state in the xsave area.

  or NULL in case of .....

> + */
> +void *get_xsave_field(int xsave_field)
> +{
> +	union thread_xstate *xstate;
> +
> +	if (!tsk_used_math(current))
> +		return NULL;
> +	/*
> +	 * unlazy_fpu() is poorly named and will actually
> +	 * save the xstate off in to the memory buffer.
> +	 */
> +	unlazy_fpu(current);
> +	xstate = current->thread.fpu.state;
> +
> +	return get_xsave_addr(&xstate->xsave, xsave_field);
> +}

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 02/19] x86, fpu: wrap get_xsave_addr() to make it safer
  2015-05-08 18:59 [PATCH 00/19] x86, mpx updates for 4.2 (take 6) Dave Hansen
@ 2015-05-08 18:59 ` Dave Hansen
  2015-05-18 19:38   ` Thomas Gleixner
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Hansen @ 2015-05-08 18:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, Dave Hansen, dave.hansen, oleg, bp, riel, sbsiddha,
	luto, mingo, hpa, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

Changes from "v19":
 * remove 'tsk' argument to get_xsave_addr() since the code
   can only realistically work on 'current', and fix up the
   comment a bit to match.

Changes from "v17":
 * fix s/xstate/xsave_field/ in the function comment
 * remove EXPORT_SYMBOL_GPL()

---
From: Dave Hansen <dave.hansen@linux.intel.com>

The MPX code appears to be saving off the FPU in an unsafe
way.   It does not disable preemption or ensure that the
FPU state has been allocated.

This patch introduces a new helper which will do both of
those things internally.

Note that this requires a patch from Oleg in order to work
properly.  It is currently in tip/x86/fpu.

> commit f893959b0898bd876673adbeb6798bdf25c034d7
> Author: Oleg Nesterov <oleg@redhat.com>
> Date:   Fri Mar 13 18:30:30 2015 +0100
>
>    x86/fpu: Don't abuse drop_init_fpu() in flush_thread()

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: bp@alien8.de
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: the arch/x86 maintainers <x86@kernel.org>
---

 b/arch/x86/include/asm/xsave.h |    1 +
 b/arch/x86/kernel/xsave.c      |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff -puN arch/x86/include/asm/xsave.h~tsk_get_xsave_addr arch/x86/include/asm/xsave.h
--- a/arch/x86/include/asm/xsave.h~tsk_get_xsave_addr	2015-05-08 11:46:10.973580863 -0700
+++ b/arch/x86/include/asm/xsave.h	2015-05-08 11:46:10.978581089 -0700
@@ -252,6 +252,7 @@ static inline int xrestore_user(struct x
 }
 
 void *get_xsave_addr(struct xsave_struct *xsave, int xstate);
+void *get_xsave_field(int xstate_field);
 void setup_xstate_comp(void);
 
 #endif
diff -puN arch/x86/kernel/xsave.c~tsk_get_xsave_addr arch/x86/kernel/xsave.c
--- a/arch/x86/kernel/xsave.c~tsk_get_xsave_addr	2015-05-08 11:46:10.975580953 -0700
+++ b/arch/x86/kernel/xsave.c	2015-05-08 11:46:10.978581089 -0700
@@ -749,3 +749,35 @@ void *get_xsave_addr(struct xsave_struct
 	return (void *)xsave + xstate_comp_offsets[feature_nr];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+/*
+ * This wraps up the common operations that need to occur when retrieving
+ * data from xsave state.  It first ensures that the current task was
+ * using the FPU and retrieves the data in to a buffer.  It then calculates
+ * the offset of the requested field in the buffer.
+ *
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xsave_field: state which is defined in xsave.h (e.g. XSTATE_FP,
+ *	XSTATE_SSE, etc...)
+ * Output:
+ *	address of the state in the xsave area.
+ */
+void *get_xsave_field(int xsave_field)
+{
+	union thread_xstate *xstate;
+
+	if (!tsk_used_math(current))
+		return NULL;
+	/*
+	 * unlazy_fpu() is poorly named and will actually
+	 * save the xstate off in to the memory buffer.
+	 */
+	unlazy_fpu(current);
+	xstate = current->thread.fpu.state;
+
+	return get_xsave_addr(&xstate->xsave, xsave_field);
+}
_

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2015-06-09 12:38 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-07 18:37 [PATCH 00/19] x86, mpx updates for 4.2 (take 9) Dave Hansen
2015-06-07 18:37 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
2015-06-09 12:31   ` [tip:x86/fpu] x86/fpu/xstate: " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 01/19] x86, mpx, xsave: Fix up bad get_xsave_addr() assumptions Dave Hansen
2015-06-09 12:30   ` [tip:x86/fpu] x86/fpu/xstate: " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 03/19] x86, mpx: Use new get_xsave_field_ptr() Dave Hansen
2015-06-09 12:31   ` [tip:x86/fpu] x86/mpx: Use the new get_xsave_field_ptr()API tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 05/19] x86, mpx: remove redundant MPX_BNDCFG_ADDR_MASK Dave Hansen
2015-06-09 12:32   ` [tip:x86/fpu] x86/mpx: Remove " tip-bot for Qiaowei Ren
2015-06-07 18:37 ` [PATCH 06/19] x86, mpx: Restrict mmap size check to bounds tables Dave Hansen
2015-06-09 12:32   ` [tip:x86/fpu] x86/mpx: Restrict the mmap() " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 04/19] x86, mpx: Cleanup: Do not pass task around when unnecessary Dave Hansen
2015-06-09 12:31   ` [tip:x86/fpu] x86/mpx: Clean up the code by not passing a task pointer " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 07/19] x86, mpx: boot-time disable Dave Hansen
2015-06-09 12:32   ` [tip:x86/fpu] x86/mpx: Introduce a boot-time disable flag tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 09/19] x86, mpx: trace entry to bounds exception paths Dave Hansen
2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: Trace " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 08/19] x86, mpx: trace #BR exceptions Dave Hansen
2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: Trace " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 10/19] x86, mpx: Trace the attempts to find bounds tables Dave Hansen
2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 13/19] x86, mpx: Add temporary variable to reduce masking Dave Hansen
2015-06-09 12:34   ` [tip:x86/fpu] x86/mpx: " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 14/19] x86, mpx: new directory entry to addr helper Dave Hansen
2015-06-09 12:34   ` [tip:x86/fpu] x86/mpx: Introduce new 'directory entry' to 'addr' helper function tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 12/19] x86: make is_64bit_mm() widely available Dave Hansen
2015-06-09 12:34   ` [tip:x86/fpu] x86: Make " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 11/19] x86, mpx: trace allocation of new bounds tables Dave Hansen
2015-06-09 12:33   ` [tip:x86/fpu] x86/mpx: Trace " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 15/19] x86, mpx: do 32-bit-only cmpxchg for 32-bit apps Dave Hansen
2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Use 32-bit-only cmpxchg() " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 16/19] x86, mpx: support 32-bit binaries on 64-bit kernel Dave Hansen
2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Support 32-bit binaries on 64-bit kernels tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 18/19] x86, mpx: do not count MPX VMAs as neighbors when unmapping Dave Hansen
2015-06-09 10:23   ` Ingo Molnar
2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Do " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 17/19] x86, mpx: rewrite unmap code Dave Hansen
2015-06-09 12:35   ` [tip:x86/fpu] x86/mpx: Rewrite the " tip-bot for Dave Hansen
2015-06-07 18:37 ` [PATCH 19/19] x86, mpx: allow mixed binaries again Dave Hansen
2015-06-09 12:36   ` [tip:x86/fpu] x86/mpx: Allow 32-bit binaries on 64-bit kernels again tip-bot for Dave Hansen
  -- strict thread matches above, loose matches on Subject: below --
2015-05-29 22:34 [PATCH 00/19] x86, mpx updates for 4.2 (take 8) Dave Hansen
2015-05-29 22:34 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
2015-05-27 18:36 [PATCH 00/19] x86, mpx updates for 4.2 (take 8) Dave Hansen
2015-05-27 18:36 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
2015-05-28  8:41   ` Ingo Molnar
2015-05-28 14:45     ` Dave Hansen
2015-05-28 15:01       ` Ingo Molnar
2015-05-28 16:02         ` Dave Hansen
2015-05-29 18:49           ` Ingo Molnar
2015-05-28 16:24         ` Dave Hansen
2015-05-29  1:05           ` Andy Lutomirski
2015-05-29 15:31             ` Dave Hansen
2015-05-29 16:10             ` Borislav Petkov
2015-05-29 18:51               ` Ingo Molnar
2015-05-29 18:17             ` Ingo Molnar
2015-05-29 18:29               ` Andy Lutomirski
2015-05-29 18:44                 ` Ingo Molnar
2015-05-29 16:47     ` Dave Hansen
2015-05-29 18:48       ` Ingo Molnar
2015-05-19  6:25 [PATCH 00/19] x86, mpx updates for 4.2 (take 7) Dave Hansen
2015-05-19  6:25 ` [PATCH 02/19] x86, fpu: Wrap get_xsave_addr() to make it safer Dave Hansen
2015-05-19  8:15   ` Thomas Gleixner
2015-05-08 18:59 [PATCH 00/19] x86, mpx updates for 4.2 (take 6) Dave Hansen
2015-05-08 18:59 ` [PATCH 02/19] x86, fpu: wrap get_xsave_addr() to make it safer Dave Hansen
2015-05-18 19:38   ` Thomas Gleixner
2015-05-18 19:42     ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).