kernel-hardening.lists.openwall.com archive mirror
* [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution
@ 2018-01-13 18:17 Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 1/9] Documentation: document array_ptr Dan Williams
                   ` (8 more replies)
  0 siblings, 9 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, kernel-hardening, Peter Zijlstra, Catalin Marinas,
	Will Deacon, H. Peter Anvin, Elena Reshetova, linux-arch,
	Andi Kleen, Jonathan Corbet, x86, Russell King, Ingo Molnar,
	Alan Cox, Tom Lendacky, Kees Cook, Al Viro, tglx, alan, gregkh,
	akpm, torvalds

Changes since v2 [1]:
* style fix in Documentation/speculation.txt (Geert)

* add Russell and Catalin to the cc on the ARM patches (Russell)

* clarify changelog for "x86: introduce __uaccess_begin_nospec and
  ASM_IFENCE" (Eric, Linus, Josh)

* fix the dynamic 'mask' / 'ifence' toggle vs CONFIG_JUMP_LABEL=n
  (Peter)

* include the get_user_{1,2,4,8} helpers in the ASM_IFENCE protections
  (Linus)

* fix array_ptr_mask for ARCH=i386 builds (Kbuild robot)

* prioritize the get_user protections, and the fdtable fix

[1]: https://lwn.net/Articles/744141/

---

Quoting Mark's original RFC:

"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [2]
and the Documentation patch in this series."

This series incorporates Mark Rutland's latest ARM changes and adds
the x86-specific implementation of 'ifence_array_ptr'. That
ifence-based approach is provided as an opt-in fallback, but the
default mitigation, '__array_ptr', uses a 'mask' approach that removes
conditional branch instructions, and otherwise aims to redirect
speculation to a NULL pointer rather than a user-controlled value.

The mask is generated by the following expression, from Alexei and Linus:

    mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1);

...and Linus provided an optimized mask generation helper for x86:

    asm ("cmpq %1,%2; sbbq %0,%0;"
		:"=r" (mask)
		:"r"(sz),"r" (idx)
		:"cc");

The 'array_ptr' mechanism can be switched between 'mask' and 'ifence'
via the spectre_v1={mask,ifence} command line option if
CONFIG_SPECTRE1_DYNAMIC=y, and the compile-time default is otherwise set
by selecting either CONFIG_SPECTRE1_MASK or CONFIG_SPECTRE1_IFENCE. This
level of sophistication is provided given concerns about 'value
speculation' [3].
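
For example (illustrative usage, not spelled out in the original
posting), booting with:

    spectre_v1=ifence

...selects the ifence implementation when CONFIG_SPECTRE1_DYNAMIC=y.
Since kernel/nospec.c registers the parameter with mode 0600, the same
toggle should also be writable at runtime via the module parameter
sysfs interface.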

The get_user protections and 'array_ptr' infrastructure are the only
concern of this patch set. Going forward 'array_ptr' is a tool that
sub-system maintainers can use to instrument array bounds checks like
'__fcheck_files'. When to use 'array_ptr' is saved for a future patch
set, and in the meantime the 'get_user' protections raise the bar for
launching a Spectre-v1 attack.

These patches are also available via the 'nospec-v3' git branch here:

    git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec-v3

Note that the BPF fix for Spectre variant1 is merged for 4.15-rc8.

[2]: https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[3]: https://marc.info/?l=linux-netdev&m=151527996901350&w=2

---

Dan Williams (6):
      x86: implement ifence()
      x86: implement ifence_array_ptr() and array_ptr_mask()
      asm/nospec: mask speculative execution flows
      x86: introduce __uaccess_begin_nospec and ASM_IFENCE
      x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
      vfs, fdtable: prevent bounds-check bypass via speculative execution

Mark Rutland (3):
      Documentation: document array_ptr
      arm64: implement ifence_array_ptr()
      arm: implement ifence_array_ptr()


 Documentation/speculation.txt     |  143 +++++++++++++++++++++++++++++++++++++
 arch/arm/Kconfig                  |    1 
 arch/arm/include/asm/barrier.h    |   24 ++++++
 arch/arm64/Kconfig                |    1 
 arch/arm64/include/asm/barrier.h  |   24 ++++++
 arch/x86/Kconfig                  |    3 +
 arch/x86/include/asm/barrier.h    |   50 +++++++++++++
 arch/x86/include/asm/msr.h        |    3 -
 arch/x86/include/asm/smap.h       |    4 +
 arch/x86/include/asm/uaccess.h    |   16 +++-
 arch/x86/include/asm/uaccess_32.h |    6 +-
 arch/x86/include/asm/uaccess_64.h |   12 ++-
 arch/x86/lib/copy_user_64.S       |    3 +
 arch/x86/lib/getuser.S            |    5 +
 arch/x86/lib/usercopy_32.c        |    8 +-
 include/linux/fdtable.h           |    7 +-
 include/linux/nospec.h            |   92 ++++++++++++++++++++++++
 kernel/Kconfig.nospec             |   46 ++++++++++++
 kernel/Makefile                   |    1 
 kernel/nospec.c                   |   52 +++++++++++++
 lib/Kconfig                       |    3 +
 21 files changed, 484 insertions(+), 20 deletions(-)
 create mode 100644 Documentation/speculation.txt
 create mode 100644 include/linux/nospec.h
 create mode 100644 kernel/Kconfig.nospec
 create mode 100644 kernel/nospec.c

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 1/9] Documentation: document array_ptr
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
@ 2018-01-13 18:17 ` Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 2/9] arm64: implement ifence_array_ptr() Dan Williams
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, linux-arch, kernel-hardening, Peter Zijlstra,
	gregkh, Jonathan Corbet, Will Deacon, tglx, torvalds, akpm, alan

From: Mark Rutland <mark.rutland@arm.com>

Document the rationale and usage of the new array_ptr() helper.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/speculation.txt |  143 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100644 Documentation/speculation.txt

diff --git a/Documentation/speculation.txt b/Documentation/speculation.txt
new file mode 100644
index 000000000000..1e59d1d9eaf4
--- /dev/null
+++ b/Documentation/speculation.txt
@@ -0,0 +1,143 @@
+This document explains potential effects of speculation, and how undesirable
+effects can be mitigated portably using common APIs.
+
+===========
+Speculation
+===========
+
+To improve performance and minimize average latencies, many contemporary CPUs
+employ speculative execution techniques such as branch prediction, performing
+work which may be discarded at a later stage.
+
+Typically speculative execution cannot be observed from architectural state,
+such as the contents of registers. However, in some cases it is possible to
+observe its impact on microarchitectural state, such as the presence or
+absence of data in caches. Such state may form side-channels which can be
+observed to extract secret information.
+
+For example, in the presence of branch prediction, it is possible for bounds
+checks to be ignored by code which is speculatively executed. Consider the
+following code:
+
+	int load_array(int *array, unsigned int idx)
+	{
+		if (idx >= MAX_ARRAY_ELEMS)
+			return 0;
+		else
+			return array[idx];
+	}
+
+On arm64, this may be compiled to an assembly sequence such as:
+
+	CMP	<idx>, #MAX_ARRAY_ELEMS
+	B.LT	less
+	MOV	<returnval>, #0
+	RET
+  less:
+	LDR	<returnval>, [<array>, <idx>]
+	RET
+
+It is possible that a CPU mis-predicts the conditional branch, and
+speculatively loads array[idx], even if idx >= MAX_ARRAY_ELEMS. This value
+will subsequently be discarded, but the speculated load may affect
+microarchitectural state which can be subsequently measured.
+
+More complex sequences involving multiple dependent memory accesses may result
+in sensitive information being leaked. Consider the following code, building
+on the prior example:
+
+	int load_dependent_arrays(int *arr1, int *arr2, int idx)
+	{
+		int val1, val2;
+
+		val1 = load_array(arr1, idx);
+		val2 = load_array(arr2, val1);
+
+		return val2;
+	}
+
+Under speculation, the first call to load_array() may return the value of an
+out-of-bounds address, while the second call will influence microarchitectural
+state dependent on this value. This may provide an arbitrary read primitive.
+
+====================================
+Mitigating speculation side-channels
+====================================
+
+The kernel provides a generic API to ensure that bounds checks are respected
+even under speculation. Architectures which are affected by speculation-based
+side-channels are expected to implement these primitives.
+
+The array_ptr() helper in <asm/barrier.h> can be used to prevent
+information from being leaked via side-channels.
+
+A call to array_ptr(arr, idx, sz) returns a sanitized pointer to
+arr[idx] only if idx falls in the [0, sz) interval; when idx is negative
+or idx >= sz, NULL is returned. Additionally, array_ptr() ensures that
+an out-of-bounds pointer is not propagated to speculatively executed code.
+
+This can be used to protect the earlier load_array() example:
+
+	int load_array(int *array, unsigned int idx)
+	{
+		int *elem;
+
+		elem = array_ptr(array, idx, MAX_ARRAY_ELEMS);
+		if (elem)
+			return *elem;
+		else
+			return 0;
+	}
+
+This can also be used in situations where multiple fields on a structure are
+accessed:
+
+	struct foo array[SIZE];
+	int a, b;
+
+	void do_thing(int idx)
+	{
+		struct foo *elem;
+
+		elem = array_ptr(array, idx, SIZE);
+		if (elem) {
+			a = elem->field_a;
+			b = elem->field_b;
+		}
+	}
+
+It is imperative that the returned pointer is used. Pointers which are
+generated separately are subject to a number of potential CPU and compiler
+optimizations, and may still be used speculatively. For example, this means
+that the following sequence is unsafe:
+
+	struct foo array[SIZE];
+	int a, b;
+
+	void do_thing(int idx)
+	{
+		if (array_ptr(array, idx, SIZE) != NULL) {
+			// unsafe as wrong pointer is used
+			a = array[idx].field_a;
+			b = array[idx].field_b;
+		}
+	}
+
+Similarly, it is unsafe to compare the returned pointer with other pointers,
+as this may permit the compiler to substitute one pointer with another,
+permitting speculation. For example, the following sequence is unsafe:
+
+	struct foo array[SIZE];
+	int a, b;
+
+	void do_thing(int idx)
+	{
+		struct foo *elem = array_ptr(array, idx, SIZE);
+
+		// unsafe due to pointer substitution
+		if (elem == &array[idx]) {
+			a = elem->field_a;
+			b = elem->field_b;
+		}
+	}
+

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 2/9] arm64: implement ifence_array_ptr()
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 1/9] Documentation: document array_ptr Dan Williams
@ 2018-01-13 18:17 ` Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 3/9] arm: " Dan Williams
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, linux-arch, kernel-hardening, Peter Zijlstra,
	Catalin Marinas, Will Deacon, gregkh, tglx, torvalds, akpm, alan

From: Mark Rutland <mark.rutland@arm.com>

This patch implements ifence_array_ptr() for arm64, using an
LDR+CSEL+CSDB sequence to inhibit speculative use of the returned value.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm64/include/asm/barrier.h |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 77651c49ef44..74ffcddb26e6 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -40,6 +40,30 @@
 #define dma_rmb()	dmb(oshld)
 #define dma_wmb()	dmb(oshst)
 
+#define ifence_array_ptr(arr, idx, sz)					\
+({									\
+	typeof(&(arr)[0]) __nap_arr = (arr);				\
+	typeof(idx) __nap_idx = (idx);					\
+	typeof(sz) __nap_sz = (sz);					\
+									\
+	unsigned long __nap_ptr = (unsigned long)__nap_arr +		\
+				  sizeof(__nap_arr[0]) * __nap_idx;	\
+									\
+	asm volatile(							\
+	"	cmp	%[i], %[s]\n"					\
+	"	b.cs	1f\n"						\
+	"	ldr	%[p], %[pp]\n"					\
+	"1:	csel	%[p], %[p], xzr, cc\n"				\
+	"	hint	#0x14 // CSDB\n"				\
+	: [p] "=&r" (__nap_ptr)						\
+	: [pp] "m" (__nap_ptr),						\
+	  [i] "r" ((unsigned long)__nap_idx),				\
+	  [s] "r" ((unsigned long)__nap_sz)				\
+	: "cc");							\
+									\
+	(typeof(&(__nap_arr)[0]))__nap_ptr;				\
+})
+
 #define __smp_mb()	dmb(ish)
 #define __smp_rmb()	dmb(ishld)
 #define __smp_wmb()	dmb(ishst)

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 3/9] arm: implement ifence_array_ptr()
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 1/9] Documentation: document array_ptr Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 2/9] arm64: implement ifence_array_ptr() Dan Williams
@ 2018-01-13 18:17 ` Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 4/9] x86: implement ifence() Dan Williams
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, linux-arch, kernel-hardening, gregkh, Russell King,
	tglx, torvalds, akpm, alan

From: Mark Rutland <mark.rutland@arm.com>

This patch implements ifence_array_ptr() for arm, using an
LDR+MOVCS+CSDB sequence to inhibit speculative use of the returned
value.

Cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/include/asm/barrier.h |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 40f5c410fd8c..919235ed6e68 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,30 @@ extern void arm_heavy_mb(void);
 #define dma_wmb()	barrier()
 #endif
 
+#define ifence_array_ptr(arr, idx, sz)					\
+({									\
+	typeof(&(arr)[0]) __nap_arr = (arr);				\
+	typeof(idx) __nap_idx = (idx);					\
+	typeof(sz) __nap_sz = (sz);					\
+									\
+	unsigned long __nap_ptr = (unsigned long)__nap_arr +		\
+				  sizeof(__nap_arr[0]) * __nap_idx;	\
+									\
+	asm volatile(							\
+	"	cmp	%[i], %[s]\n"					\
+	"	bcs	1f\n"						\
+	"	ldr	%[p], %[pp]\n"					\
+	"1:	movcs	%[p], #0\n"					\
+	"	.inst	0xe320f018 @ CSDB\n"				\
+	: [p] "=&r" (__nap_ptr)						\
+	: [pp] "m" (__nap_ptr),						\
+	  [i] "r" ((unsigned long)__nap_idx),				\
+	  [s] "r" ((unsigned long)__nap_sz)				\
+	: "cc");							\
+									\
+	(typeof(&(__nap_arr)[0]))__nap_ptr;				\
+})
+
 #define __smp_mb()	dmb(ish)
 #define __smp_rmb()	__smp_mb()
 #define __smp_wmb()	dmb(ishst)

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 4/9] x86: implement ifence()
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
                   ` (2 preceding siblings ...)
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 3/9] arm: " Dan Williams
@ 2018-01-13 18:17 ` Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 5/9] x86: implement ifence_array_ptr() and array_ptr_mask() Dan Williams
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, Tom Lendacky, linux-arch, gregkh, Peter Zijlstra,
	Alan Cox, x86, Ingo Molnar, H. Peter Anvin, kernel-hardening,
	tglx, torvalds, akpm, Elena Reshetova, alan

The new barrier, 'ifence', ensures that speculative execution never
crosses the fence.

Previously the kernel only needed this fence in 'rdtsc_ordered', but now
it is also proposed as a mitigation against Spectre variant1 attacks.
When used, it needs to be placed in the success path after a bounds
check, i.e.:

	if (x < max) {
		ifence();
		val = array[x];
	} else
		return -EINVAL;

With this change the cpu will never issue speculative reads of
'array + x' with values of x >= max.

'ifence', via 'ifence_array_ptr', is an opt-in fallback to the default
mitigation provided by '__array_ptr'. It is also proposed for blocking
speculation that would bypass 'access_ok' checks in the 'get_user'
path. For now, just provide the common definition for later patches to
build upon.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Alan Cox <alan.cox@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/barrier.h |    4 ++++
 arch/x86/include/asm/msr.h     |    3 +--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7fb336210e1b..b04f572d6d97 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,6 +24,10 @@
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
+/* prevent speculative execution past this barrier */
+#define ifence() alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC, \
+				   "lfence", X86_FEATURE_LFENCE_RDTSC)
+
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()	rmb()
 #else
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 07962f5f6fba..e426d2a33ff3 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -214,8 +214,7 @@ static __always_inline unsigned long long rdtsc_ordered(void)
 	 * that some other imaginary CPU is updating continuously with a
 	 * time stamp.
 	 */
-	alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC,
-			  "lfence", X86_FEATURE_LFENCE_RDTSC);
+	ifence();
 	return rdtsc();
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 5/9] x86: implement ifence_array_ptr() and array_ptr_mask()
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
                   ` (3 preceding siblings ...)
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 4/9] x86: implement ifence() Dan Williams
@ 2018-01-13 18:17 ` Dan Williams
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 6/9] asm/nospec: mask speculative execution flows Dan Williams
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, kernel-hardening, gregkh, x86, Ingo Molnar,
	H. Peter Anvin, tglx, torvalds, akpm, alan

'ifence_array_ptr' is provided as an alternative to the default
'__array_ptr' implementation, which uses a mask to sanitize
user-controllable pointers. Later patches will allow it to be selected
via the kernel command line. The '__array_ptr' implementation otherwise
appears safe for current-generation cpus across multiple architectures.

'array_ptr_mask' is used by the default 'array_ptr' implementation to
cheaply calculate an array bounds mask.
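
For reference (an editorial sketch, not part of the original posting),
the cmp/sbb sequence in the patch below is semantically equivalent to:

	/* ~0UL when the bounds check passes, 0 when it fails */
	mask = (idx < sz) ? ~0UL : 0UL;

...except that cmp/sbb derives the mask from the carry flag, so no
conditional branch is emitted for the cpu to mispredict.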

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/barrier.h |   46 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index b04f572d6d97..1b507f9e2cc7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -28,6 +28,52 @@
 #define ifence() alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC, \
 				   "lfence", X86_FEATURE_LFENCE_RDTSC)
 
+/**
+ * ifence_array_ptr - Generate a pointer to an array element,
+ * ensuring the pointer is bounded under speculation.
+ *
+ * @arr: the base of the array
+ * @idx: the index of the element
+ * @sz: the number of elements in the array
+ *
+ * If @idx falls in the interval [0, @sz), returns the pointer to
+ * @arr[@idx], otherwise returns NULL.
+ */
+#define ifence_array_ptr(arr, idx, sz)					\
+({									\
+	typeof(*(arr)) *__arr = (arr), *__ret;				\
+	typeof(idx) __idx = (idx);					\
+	typeof(sz) __sz = (sz);						\
+									\
+	__ret = __idx < __sz ? __arr + __idx : NULL;			\
+	ifence();							\
+	__ret;								\
+})
+
+/**
+ * array_ptr_mask - generate a mask for array_ptr() that is ~0UL when
+ * the bounds check succeeds and 0 otherwise
+ */
+#define array_ptr_mask array_ptr_mask
+static inline unsigned long array_ptr_mask(unsigned long idx, unsigned long sz)
+{
+	unsigned long mask;
+
+	/*
+	 * cmp computes idx - sz, setting the carry flag iff idx < sz;
+	 * sbb then yields mask = ~0 for a valid index, 0 otherwise.
+	 */
+#ifdef CONFIG_X86_32
+	asm ("cmpl %1,%2; sbbl %0,%0;"
+#else
+	asm ("cmpq %1,%2; sbbq %0,%0;"
+#endif
+			:"=r" (mask)
+			:"r"(sz),"r" (idx)
+			:"cc");
+	return mask;
+}
+
 #ifdef CONFIG_X86_PPRO_FENCE
 #define dma_rmb()	rmb()
 #else

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 6/9] asm/nospec: mask speculative execution flows
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
                   ` (4 preceding siblings ...)
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 5/9] x86: implement ifence_array_ptr() and array_ptr_mask() Dan Williams
@ 2018-01-13 18:17 ` Dan Williams
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 7/9] x86: introduce __uaccess_begin_nospec and ASM_IFENCE Dan Williams
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, kernel-hardening, Catalin Marinas, x86, Will Deacon,
	Russell King, Ingo Molnar, gregkh, H. Peter Anvin, tglx,
	torvalds, akpm, alan

'__array_ptr' is proposed as a generic mechanism to mitigate
Spectre-variant-1 attacks, i.e. attacks that bypass memory bounds
checks via speculative execution. The '__array_ptr' implementation
appears safe for current-generation cpus across multiple architectures.

In comparison, 'ifence_array_ptr' uses a hard / architectural 'ifence'
approach to preclude the possibility of speculative execution past the
bounds check. However, it is not the default, given a concern for
avoiding instruction-execution barriers in potential fast paths.

Based on an original implementation by Linus Torvalds, tweaked to remove
speculative flows by Alexei Starovoitov, and tweaked again by Linus to
introduce an x86 assembly implementation for the mask generation.
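
As a usage sketch (editorial; patch 9 converts __fcheck_files along
these lines), a bounds-checked array read becomes:

	int *elem = array_ptr(array, idx, ARRAY_SIZE(array));

	if (elem)
		val = *elem;	/* out-of-bounds idx yields NULL, even speculatively */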

Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Co-developed-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/arm/Kconfig       |    1 +
 arch/arm64/Kconfig     |    1 +
 arch/x86/Kconfig       |    3 ++
 include/linux/nospec.h |   92 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.nospec  |   46 ++++++++++++++++++++++++
 kernel/Makefile        |    1 +
 kernel/nospec.c        |   52 +++++++++++++++++++++++++++
 lib/Kconfig            |    3 ++
 8 files changed, 199 insertions(+)
 create mode 100644 include/linux/nospec.h
 create mode 100644 kernel/Kconfig.nospec
 create mode 100644 kernel/nospec.c

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 51c8df561077..fd4789ec8cac 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
 	select ARCH_HAS_DEBUG_VIRTUAL
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
+	select ARCH_HAS_IFENCE
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
 	select ARCH_HAS_STRICT_MODULE_RWX if MMU
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e1414f..22765c4b6986 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -16,6 +16,7 @@ config ARM64
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_IFENCE
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4fc98c50378..68698289c83c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV			if X86_64
+	select ARCH_HAS_IFENCE
 	select ARCH_HAS_PMEM_API		if X86_64
 	# Causing hangs/crashes, see the commit that added this change for details.
 	select ARCH_HAS_REFCOUNT
@@ -442,6 +443,8 @@ config INTEL_RDT
 
 	  Say N if unsure.
 
+source "kernel/Kconfig.nospec"
+
 if X86_32
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
diff --git a/include/linux/nospec.h b/include/linux/nospec.h
new file mode 100644
index 000000000000..f6e7ba7a7344
--- /dev/null
+++ b/include/linux/nospec.h
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#ifndef __NOSPEC_H__
+#define __NOSPEC_H__
+
+#include <linux/jump_label.h>
+#include <asm/barrier.h>
+
+/*
+ * If idx is negative or idx >= sz then the sign bit of
+ * (idx | (sz - 1 - idx)) is set, so the mask evaluates to zero. When
+ * the mask is zero the bounds check failed and __array_ptr returns NULL.
+ */
+#ifndef array_ptr_mask
+#define array_ptr_mask(idx, sz)						\
+({									\
+	unsigned long mask;						\
+	unsigned long _i = (idx);					\
+	unsigned long _s = (sz);					\
+									\
+	mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1);	\
+	mask;								\
+})
+#endif
+
+/**
+ * __array_ptr - Generate a pointer to an array element, ensuring
+ * the pointer resolves to NULL under speculation when out of bounds.
+ *
+ * @base: the base of the array
+ * @idx: the index of the element, must be less than LONG_MAX
+ * @sz: the number of elements in the array, must be less than LONG_MAX
+ *
+ * If @idx falls in the interval [0, @sz), returns the pointer to
+ * @base[@idx], otherwise returns NULL.
+ */
+#define __array_ptr(base, idx, sz)					\
+({									\
+	union { typeof(*(base)) *_ptr; unsigned long _bit; } __u;	\
+	typeof(*(base)) *_arr = (base);					\
+	unsigned long _i = (idx);					\
+	unsigned long _mask = array_ptr_mask(_i, (sz));			\
+									\
+	__u._ptr = _arr + (_i & _mask);					\
+	__u._bit &= _mask;						\
+	__u._ptr;							\
+})
+
+#if defined(CONFIG_ARCH_HAS_IFENCE) && !defined(ifence_array_ptr)
+#error Arch claims ARCH_HAS_IFENCE, but does not implement ifence_array_ptr
+#endif
+
+#ifdef CONFIG_SPECTRE1_DYNAMIC
+#ifndef HAVE_JUMP_LABEL
+#error Compiler lacks asm-goto, can generate unsafe code
+#endif
+
+#ifdef CONFIG_SPECTRE1_IFENCE
+DECLARE_STATIC_KEY_TRUE(nospec_key);
+#else
+DECLARE_STATIC_KEY_FALSE(nospec_key);
+#endif
+
+/*
+ * The expectation is that no compiler or cpu will mishandle __array_ptr
+ * leading to problematic speculative execution. Bypass the ifence
+ * based implementation by default.
+ */
+#define array_ptr(base, idx, sz)				\
+({								\
+	typeof(*(base)) *__ret;					\
+								\
+	if (static_branch_unlikely(&nospec_key))		\
+		__ret = ifence_array_ptr(base, idx, sz);	\
+	else							\
+		__ret = __array_ptr(base, idx, sz);		\
+	__ret;							\
+})
+#else /* CONFIG_SPECTRE1_DYNAMIC */
+/*
+ * If jump labels are disabled we hard code either ifence_array_ptr or
+ * array_ptr based on the config choice
+ */
+#ifdef CONFIG_SPECTRE1_IFENCE
+#define array_ptr ifence_array_ptr
+#else
+/* fallback to __array_ptr by default */
+#define array_ptr __array_ptr
+#endif
+#endif /* CONFIG_SPECTRE1_DYNAMIC */
+#endif /* __NOSPEC_H__ */
diff --git a/kernel/Kconfig.nospec b/kernel/Kconfig.nospec
new file mode 100644
index 000000000000..33e34a87d067
--- /dev/null
+++ b/kernel/Kconfig.nospec
@@ -0,0 +1,46 @@
+# SPDX-License-Identifier: GPL-2.0
+
+menu "Speculative execution past bounds check"
+	depends on ARCH_HAS_IFENCE
+
+choice
+	prompt "Speculative execution past bounds check"
+	default SPECTRE1_MASK
+	help
+	  Select the default mechanism for guarding against kernel
+	  memory leaks via speculative execution past a bounds check
+	  (Spectre variant1). This choice determines the contents of
+	  the array_ptr() helper. Note that vulnerable code paths need
+	  to be instrumented with this helper to be protected.
+
+config SPECTRE1_MASK
+	bool "mask"
+	help
+	  Provide an array_ptr() implementation that arranges for only
+	  safe speculative flows to be exposed to the compiler/cpu. It
+	  is preferred over "ifence" since it arranges for problematic
+	  speculation to be disabled without need of an instruction
+	  barrier.
+
+config SPECTRE1_IFENCE
+	bool "ifence"
+	depends on ARCH_HAS_IFENCE
+	help
+	  Provide an array_ptr() implementation that is specified by the
+	  cpu architecture to barrier all speculative execution.  Unless
+	  you have specific knowledge of the "mask" approach being
+	  unsuitable with a given compiler/cpu, select "mask".
+
+endchoice
+
+config SPECTRE1_DYNAMIC
+	bool "Support dynamic switching of speculative execution mitigation"
+	depends on ARCH_HAS_IFENCE
+	depends on JUMP_LABEL
+	help
+	  For architectures that support the 'ifence' mitigation, allow
+	  dynamic switching between it and the 'mask' approach. This supports
+	  evaluation or emergency switching.
+
+	  If unsure, say Y.
+endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index 172d151d429c..d5269be9d58a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_SPECTRE1_DYNAMIC) += nospec.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/nospec.c b/kernel/nospec.c
new file mode 100644
index 000000000000..992de957216d
--- /dev/null
+++ b/kernel/nospec.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+#include <linux/module.h>
+#include <linux/compiler.h>
+#include <linux/jump_label.h>
+#include <linux/moduleparam.h>
+
+enum {
+	F_IFENCE,
+};
+
+#ifdef CONFIG_SPECTRE1_IFENCE
+static unsigned long nospec_flag = 1 << F_IFENCE;
+DEFINE_STATIC_KEY_TRUE(nospec_key);
+#else
+static unsigned long nospec_flag;
+DEFINE_STATIC_KEY_FALSE(nospec_key);
+#endif
+
+EXPORT_SYMBOL(nospec_key);
+
+static int param_set_nospec(const char *val, const struct kernel_param *kp)
+{
+	unsigned long *flags = kp->arg;
+
+	if (strcmp(val, "ifence") == 0 || strcmp(val, "ifence\n") == 0) {
+		if (!test_and_set_bit(F_IFENCE, flags))
+			static_key_enable(&nospec_key.key);
+		return 0;
+	} else if (strcmp(val, "mask") == 0 || strcmp(val, "mask\n") == 0) {
+		if (test_and_clear_bit(F_IFENCE, flags))
+			static_key_disable(&nospec_key.key);
+		return 0;
+	}
+	return -EINVAL;
+}
+
+static int param_get_nospec(char *buffer, const struct kernel_param *kp)
+{
+	unsigned long *flags = kp->arg;
+
+	return sprintf(buffer, "%s\n", test_bit(F_IFENCE, flags)
+			? "ifence" : "mask");
+}
+
+static struct kernel_param_ops nospec_param_ops = {
+	.set = param_set_nospec,
+	.get = param_get_nospec,
+};
+
+core_param_cb(spectre_v1, &nospec_param_ops, &nospec_flag, 0600);
+MODULE_PARM_DESC(spectre_v1, "Spectre-v1 mitigation: 'mask' (default) vs 'ifence'");
diff --git a/lib/Kconfig b/lib/Kconfig
index c5e84fbcb30b..3cc7e7a03781 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -570,6 +570,9 @@ config STACKDEPOT
 	bool
 	select STACKTRACE
 
+config ARCH_HAS_IFENCE
+	bool
+
 config SBITMAP
 	bool
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 7/9] x86: introduce __uaccess_begin_nospec and ASM_IFENCE
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
                   ` (5 preceding siblings ...)
  2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 6/9] asm/nospec: mask speculative execution flows Dan Williams
@ 2018-01-13 18:18 ` Dan Williams
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Dan Williams
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 9/9] vfs, fdtable: prevent bounds-check bypass via speculative execution Dan Williams
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, Tom Lendacky, Andi Kleen, Kees Cook,
	kernel-hardening, gregkh, x86, Ingo Molnar, Al Viro,
	H. Peter Anvin, tglx, torvalds, akpm, alan

For 'get_user' paths, do not allow the kernel to speculate on the value
of a user-controlled pointer. In addition to the 'stac' instruction for
Supervisor Mode Access Protection, an 'ifence' causes the 'access_ok'
result to resolve in the pipeline before the cpu might take any
speculative action on the pointer value.
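
Sketched editorially (mirroring the definition in the diff below), the
intended pattern is:

	if (access_ok(VERIFY_READ, ptr, size)) {
		__uaccess_begin_nospec();	/* stac + ifence */
		/* user accesses that must not begin early */
		__uaccess_end();
	}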

Since this is a major kernel interface that deals with user controlled
data, the '__uaccess_begin_nospec' mechanism will prevent speculative
execution past an 'access_ok' permission check. While speculative
execution past 'access_ok' is not enough to lead to a kernel memory
leak, it is a necessary precondition.

To be clear, '__uaccess_begin_nospec' and ASM_IFENCE are not addressing
any known issues with 'get_user'; they are addressing a class of
potential problems that could be near 'get_user' usages. In other words,
these helpers are for hygiene, not clinical fixes.

There are no functional changes in this patch.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andi Kleen <ak@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/smap.h    |    4 ++++
 arch/x86/include/asm/uaccess.h |   10 ++++++++++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/smap.h b/arch/x86/include/asm/smap.h
index db333300bd4b..0b59707e0b46 100644
--- a/arch/x86/include/asm/smap.h
+++ b/arch/x86/include/asm/smap.h
@@ -40,6 +40,10 @@
 
 #endif /* CONFIG_X86_SMAP */
 
+#define ASM_IFENCE \
+	ALTERNATIVE_2 "", "mfence", X86_FEATURE_MFENCE_RDTSC, \
+			  "lfence", X86_FEATURE_LFENCE_RDTSC
+
 #else /* __ASSEMBLY__ */
 
 #include <asm/alternative.h>
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 574dff4d2913..a31fd4fc6483 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -124,6 +124,11 @@ extern int __get_user_bad(void);
 
 #define __uaccess_begin() stac()
 #define __uaccess_end()   clac()
+#define __uaccess_begin_nospec()	\
+({					\
+	stac();				\
+	ifence();			\
+})
 
 /*
  * This is a type: either unsigned long, if the argument fits into
@@ -487,6 +492,11 @@ struct __large_struct { unsigned long buf[100]; };
 	__uaccess_begin();						\
 	barrier();
 
+#define uaccess_try_nospec do {						\
+	current->thread.uaccess_err = 0;				\
+	__uaccess_begin_nospec();					\
+	barrier();
+
 #define uaccess_catch(err)						\
 	__uaccess_end();						\
 	(err) |= (current->thread.uaccess_err ? -EFAULT : 0);		\

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
                   ` (6 preceding siblings ...)
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 7/9] x86: introduce __uaccess_begin_nospec and ASM_IFENCE Dan Williams
@ 2018-01-13 18:18 ` Dan Williams
  2018-01-13 19:05   ` [kernel-hardening] " Linus Torvalds
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 9/9] vfs, fdtable: prevent bounds-check bypass via speculative execution Dan Williams
  8 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, Andi Kleen, Kees Cook, kernel-hardening, gregkh, x86,
	Ingo Molnar, Al Viro, H. Peter Anvin, tglx, torvalds, akpm, alan

Quoting Linus:

    I do think that it would be a good idea to very expressly document
    the fact that it's not that the user access itself is unsafe. I do
    agree that things like "get_user()" want to be protected, but not
    because of any direct bugs or problems with get_user() and friends,
    but simply because get_user() is an excellent source of a pointer
    that is obviously controlled from a potentially attacking user
    space. So it's a prime candidate for then finding _subsequent_
    accesses that can then be used to perturb the cache.

Note that '__copy_user_ll' is also called in the 'put_user' case, but
there is currently no indication that put_user in general deserves the
same hygiene as 'get_user'.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: x86@kernel.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/uaccess.h    |    6 +++---
 arch/x86/include/asm/uaccess_32.h |    6 +++---
 arch/x86/include/asm/uaccess_64.h |   12 ++++++------
 arch/x86/lib/copy_user_64.S       |    3 +++
 arch/x86/lib/getuser.S            |    5 +++++
 arch/x86/lib/usercopy_32.c        |    8 ++++----
 6 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a31fd4fc6483..82c73f064e76 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -450,7 +450,7 @@ do {									\
 ({									\
 	int __gu_err;							\
 	__inttype(*(ptr)) __gu_val;					\
-	__uaccess_begin();						\
+	__uaccess_begin_nospec();					\
 	__get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);	\
 	__uaccess_end();						\
 	(x) = (__force __typeof__(*(ptr)))__gu_val;			\
@@ -558,7 +558,7 @@ struct __large_struct { unsigned long buf[100]; };
  *	get_user_ex(...);
  * } get_user_catch(err)
  */
-#define get_user_try		uaccess_try
+#define get_user_try		uaccess_try_nospec
 #define get_user_catch(err)	uaccess_catch(err)
 
 #define get_user_ex(x, ptr)	do {					\
@@ -592,7 +592,7 @@ extern void __cmpxchg_wrong_size(void)
 	__typeof__(ptr) __uval = (uval);				\
 	__typeof__(*(ptr)) __old = (old);				\
 	__typeof__(*(ptr)) __new = (new);				\
-	__uaccess_begin();						\
+	__uaccess_begin_nospec();					\
 	switch (size) {							\
 	case 1:								\
 	{								\
diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h
index 72950401b223..ba2dc1930630 100644
--- a/arch/x86/include/asm/uaccess_32.h
+++ b/arch/x86/include/asm/uaccess_32.h
@@ -29,21 +29,21 @@ raw_copy_from_user(void *to, const void __user *from, unsigned long n)
 		switch (n) {
 		case 1:
 			ret = 0;
-			__uaccess_begin();
+			__uaccess_begin_nospec();
 			__get_user_asm_nozero(*(u8 *)to, from, ret,
 					      "b", "b", "=q", 1);
 			__uaccess_end();
 			return ret;
 		case 2:
 			ret = 0;
-			__uaccess_begin();
+			__uaccess_begin_nospec();
 			__get_user_asm_nozero(*(u16 *)to, from, ret,
 					      "w", "w", "=r", 2);
 			__uaccess_end();
 			return ret;
 		case 4:
 			ret = 0;
-			__uaccess_begin();
+			__uaccess_begin_nospec();
 			__get_user_asm_nozero(*(u32 *)to, from, ret,
 					      "l", "k", "=r", 4);
 			__uaccess_end();
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index f07ef3c575db..62546b3a398e 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -55,31 +55,31 @@ raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
 		return copy_user_generic(dst, (__force void *)src, size);
 	switch (size) {
 	case 1:
-		__uaccess_begin();
+		__uaccess_begin_nospec();
 		__get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src,
 			      ret, "b", "b", "=q", 1);
 		__uaccess_end();
 		return ret;
 	case 2:
-		__uaccess_begin();
+		__uaccess_begin_nospec();
 		__get_user_asm_nozero(*(u16 *)dst, (u16 __user *)src,
 			      ret, "w", "w", "=r", 2);
 		__uaccess_end();
 		return ret;
 	case 4:
-		__uaccess_begin();
+		__uaccess_begin_nospec();
 		__get_user_asm_nozero(*(u32 *)dst, (u32 __user *)src,
 			      ret, "l", "k", "=r", 4);
 		__uaccess_end();
 		return ret;
 	case 8:
-		__uaccess_begin();
+		__uaccess_begin_nospec();
 		__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
 			      ret, "q", "", "=r", 8);
 		__uaccess_end();
 		return ret;
 	case 10:
-		__uaccess_begin();
+		__uaccess_begin_nospec();
 		__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
 			       ret, "q", "", "=r", 10);
 		if (likely(!ret))
@@ -89,7 +89,7 @@ raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
 		__uaccess_end();
 		return ret;
 	case 16:
-		__uaccess_begin();
+		__uaccess_begin_nospec();
 		__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
 			       ret, "q", "", "=r", 16);
 		if (likely(!ret))
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 020f75cc8cf6..2429ca38dee6 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -31,6 +31,7 @@
  */
 ENTRY(copy_user_generic_unrolled)
 	ASM_STAC
+	ASM_IFENCE
 	cmpl $8,%edx
 	jb 20f		/* less then 8 bytes, go to byte copy loop */
 	ALIGN_DESTINATION
@@ -135,6 +136,7 @@ EXPORT_SYMBOL(copy_user_generic_unrolled)
  */
 ENTRY(copy_user_generic_string)
 	ASM_STAC
+	ASM_IFENCE
 	cmpl $8,%edx
 	jb 2f		/* less than 8 bytes, go to byte copy loop */
 	ALIGN_DESTINATION
@@ -175,6 +177,7 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
+	ASM_IFENCE
 	cmpl $64,%edx
 	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
 	movl %edx,%ecx
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index c97d935a29e8..85f400b8ee7c 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -41,6 +41,7 @@ ENTRY(__get_user_1)
 	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
+	ASM_IFENCE
 1:	movzbl (%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
@@ -55,6 +56,7 @@ ENTRY(__get_user_2)
 	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
+	ASM_IFENCE
 2:	movzwl -1(%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
@@ -69,6 +71,7 @@ ENTRY(__get_user_4)
 	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
+	ASM_IFENCE
 3:	movl -3(%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
@@ -84,6 +87,7 @@ ENTRY(__get_user_8)
 	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user
 	ASM_STAC
+	ASM_IFENCE
 4:	movq -7(%_ASM_AX),%rdx
 	xor %eax,%eax
 	ASM_CLAC
@@ -95,6 +99,7 @@ ENTRY(__get_user_8)
 	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
 	jae bad_get_user_8
 	ASM_STAC
+	ASM_IFENCE
 4:	movl -7(%_ASM_AX),%edx
 5:	movl -3(%_ASM_AX),%ecx
 	xor %eax,%eax
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1b377f734e64..7add8ba06887 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -331,12 +331,12 @@ do {									\
 
 unsigned long __copy_user_ll(void *to, const void *from, unsigned long n)
 {
-	stac();
+	__uaccess_begin_nospec();
 	if (movsl_is_ok(to, from, n))
 		__copy_user(to, from, n);
 	else
 		n = __copy_user_intel(to, from, n);
-	clac();
+	__uaccess_end();
 	return n;
 }
 EXPORT_SYMBOL(__copy_user_ll);
@@ -344,7 +344,7 @@ EXPORT_SYMBOL(__copy_user_ll);
 unsigned long __copy_from_user_ll_nocache_nozero(void *to, const void __user *from,
 					unsigned long n)
 {
-	stac();
+	__uaccess_begin_nospec();
 #ifdef CONFIG_X86_INTEL_USERCOPY
 	if (n > 64 && static_cpu_has(X86_FEATURE_XMM2))
 		n = __copy_user_intel_nocache(to, from, n);
@@ -353,7 +353,7 @@ unsigned long __copy_from_user_ll_nocache_nozero(void *to, const void __user *fr
 #else
 	__copy_user(to, from, n);
 #endif
-	clac();
+	__uaccess_end();
 	return n;
 }
 EXPORT_SYMBOL(__copy_from_user_ll_nocache_nozero);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] [PATCH v3 9/9] vfs, fdtable: prevent bounds-check bypass via speculative execution
  2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
                   ` (7 preceding siblings ...)
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Dan Williams
@ 2018-01-13 18:18 ` Dan Williams
  8 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-13 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, kernel-hardening, gregkh, Al Viro, tglx, torvalds,
	akpm, alan

As expected, static analysis reports that 'fd' is a user-controlled
value that is used as a data dependency to read from the 'fdt->fd'
array. In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid 'file *' returned from __fcheck_files.
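
For illustration (an editorial sketch of the gadget shape this guards
against, assuming an attacker-controlled 'fd'), the pre-patch code:

	if (fd < fdt->max_fds)		/* branch may be mispredicted */
		return rcu_dereference_raw(fdt->fd[fd]); /* speculative OOB read */

...can speculatively load beyond the 'fd' array, and a later
dereference of the loaded value can leave a measurable cache footprint.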

Cc: Al Viro <viro@zeniv.linux.org.uk>
Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fdtable.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..9731f1a255db 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -10,6 +10,7 @@
 #include <linux/compiler.h>
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
+#include <linux/nospec.h>
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
@@ -81,9 +82,11 @@ struct dentry;
 static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+	struct file __rcu **fdp;
 
-	if (fd < fdt->max_fds)
-		return rcu_dereference_raw(fdt->fd[fd]);
+	fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
+	if (fdp)
+		return rcu_dereference_raw(*fdp);
 	return NULL;
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Dan Williams
@ 2018-01-13 19:05   ` Linus Torvalds
  2018-01-13 19:33     ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-01-13 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Alan Cox

On Sat, Jan 13, 2018 at 10:18 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index c97d935a29e8..85f400b8ee7c 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -41,6 +41,7 @@ ENTRY(__get_user_1)
>         cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
>         jae bad_get_user
>         ASM_STAC
> +       ASM_IFENCE
>  1:     movzbl (%_ASM_AX),%edx
>         xor %eax,%eax
>         ASM_CLAC

So I really would like to know from somebody (preferably somebody with
real microarchitectural knowledge) just how expensive that "lfence"
ends up being.

Because we could just generate the masking of the address from the
exact same condition code that we already generate, the "lfence"
really can be replaced by just two ALU instructions instead:

   diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
   index c97d935a29e8..4c378b485399 100644
   --- a/arch/x86/lib/getuser.S
   +++ b/arch/x86/lib/getuser.S
   @@ -40,6 +40,8 @@ ENTRY(__get_user_1)
           mov PER_CPU_VAR(current_task), %_ASM_DX
           cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
           jae bad_get_user
   +       sbb %_ASM_DX,%_ASM_DX
   +       and %_ASM_DX,%_ASM_AX
           ASM_STAC
    1:     movzbl (%_ASM_AX),%edx
           xor %eax,%eax

which looks like it should have a fairly low maximum overhead (ok, the
above is totally untested, maybe I got the condition the wrong way
around _again_).
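
(Editorial note on the idiom: after the cmp, the carry flag is set
exactly when the address is below the limit, so the two added
instructions compute a branch-free mask:

	sbb %_ASM_DX,%_ASM_DX	# DX = -CF: ~0 if in bounds, else 0
	and %_ASM_DX,%_ASM_AX	# in bounds: AX kept; else AX zeroed

and the pointer gets clamped even when the 'jae' is mispredicted.)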

I _know_ that lfence is expensive as hell on P4, for example.

Yes, yes, "sbb" is often more expensive than most ALU instructions,
and Agner Fog says it has a 10-cycle latency on Prescott (which is
outrageous, but being one or two cycles more due to the flags
generation is normal). So the sbb/and may certainly add a few cycles
to the critical path, but on Prescott "lfence" is *50* cycles
according to those same tables by Agner Fog.

Is there anybody who is willing to say one way or another wrt the
"sbb/and" sequence vs "lfence".

                       Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-13 19:05   ` [kernel-hardening] " Linus Torvalds
@ 2018-01-13 19:33     ` Linus Torvalds
  2018-01-13 20:22       ` Eric W. Biederman
  2018-01-16 22:23       ` Dan Williams
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2018-01-13 19:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Alan Cox

On Sat, Jan 13, 2018 at 11:05 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I _know_ that lfence is expensive as hell on P4, for example.
>
> Yes, yes, "sbb" is often more expensive than most ALU instructions,
> and Agner Fog says it has a 10-cycle latency on Prescott (which is
> outrageous, but being one or two cycles more due to the flags
> generation is normal). So the sbb/and may certainly add a few cycles
> to the critical path, but on Prescott "lfence" is *50* cycles
> according to those same tables by Agner Fog.

Side note: I don't think P4 is really relevant for a performance
discussion, I was just giving it as an example where we do know actual
cycles.

I'm much more interested in modern Intel big-core CPU's, and just
wondering whether somebody could ask an architect.

Because I _suspect_ the answer from a CPU architect would be: "Christ,
the sbb/and sequence is much better because it doesn't have any extra
serialization", but maybe I'm wrong, and people feel that lfence is
particularly easy to do right without any real downside.

                Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-13 19:33     ` Linus Torvalds
@ 2018-01-13 20:22       ` Eric W. Biederman
  2018-01-16 22:23       ` Dan Williams
  1 sibling, 0 replies; 39+ messages in thread
From: Eric W. Biederman @ 2018-01-13 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Linux Kernel Mailing List, linux-arch, Andi Kleen,
	Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, Al Viro, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Alan Cox

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Jan 13, 2018 at 11:05 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I _know_ that lfence is expensive as hell on P4, for example.
>>
>> Yes, yes, "sbb" is often more expensive than most ALU instructions,
>> and Agner Fog says it has a 10-cycle latency on Prescott (which is
>> outrageous, but being one or two cycles more due to the flags
>> generation is normal). So the sbb/and may certainly add a few cycles
>> to the critical path, but on Prescott "lfence" is *50* cycles
>> according to those same tables by Agner Fog.
>
> Side note: I don't think P4 is really relevant for a performance
> discussion, I was just giving it as an example where we do know actual
> cycles.
>
> I'm much more interested in modern Intel big-core CPU's, and just
> wondering whether somebody could ask an architect.
>
> Because I _suspect_ the answer from a CPU architect would be: "Christ,
> the sbb/and sequence is much better because it doesn't have any extra
> serialization", but maybe I'm wrong, and people feel that lfence is
> particularly easy to do right without any real downside.

As an educated observer it seems like the cmpq/sbb/and sequence is an
improvement because it moves the dependency from one end of the cpu
pipeline to another.  If any cpu does data speculation on anything other
than branch targets, that sequence could still be susceptible to
speculation.

From the AMD patches it appears that lfence is becoming a serializing
instruction which in principal is much more expensive.

Also, do we have alternatives for these sequences so that, if we run on
an in-order atom (or 386 or 486) where speculation does not occur, we
can avoid the cost?

Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-13 19:33     ` Linus Torvalds
  2018-01-13 20:22       ` Eric W. Biederman
@ 2018-01-16 22:23       ` Dan Williams
       [not found]         ` <CA+55aFxAFG5czVmCyhYMyHmXLNJ7pcXxWzusjZvLRh_qTGHj6Q@mail.gmail.com>
  2018-01-17  4:30         ` Dan Williams
  1 sibling, 2 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-16 22:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Alan Cox

On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jan 13, 2018 at 11:05 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I _know_ that lfence is expensive as hell on P4, for example.
>>
>> Yes, yes, "sbb" is often more expensive than most ALU instructions,
>> and Agner Fog says it has a 10-cycle latency on Prescott (which is
>> outrageous, but being one or two cycles more due to the flags
>> generation is normal). So the sbb/and may certainly add a few cycles
>> to the critical path, but on Prescott "lfence" is *50* cycles
>> according to those same tables by Agner Fog.
>
> Side note: I don't think P4 is really relevant for a performance
> discussion, I was just giving it as an example where we do know actual
> cycles.
>
> I'm much more interested in modern Intel big-core CPU's, and just
> wondering whether somebody could ask an architect.
>
> Because I _suspect_ the answer from a CPU architect would be: "Christ,
> the sbb/and sequence is much better because it doesn't have any extra
> serialization", but maybe I'm wrong, and people feel that lfence is
> particularly easy to do right without any real downside.
>

From the last paragraph of this guidance:

https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

...I read that as Linux can constrain speculation with 'and; sbb'
instead of 'lfence', and any future changes will be handled like any
new cpu enabling.

To your specific question of the relative costs, sbb is
architecturally cheaper, so let's go with that approach.

For this '__uaccess_begin_nospec' patch set, at a minimum the kernel
needs a helper that can be easily grep'd when/if it needs changing in
a future kernel. It also indicates that the command line approach to
dynamically switch the mitigation mechanism is over-engineering.

That said, for get_user specifically, can we do something even
cheaper? Dave H. reminds me that any valid user pointer that gets past
the address limit check will have the high bit clear. So instead of
calculating a mask, just unconditionally clear the high bit. It seems
worst case userspace can speculatively leak something that's already
in its address space.

I'll respin this set along those lines, and drop the ifence bits.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
       [not found]         ` <CA+55aFxAFG5czVmCyhYMyHmXLNJ7pcXxWzusjZvLRh_qTGHj6Q@mail.gmail.com>
@ 2018-01-16 22:41           ` Linus Torvalds
  2018-01-17 14:17             ` Alan Cox
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-01-16 22:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Alan Cox


On Jan 16, 2018 14:23, "Dan Williams" <dan.j.williams@intel.com> wrote:


That said, for get_user specifically, can we do something even
cheaper? Dave H. reminds me that any valid user pointer that gets past
the address limit check will have the high bit clear. So instead of
calculating a mask, just unconditionally clear the high bit. It seems
worst case userspace can speculatively leak something that's already
in its address space.


That's not at all true.

The address may be a kernel address. That's the whole point of 'set_fs()'.

That's why we compare against the address limit variable, not against some
constant number.

     Linus


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-16 22:23       ` Dan Williams
       [not found]         ` <CA+55aFxAFG5czVmCyhYMyHmXLNJ7pcXxWzusjZvLRh_qTGHj6Q@mail.gmail.com>
@ 2018-01-17  4:30         ` Dan Williams
  2018-01-17  6:28           ` Al Viro
  2018-01-17 19:16           ` Linus Torvalds
  1 sibling, 2 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-17  4:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Alan Cox

On Tue, Jan 16, 2018 at 2:23 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
[..]
> I'll respin this set along those lines, and drop the ifence bits.

So now I'm not so sure. Yes, get_user_{1,2,4,8} can mask the pointer
with the address limit result, but this doesn't work for the
access_ok() + __get_user() case. We can either change the access_ok()
calling convention to return a properly masked pointer to be used in
subsequent calls to __get_user(), or go with lfence on every
__get_user call. There seem to be several drivers that open code
copy_from_user() with __get_user loops, so the 'fence every
__get_user' approach might have noticeable overhead. On the other hand
the access_ok conversion, while it could be scripted with coccinelle,
is ~300 sites (VERIFY_READ), if you're concerned about having
something small to merge for 4.15.
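
Schematically, the pattern at issue looks like this (an illustrative
sketch, not any particular driver):

    unsigned int i;
    u32 val;

    /* one range check up front... */
    if (access_ok(VERIFY_READ, uptr, n * sizeof(u32))) {
            for (i = 0; i < n; i++) {
                    /* ...then unchecked accesses: if the access_ok()
                     * branch is mispredicted, every __get_user() here
                     * would need its own lfence to be safe */
                    if (__get_user(val, (u32 __user *)uptr + i))
                            break;
                    /* ... hand val to the hardware ... */
            }
    }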

I think the access_ok() conversion to return a speculation sanitized
pointer or NULL is the way to go unless I'm missing something simpler.
Other ideas?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17  4:30         ` Dan Williams
@ 2018-01-17  6:28           ` Al Viro
  2018-01-17  6:50             ` Dan Williams
  2018-01-17 19:16           ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Al Viro @ 2018-01-17  6:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Alan Cox

On Tue, Jan 16, 2018 at 08:30:17PM -0800, Dan Williams wrote:
> On Tue, Jan 16, 2018 at 2:23 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
> [..]
> > I'll respin this set along those lines, and drop the ifence bits.
> 
> So now I'm not so sure. Yes, get_user_{1,2,4,8} can mask the pointer
> with the address limit result, but this doesn't work for the
> access_ok() + __get_user() case. We can either change the access_ok()
> calling convention to return a properly masked pointer to be used in
> subsequent calls to __get_user(), or go with lfence on every
> __get_user call. There seem to be several drivers that open code
> copy_from_user() with __get_user loops, so the 'fence every
> __get_user' approach might have noticeable overhead. On the other hand
> the access_ok conversion, while it could be scripted with coccinelle,
> is ~300 sites (VERIFY_READ), if you're concerned about having
> something small to merge for 4.15.
> 
> I think the access_ok() conversion to return a speculation sanitized
> pointer or NULL is the way to go unless I'm missing something simpler.
> Other ideas?

What masked pointer?  access_ok() exists for other architectures as well,
and the fewer callers remain outside of arch/*, the better.

Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
it cares about the overhead - recent x86 boxen will have slowdown from
hell on stac()/clac() pairs.  Anything like that on a hot path is already
deep in trouble and needs to be found and fixed.  What drivers would those
be?  We don't have that many __get_user() users left outside of arch/*
anymore...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17  6:28           ` Al Viro
@ 2018-01-17  6:50             ` Dan Williams
  2018-01-17 10:07               ` [kernel-hardening] " David Laight
  2018-01-17 18:12               ` [kernel-hardening] " Dan Williams
  0 siblings, 2 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-17  6:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Alan Cox

On Tue, Jan 16, 2018 at 10:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jan 16, 2018 at 08:30:17PM -0800, Dan Williams wrote:
>> On Tue, Jan 16, 2018 at 2:23 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> > On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
>> [..]
>> > I'll respin this set along those lines, and drop the ifence bits.
>>
>> So now I'm not so sure. Yes, get_user_{1,2,4,8} can mask the pointer
>> with the address limit result, but this doesn't work for the
>> access_ok() + __get_user() case. We can either change the access_ok()
>> calling convention to return a properly masked pointer to be used in
>> subsequent calls to __get_user(), or go with lfence on every
>> __get_user call. There seem to be several drivers that open code
>> copy_from_user() with __get_user loops, so the 'fence every
>> __get_user' approach might have noticeable overhead. On the other hand
>> the access_ok conversion, while it could be scripted with coccinelle,
>> is ~300 sites (VERIFY_READ), if you're concerned about having
>> something small to merge for 4.15.
>>
>> I think the access_ok() conversion to return a speculation sanitized
>> pointer or NULL is the way to go unless I'm missing something simpler.
>> Other ideas?
>
> What masked pointer?

The pointer value that is masked under speculation.

   diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
   index c97d935a29e8..4c378b485399 100644
   --- a/arch/x86/lib/getuser.S
   +++ b/arch/x86/lib/getuser.S
   @@ -40,6 +40,8 @@ ENTRY(__get_user_1)
           mov PER_CPU_VAR(current_task), %_ASM_DX
           cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
           jae bad_get_user
   +       sbb %_ASM_DX,%_ASM_DX
   +       and %_ASM_DX,%_ASM_AX
           ASM_STAC
    1:     movzbl (%_ASM_AX),%edx
           xor %eax,%eax

...i.e. %_ASM_AX is guaranteed to be zero if userspace tries to cause
speculation with an address above the limit. The proposal is to make
access_ok() do that same masking so we never speculate on pointers from
userspace aimed at kernel memory.
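
In C terms, given a user 'addr' and the task's 'limit', the same
data-dependent mask could be written roughly as below (a sketch; the
kernel wants the inline asm because a compiler is free to turn the
comparison back into a predicted branch):

    /* mask is all-ones when addr is below the limit, all-zeroes
     * otherwise; and-ing with it leaves in-range pointers intact and
     * turns out-of-range ones into NULL even under speculation,
     * because the result is a data dependency, not a branch. */
    unsigned long mask = 0UL - (addr < limit);
    addr &= mask;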

> access_ok() exists for other architectures as well,

I'd modify those as well...

> and the fewer callers remain outside of arch/*, the better.
>
> Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
> it cares about the overhead - recent x86 boxen will have slowdown from
> hell on stac()/clac() pairs.  Anything like that on a hot path is already
> deep in trouble and needs to be found and fixed.  What drivers would those
> be?

So I took a closer look and the pattern is not copy_from_user, it's
more like __get_user + write-to-hardware loops. If the performance is
already expected to be bad for those, then perhaps an lfence on each
loop iteration won't be much worse. It's still a waste, because the
lfence is only needed once after the access_ok.

> We don't have that many __get_user() users left outside of arch/*
> anymore...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] RE: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17  6:50             ` Dan Williams
@ 2018-01-17 10:07               ` David Laight
  2018-01-17 18:12               ` [kernel-hardening] " Dan Williams
  1 sibling, 0 replies; 39+ messages in thread
From: David Laight @ 2018-01-17 10:07 UTC (permalink / raw)
  To: 'Dan Williams', Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Alan Cox

From: Dan Williams
> Sent: 17 January 2018 06:50
...
> > Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
> > it cares about the overhead - recent x86 boxen will have slowdown from
> > hell on stac()/clac() pairs.  Anything like that on a hot path is already
> > deep in trouble and needs to be found and fixed.  What drivers would those
> > be?
> 
> So I took a closer look and the pattern is not copy_from_user, it's
> more like __get_user + write-to-hardware loops. If the performance is
> already expected to be bad for those, then perhaps an lfence on each
> loop iteration won't be much worse. It's still a waste, because the
> lfence is only needed once after the access_ok.

Performance of PCIe writes isn't that bad (since they are posted);
adding a synchronising instruction to __get_user() could easily
be noticeable.

Unfortunately you can't use copy_from_user() with a PCIe-mapped
target memory address (or even memcpy_toio()) because on x86
the copy is aliased to memcpy() and that uses 'rep movsb',
which has to do single-byte transfers because the address is
uncached.
(This is really bad for PCIe reads.)

	David



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-16 22:41           ` Linus Torvalds
@ 2018-01-17 14:17             ` Alan Cox
  2018-01-17 18:52               ` Al Viro
  2018-01-17 19:26               ` [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Linus Torvalds
  0 siblings, 2 replies; 39+ messages in thread
From: Alan Cox @ 2018-01-17 14:17 UTC (permalink / raw)
  To: Linus Torvalds, Dan Williams
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton

On Tue, 2018-01-16 at 14:41 -0800, Linus Torvalds wrote:
> 
> 
> On Jan 16, 2018 14:23, "Dan Williams" <dan.j.williams@intel.com>
> wrote:
> > That said, for get_user specifically, can we do something even
> > cheaper? Dave H. reminds me that any valid user pointer that gets
> > past the address limit check will have the high bit clear. So
> > instead of calculating a mask, just unconditionally clear the high
> > bit. It seems worst case userspace can speculatively leak something
> > that's already in its address space.
> 
> That's not at all true.
> 
> The address may be a kernel address. That's the whole point of
> 'set_fs()'.

Can we kill off the remaining users of set_fs() ?

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17  6:50             ` Dan Williams
  2018-01-17 10:07               ` [kernel-hardening] " David Laight
@ 2018-01-17 18:12               ` Dan Williams
  1 sibling, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-17 18:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Alan Cox

On Tue, Jan 16, 2018 at 10:50 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Jan 16, 2018 at 10:28 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
[..]
>> Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
>> it cares about the overhead - recent x86 boxen will have slowdown from
>> hell on stac()/clac() pairs.  Anything like that on a hot path is already
>> deep in trouble and needs to be found and fixed.  What drivers would those
>> be?
>
> So I took a closer look and the pattern is not copy_from_user, it's
> more like __get_user + write-to-hardware loops. If the performance is
> already expected to be bad for those, then perhaps an lfence on each
> loop iteration won't be much worse. It's still a waste, because the
> lfence is only needed once after the access_ok.
>
>> We don't have that many __get_user() users left outside of arch/*
>> anymore...

Given the concern of having something easy to backport first, I think
we should start with lfence in __uaccess_begin(). Any deeper changes
to the access_ok() + __get_user calling convention can build on top of
that baseline.
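
Concretely, that baseline amounts to something like the sketch below
(ordering and exact form illustrative, not the final patch):

    /* __uaccess_begin() plus a speculation barrier: the lfence keeps
     * later loads from executing speculatively past a mispredicted
     * access_ok() check. */
    #define __uaccess_begin_nospec()                \
    ({                                              \
            asm volatile("lfence" ::: "memory");    \
            stac();                                 \
    })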

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 14:17             ` Alan Cox
@ 2018-01-17 18:52               ` Al Viro
  2018-01-17 19:54                 ` Dan Williams
  2018-01-18  3:06                 ` [kernel-hardening] [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc Al Viro
  2018-01-17 19:26               ` [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Linus Torvalds
  1 sibling, 2 replies; 39+ messages in thread
From: Al Viro @ 2018-01-17 18:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
> On Tue, 2018-01-16 at 14:41 -0800, Linus Torvalds wrote:
> > 
> > 
> > On Jan 16, 2018 14:23, "Dan Williams" <dan.j.williams@intel.com>
> > wrote:
> > > That said, for get_user specifically, can we do something even
> > > cheaper? Dave H. reminds me that any valid user pointer that gets
> > > past the address limit check will have the high bit clear. So
> > > instead of calculating a mask, just unconditionally clear the high
> > > bit. It seems worst case userspace can speculatively leak something
> > > that's already in its address space.
> > 
> > That's not at all true.
> > 
> > The address may be a kernel address. That's the whole point of
> > 'set_fs()'.
> 
> Can we kill off the remaining users of set_fs() ?

Not easily.  They tend to come in pairs (the usual pattern is get_fs(),
save the result, set_fs(something), do work, set_fs(saved)), and
counting each such area as a single instance we have (in my tree right
now) 121 locations.  Some could be killed (and will eventually be -
the number of set_fs()/access_ok()/__{get,put}_user()/__copy_...()
call sites has been seriously decreasing over the last couple of
years), but some are really hard to kill off.

How, for example, would you deal with this one:

/*
 * Receive a datagram from a UDP socket.
 */
static int svc_udp_recvfrom(struct svc_rqst *rqstp)
{
        struct svc_sock *svsk =
                container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
        struct svc_serv *serv = svsk->sk_xprt.xpt_server;
        struct sk_buff  *skb;
        union {
                struct cmsghdr  hdr;
                long            all[SVC_PKTINFO_SPACE / sizeof(long)];
        } buffer;
        struct cmsghdr *cmh = &buffer.hdr;
        struct msghdr msg = {
                .msg_name = svc_addr(rqstp),
                .msg_control = cmh,
                .msg_controllen = sizeof(buffer),
                .msg_flags = MSG_DONTWAIT,
        };
...
        err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
                             0, 0, MSG_PEEK | MSG_DONTWAIT);

With kernel_recvmsg() (and in my tree the above is its last surviving caller)
being

int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
                   struct kvec *vec, size_t num, size_t size, int flags)
{
        mm_segment_t oldfs = get_fs();
        int result;

        iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
        set_fs(KERNEL_DS);
        result = sock_recvmsg(sock, msg, flags);
        set_fs(oldfs);
        return result;
}
EXPORT_SYMBOL(kernel_recvmsg);

We are asking for recvmsg() with zero data length; what we really want is
->msg_control.  And _that_ is why we need that set_fs() - we want the damn
thing to go into a local variable.

But note that filling ->msg_control will happen in put_cmsg(), called
from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(),
called from udp_recvmsg(), called from sock_recvmsg_nosec(), called
from sock_recvmsg().  Or in another path in case of IPv6.
Sure, we can arrange for propagation of that all the way down those
call chains.  My preference would be to try and mark that (one and
only) case in ->msg_flags, so that put_cmsg() would be able to
check.  ___sys_recvmsg() sets that as
        msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
so we ought to be free to use any bit other than those two.  Since
put_cmsg() already checks ->msg_flags, that shouldn't add too much
overhead.  But then we'll need to do something to prevent speculative
execution straying down that way, won't we?  I'm not saying it can't
be done, but quite a few of the remaining call sites will take
serious work.

	Incidentally, what about copy_to_iter() and friends?  They
check iov_iter flavour and go either into the "copy to kernel buffer"
or "copy to userland" paths.  Do we need to deal with mispredictions
there?  We are calling a bunch of those on read()...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17  4:30         ` Dan Williams
  2018-01-17  6:28           ` Al Viro
@ 2018-01-17 19:16           ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2018-01-17 19:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, Al Viro, H. Peter Anvin, Thomas Gleixner,
	Andrew Morton, Alan Cox

On Tue, Jan 16, 2018 at 8:30 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> I think the access_ok() conversion to return a speculation sanitized
> pointer or NULL is the way to go unless I'm missing something simpler.

No, that's way too big of a conversion.

Just make get_user() and friends (that currently use ASM_STAC) use the
address masking.

The people who use uaccess_begin() can use the lfence there.

Basically, the rule is trivial: find all 'stac' users, and use address
masking if those users already integrate the limit check, and lfence
they don't.

                Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 14:17             ` Alan Cox
  2018-01-17 18:52               ` Al Viro
@ 2018-01-17 19:26               ` Linus Torvalds
  2018-01-17 20:01                 ` Eric Dumazet
  2018-01-18 16:38                 ` Christoph Hellwig
  1 sibling, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2018-01-17 19:26 UTC (permalink / raw)
  To: Alan Cox, Eric Dumazet
  Cc: Dan Williams, Linux Kernel Mailing List, linux-arch, Andi Kleen,
	Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, Al Viro, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 6:17 AM, Alan Cox <alan@linux.intel.com> wrote:
>
> Can we kill off the remaining users of set_fs() ?

I would love to, but it's not going to happen short-term. If ever.

Some could be removed today: the code in arch/x86/net/bpf_jit_comp.c
seems to be literally the ramblings of a diseased mind. There's no
reason for the set_fs(), there's no reason for the
flush_icache_range() (it's a no-op on x86 anyway), and the smp_wmb()
looks bogus too.

I have no idea how that braindamage happened, but I assume it got
copied from some broken source.

But there are about ~100 set_fs() calls in generic code, and some of
those really are pretty fundamental. Doing things like "kernel_read()"
without set_fs() is basically impossible.

We've had set_fs() since the beginning. The naming is obviously very
historical. We have it for a very good reason, and I don't really see
that reason going away.

So realistically, we want to _minimize_ set_fs(), and we might want to
make sure that it's done only in limited settings (it might, for
example, be a good idea and a realistic goal to make sure that drivers
and modules can't do it, and use proper helper functions like that
"read_kernel()").

But getting rid of the concept entirely? Doesn't seem likely.

             Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 18:52               ` Al Viro
@ 2018-01-17 19:54                 ` Dan Williams
  2018-01-17 20:05                   ` Al Viro
  2018-01-18  3:06                 ` [kernel-hardening] [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc Al Viro
  1 sibling, 1 reply; 39+ messages in thread
From: Dan Williams @ 2018-01-17 19:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Cox, Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 10:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
[..]
>         Incidentally, what about copy_to_iter() and friends?  They
> check iov_iter flavour and go either into the "copy to kernel buffer"
> or "copy to userland" paths.  Do we need to deal with mispredictions
> there?  We are calling a bunch of those on read()...
>

Those should be protected by the conversion of __uaccess_begin to
__uaccess_begin_nospec that includes the lfence.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 19:26               ` [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Linus Torvalds
@ 2018-01-17 20:01                 ` Eric Dumazet
  2018-01-18 16:38                 ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2018-01-17 20:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Dan Williams, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, Al Viro, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 11:26 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 17, 2018 at 6:17 AM, Alan Cox <alan@linux.intel.com> wrote:
>>
>> Can we kill off the remaining users of set_fs() ?
>
> I would love to, but it's not going to happen short-term. If ever.
>
> Some could be removed today: the code in arch/x86/net/bpf_jit_comp.c
> seems to be literally the ramblings of a diseased mind. There's no
> reason for the set_fs(), there's no reason for the
> flush_icache_range() (it's a no-op on x86 anyway), and the smp_wmb()
> looks bogus too.
>
> I have no idea how that braindamage happened, but I assume it got
> copied from some broken source.

At the time commit 0a14842f5a3c0e88a1e59fac5c3025db39721f74 went in,
this was the first JIT implementation for BPF, so maybe I wanted to keep
other arches from forgetting to flush the icache: you can bet that my
implementation served as a reference for other JITs.

At that time, various calls to flush_icache_range() were definitely in arch/x86
or kernel/module.c.

(I believe I must have copied the code from kernel/module.c, but of that
I am not sure.)

>
> But there are about ~100 set_fs() calls in generic code, and some of
> those really are pretty fundamental. Doing things like "kernel_read()"
> without set_fs() is basically impossible.
>
> We've had set_fs() since the beginning. The naming is obviously very
> historical. We have it for a very good reason, and I don't really see
> that reason going away.
>
> So realistically, we want to _minimize_ set_fs(), and we might want to
> make sure that it's done only in limited settings (it might, for
> example, be a good idea and a realistic goal to make sure that drivers
> and modules can't do it, and use proper helper functions like that
> "read_kernel()").
>
> But getting rid of the concept entirely? Doesn't seem likely.
>
>              Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 19:54                 ` Dan Williams
@ 2018-01-17 20:05                   ` Al Viro
  2018-01-17 20:14                     ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Al Viro @ 2018-01-17 20:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alan Cox, Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 11:54:12AM -0800, Dan Williams wrote:
> On Wed, Jan 17, 2018 at 10:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
> [..]
> >         Incidentally, what about copy_to_iter() and friends?  They
> > check iov_iter flavour and go either into the "copy to kernel buffer"
> > or "copy to userland" paths.  Do we need to deal with mispredictions
> > there?  We are calling a bunch of those on read()...
> >
> 
> Those should be protected by the conversion of __uaccess_begin to
> __uaccess_begin_nospec that includes the lfence.

Huh?  What the hell does it do to speculative execution of the "memcpy
those suckers" branch?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 20:05                   ` Al Viro
@ 2018-01-17 20:14                     ` Dan Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2018-01-17 20:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Alan Cox, Linus Torvalds, Linux Kernel Mailing List, linux-arch,
	Andi Kleen, Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 12:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 17, 2018 at 11:54:12AM -0800, Dan Williams wrote:
>> On Wed, Jan 17, 2018 at 10:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> > On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
>> [..]
>> >         Incidentally, what about copy_to_iter() and friends?  They
>> > check iov_iter flavour and go either into the "copy to kernel buffer"
>> > or "copy to userland" paths.  Do we need to deal with mispredictions
>> > there?  We are calling a bunch of those on read()...
>> >
>>
>> Those should be protected by the conversion of __uaccess_begin to
>> __uaccess_begin_nospec that includes the lfence.
>
> Huh?  What the hell does it do to speculative execution of "memcpy those
> suckers" branch?

'raw_copy_from_user' is changed to use '__uaccess_begin_nospec' instead
of plain '__uaccess_begin', the only difference between those being that
the former includes an lfence. So with this sequence:

        if (access_ok(VERIFY_READ, from, n)) {
                kasan_check_write(to, n);
                n = raw_copy_from_user(to, from, n);
        }
        return n;

...'from' is guaranteed to be within the address limit with respect to
speculative execution, or otherwise never de-referenced.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-17 18:52               ` Al Viro
  2018-01-17 19:54                 ` Dan Williams
@ 2018-01-18  3:06                 ` Al Viro
  2018-01-18  3:16                   ` [kernel-hardening] " Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Al Viro @ 2018-01-18  3:06 UTC (permalink / raw)
  To: netdev
  Cc: Linus Torvalds, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton, Alan Cox,
	David Miller

On Wed, Jan 17, 2018 at 06:52:32PM +0000, Al Viro wrote:

[use of set_fs() by sunrpc]
> We are asking for recvmsg() with zero data length; what we really want is
> ->msg_control.  And _that_ is why we need that set_fs() - we want the damn
> thing to go into a local variable.
> 
> But note that filling ->msg_control will happen in put_cmsg(), called
> from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(),
> called from udp_recvmsg(), called from sock_recvmsg_nosec(), called
> from sock_recvmsg().  Or in another path in case of IPv6.
> Sure, we can arrange for propagation of that all the way down those
> call chains.  My preference would be to try and mark that (one and
> only) case in ->msg_flags, so that put_cmsg() would be able to
> check.  ___sys_recvmsg() sets that as
>         msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
> so we ought to be free to use any bit other than those two.  Since
> put_cmsg() already checks ->msg_flags, that shouldn't add too much
> overhead.

Folks, does anybody have objections to the following:

Get rid of kernel_recvmsg() (and thus set_fs()) use in sunrpc

In net/sunrpc/svcsock.c:svc_udp_recvfrom() we want to get
IP_PKTINFO; currently we stash the address of a local variable
into ->msg_control (which normally contains a userland pointer
in recipients) and issue a zero-length ->recvmsg() under
set_fs(KERNEL_DS).

Similar to the way put_cmsg() handles the 32bit case on biarch
targets, introduce a flag telling put_cmsg() to treat
->msg_control as a kernel pointer, using memcpy instead of
copy_to_user().  That allows us to avoid the use of kernel_recvmsg()
with its set_fs().

All places that might have a non-NULL ->msg_control either pass the
msghdr only to ->sendmsg() and its ilk, or have ->msg_flags
sanitized before passing the msghdr anywhere.  IOW, there is no
way for the new flag to reach put_cmsg() in the mainline kernel,
and after this change it will only be seen in the sunrpc case.

Incidentally, all other kernel_recvmsg() users are very easy to
convert to sock_recvmsg(), so that would allow us to kill
kernel_recvmsg() off...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a8c60c..60947da9c806 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -298,6 +298,7 @@ struct ucred {
 #else
 #define MSG_CMSG_COMPAT	0		/* We never have 32 bit fixups */
 #endif
+#define MSG_CMSG_KERNEL	0x10000000
 
 
 /* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a441748..1b73b682e441 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -237,10 +237,17 @@ int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data)
 	cmhdr.cmsg_len = cmlen;
 
 	err = -EFAULT;
-	if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
-		goto out;
-	if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr)))
-		goto out;
+	if (unlikely(MSG_CMSG_KERNEL & msg->msg_flags)) {
+		struct cmsghdr *p = msg->msg_control;
+		memcpy(p, &cmhdr, sizeof cmhdr);
+		memcpy(CMSG_DATA(p), data, cmlen - sizeof(struct cmsghdr));
+	} else {
+		if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
+			goto out;
+		if (copy_to_user(CMSG_DATA(cm), data,
+				 cmlen - sizeof(struct cmsghdr)))
+			goto out;
+	}
 	cmlen = CMSG_SPACE(len);
 	if (msg->msg_controllen < cmlen)
 		cmlen = msg->msg_controllen;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5570719e4787..774904f67c93 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -545,7 +545,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 		.msg_name = svc_addr(rqstp),
 		.msg_control = cmh,
 		.msg_controllen = sizeof(buffer),
-		.msg_flags = MSG_DONTWAIT,
+		.msg_flags = MSG_DONTWAIT | MSG_CMSG_KERNEL,
 	};
 	size_t len;
 	int err;
@@ -565,8 +565,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 
 	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	skb = NULL;
-	err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
-			     0, 0, MSG_PEEK | MSG_DONTWAIT);
+	err = sock_recvmsg(svsk->sk_sock, &msg, MSG_PEEK | MSG_DONTWAIT);
 	if (err >= 0)
 		skb = skb_recv_udp(svsk->sk_sk, 0, 1, &err);
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18  3:06                 ` [kernel-hardening] [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc Al Viro
@ 2018-01-18  3:16                   ` Linus Torvalds
  2018-01-18  4:43                     ` Al Viro
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-01-18  3:16 UTC (permalink / raw)
  To: Al Viro
  Cc: Network Development, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton, Alan Cox,
	David Miller

On Wed, Jan 17, 2018 at 7:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Similar to the way put_cmsg() handles the 32bit case on biarch
> targets, introduce a flag telling put_cmsg() to treat
> ->msg_control as a kernel pointer, using memcpy instead of
> copy_to_user().  That allows us to avoid the use of kernel_recvmsg()
> with its set_fs().

If this is the only case that kernel_recvmsg() exists for, then by all
means, that patch certainly looks like a good thing.

                 Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18  3:16                   ` [kernel-hardening] " Linus Torvalds
@ 2018-01-18  4:43                     ` Al Viro
  2018-01-18 16:29                       ` Christoph Hellwig
  2018-01-18 19:31                       ` Al Viro
  0 siblings, 2 replies; 39+ messages in thread
From: Al Viro @ 2018-01-18  4:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Network Development, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton, Alan Cox,
	David Miller

On Wed, Jan 17, 2018 at 07:16:02PM -0800, Linus Torvalds wrote:
> On Wed, Jan 17, 2018 at 7:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Similar to the way put_cmsg() handles the 32bit case on biarch
> > targets, introduce a flag telling put_cmsg() to treat
> > ->msg_control as a kernel pointer, using memcpy instead of
> > copy_to_user().  That allows us to avoid the use of kernel_recvmsg()
> > with its set_fs().
> 
> If this is the only case that kernel_recvmsg() exists for, then by all
> means, that patch certainly looks like a good thing.

In -next that's the only remaining caller.  kernel_recvmsg() is
{
        mm_segment_t oldfs = get_fs();
        int result;

        iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
        set_fs(KERNEL_DS);
        result = sock_recvmsg(sock, msg, flags);
        set_fs(oldfs);
        return result;
}
and 
        iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
        result = sock_recvmsg(sock, msg, flags);
works just fine for copying the data - that gets handled by copy_to_iter()
and copy_page_to_iter().  Those don't care about set_fs(); the trouble with
the sunrpc call site is that we want to fill the kernel object pointed to
by msg_control.  That gets copied by put_cmsg().

	We could turn ->msg_control/->msg_controllen into another
iov_iter, but seeing that we never do scatter-gather for those
IMO that would be a massive overkill.  A flag controlling whether
->msg_control is kernel or userland pointer would do, especially
since we already have a flag for "do we want a native or compat
layout for cmsg" in there.

	That's the only caller we need it for, but that thing looks cheap
enough.  Obviously it needs to pass testing, including the "is it too ugly
to live as far as Davem is concerned" test, though...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18  4:43                     ` Al Viro
@ 2018-01-18 16:29                       ` Christoph Hellwig
  2018-01-18 17:10                         ` Al Viro
  2018-01-18 19:31                       ` Al Viro
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2018-01-18 16:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Network Development, Dan Williams,
	Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Andrew Morton,
	Alan Cox, David Miller

> 	We could turn ->msg_control/->msg_controllen into another
> iov_iter, but seeing that we never do scatter-gather for those
> IMO that would be a massive overkill.  A flag controlling whether
> ->msg_control is kernel or userland pointer would do, especially
> since we already have a flag for "do we want a native or compat
> layout for cmsg" in there.

While your current hack seems like a nice short-term improvement,
I think we need an iov_iter or iov_iter-light there in the long run.
Same for ioctl, so that we can pass properly typed kernel or user
buffers through without all these set_fs hacks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-17 19:26               ` [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Linus Torvalds
  2018-01-17 20:01                 ` Eric Dumazet
@ 2018-01-18 16:38                 ` Christoph Hellwig
  2018-01-18 16:49                   ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2018-01-18 16:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Eric Dumazet, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	Al Viro, H. Peter Anvin, Thomas Gleixner, Andrew Morton

On Wed, Jan 17, 2018 at 11:26:08AM -0800, Linus Torvalds wrote:
> But there are about ~100 set_fs() calls in generic code, and some of
> those really are pretty fundamental. Doing things like "kernel_read()"
> without set_fs() is basically impossible.

Not if we move to iov_iter or iov_iter-like behavior for all reads
and writes.  There is an issue with how vectored writes are handled
in plain read/write vs read_iter/write_iter, inherited from readv/writev,
but that's nothing a flag, or a second set of methods with the
same signature, couldn't solve.

But there are more annoying things, most notably in-kernel ioctl
calls.  We have quite a few of them, and while many are just utterly
stupid and can be replaced with direct function calls or new methods
(I've done quite a few conversions of those), some might be left.
Something like iov_iter might be the answer again.

Then we have things like probe_kernel_read/probe_kernel_write which
abuse the exception handling in get/put user.  But with a little
arch helper we don't strictly need get_fs/set_fs for that.
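
For context, probe_kernel_read() today is essentially the shape below,
which is why it drags set_fs() along (a sketch of the existing pattern,
lightly abridged):

    long probe_kernel_read(void *dst, const void *src, size_t size)
    {
            long ret;
            mm_segment_t old_fs = get_fs();

            /* widen the limit so a kernel address passes the check,
             * then rely on the uaccess exception table to absorb a
             * fault instead of oopsing */
            set_fs(KERNEL_DS);
            pagefault_disable();
            ret = __copy_from_user_inatomic(dst,
                            (__force const void __user *)src, size);
            pagefault_enable();
            set_fs(old_fs);

            return ret ? -EFAULT : 0;
    }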

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-18 16:38                 ` Christoph Hellwig
@ 2018-01-18 16:49                   ` Linus Torvalds
  2018-01-18 18:12                     ` Al Viro
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-01-18 16:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Cox, Eric Dumazet, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	Al Viro, H. Peter Anvin, Thomas Gleixner, Andrew Morton

On Thu, Jan 18, 2018 at 8:38 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> > But there are about ~100 set_fs() calls in generic code, and some of
> > those really are pretty fundamental. Doing things like "kernel_read()"
> > without set_fs() is basically impossible.
>
> Not if we move to iov_iter or iov_iter-like behavior for all reads
> and writes.

Not going to happen. Really. We have how many tens of thousands of
drivers again, all doing "copy_to_user()".

And the fact is, set_fs() really isn't even a problem for this. Never
really has been.  From a security standpoint, it would actually be
*much* worse if we made those ten thousand places do "if (kernel_flag)
memcpy() else copy_to_user()".

We've had some issues with set_fs() being abused in interesting ways.
But "kernel_read()" and friends is not it.

               Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18 16:29                       ` Christoph Hellwig
@ 2018-01-18 17:10                         ` Al Viro
  0 siblings, 0 replies; 39+ messages in thread
From: Al Viro @ 2018-01-18 17:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Network Development, Dan Williams,
	Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Andrew Morton,
	Alan Cox, David Miller

On Thu, Jan 18, 2018 at 08:29:57AM -0800, Christoph Hellwig wrote:
> > 	We could turn ->msg_control/->msg_controllen into another
> > iov_iter, but seeing that we never do scatter-gather for those
> > IMO that would be a massive overkill.  A flag controlling whether
> > ->msg_control is kernel or userland pointer would do, especially
> > since we already have a flag for "do we want a native or compat
> > layout for cmsg" in there.
> 
> While your current hack seems like a nice short term improvement
> I think we need an iov_iter or iov_iter-light there in the long run.

For one caller in the entire history of the kernel?

> Same for ioctl so that we can pass properly typed kernel or user
> buffers through without all these set_fs hacks.

Umm...  Most of the PITA with ioctls is due to compat ones being
reformatted for native and fed under set_fs().  I actually have
a series dealing with most such places for net ioctls.  Sure,
there's also ioctl_by_bdev(), but for those we might be better
off exposing the things like ->get_last_session() and its ilk to
filesystems that want to deal with cdroms...

It's kernel_setsockopt() that is the real PITA...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
  2018-01-18 16:49                   ` Linus Torvalds
@ 2018-01-18 18:12                     ` Al Viro
  0 siblings, 0 replies; 39+ messages in thread
From: Al Viro @ 2018-01-18 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Alan Cox, Eric Dumazet, Dan Williams,
	Linux Kernel Mailing List, linux-arch, Andi Kleen, Kees Cook,
	kernel-hardening, Greg Kroah-Hartman, the arch/x86 maintainers,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Andrew Morton

On Thu, Jan 18, 2018 at 08:49:31AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:38 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > But there are about ~100 set_fs() calls in generic code, and some of
> > > those really are pretty fundamental. Doing things like "kernel_read()"
> > > without set_fs() is basically impossible.
> >
> > Not if we move to iov_iter or iov_iter-like behavior for all reads
> > and writes.
> 
> Not going to happen. Really. We have how many tens of thousands of
> drivers again, all doing "copy_to_user()".

The real PITA is not even that (we could provide helpers making
conversion from ->read() to ->read_iter() easy for char devices,
etc.).  It's the semantics of readv(2).  Consider e.g. readv()
from /dev/rtc, with an iovec array consisting of 10 segments, each
int-sized.  Right now we'll get rtc_dev_read() called in a loop,
once for each segment.  A single read() into a 40-byte buffer will
fill one long and bugger off.  Converting it to ->read_iter()
will mean more than just "use copy_to_iter() instead of put_user()" -
that would be trivial.  But to preserve the current behaviour
we would need something like
	total = 0;
	while (iov_iter_count(to)) {
		count = iov_iter_single_seg_count(to);
		/* current body of rtc_dev_read(), with
		 * put_user() replaced with copy_to_iter()
		 */
		....
		if (res < 0) {
			if (!total)
				total = res;
			break;
		}
		total += res;
		if (res != count)
			break;
	}
	return total;
in that thing.  And similar boilerplates would be needed in
a whole lot of drivers.  Sure, they are individually trivial,
but they would add up to shitloads of code to get wrong.

These are basically all ->read() instances that ignore *ppos
and, unlike pipes, do not attempt to fill as much of the
buffer as possible.  We do have quite a few of those.

Some ->read() instances can be easily converted to ->read_iter()
and will, in fact, be better off that way.  We had patches of
that sort and I'm certain that we still have such places left.
Ditto for ->write() and ->write_iter().  But those are not
even close to being the majority.  Sorry.

We could, in principle, do something like

dev_rtc_read_iter(iocb, to)
{
	return loop_read_iter(iocb, to, modified_dev_rtc_read);
}
with modified_dev_rtc_read() being the result of minimal
conversion (put_user() and copy_to_user() replaced with use
of copy_to_iter()).  It would be less boilerplate that way,
but I really don't see big benefits from doing that.

On the write side, things are just as unpleasant - we have
a lot of ->write() instances that parse the beginning of the
buffer, ignore the rest and report that everything got written.
writev() on those will parse each iovec segment, ignoring the
junk at the end of each.  Again, that loop needs to go somewhere.
And we do have a bunch of "parse the buffer and do some action
once" ->write() instances - in char devices, debugfs, etc.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18  4:43                     ` Al Viro
  2018-01-18 16:29                       ` Christoph Hellwig
@ 2018-01-18 19:31                       ` Al Viro
  2018-01-18 20:33                         ` Al Viro
  2018-01-19  3:27                         ` Al Viro
  1 sibling, 2 replies; 39+ messages in thread
From: Al Viro @ 2018-01-18 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Network Development, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton, Alan Cox,
	David Miller

On Thu, Jan 18, 2018 at 04:43:02AM +0000, Al Viro wrote:

> 	We could turn ->msg_control/->msg_controllen into another
> iov_iter, but seeing that we never do scatter-gather for those
> IMO that would be a massive overkill.  A flag controlling whether
> ->msg_control is kernel or userland pointer would do, especially
> since we already have a flag for "do we want a native or compat
> layout for cmsg" in there.
> 
> 	That's the only caller we need it for, but that thing looks cheap
> enough.  Obviously needs to pass testing, including "is it too ugly to
> live as far as Davem is concerned" test, though...

	BTW, there's another series of set_fs-removal patches in
net ioctls; still needs review, though.  With that one we would be down
to 11 instances in the entire net/*:

* SO_RCVTIMEO/SO_SNDTIMEO handling in compat [sg]etsockopt()
* passing SIOC{ADD,DEL}TUNNEL down (ipmr_del_tunnel(),ipmr_new_tunnel(),
  addrconf_set_dstaddr())
* SIOCGSTAMP/SIOCGSTAMPNS in compat ioctls
* SIOCADDRT/SIOCDELRT in compat ioctls
* kernel_[gs]etsockopt()
* ipv6_renew_options_kern()

I don't know if all of that stuff can be realistically done without set_fs().
kernel_setsockopt(), in particular, is unpleasant...

The patches need review and testing, obviously; I'll post them in followups,
the entire series (on top of net/master) is in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.net-ioctl

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18 19:31                       ` Al Viro
@ 2018-01-18 20:33                         ` Al Viro
  2018-01-19  3:27                         ` Al Viro
  1 sibling, 0 replies; 39+ messages in thread
From: Al Viro @ 2018-01-18 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Network Development, Dan Williams, Linux Kernel Mailing List,
	linux-arch, Andi Kleen, Kees Cook, kernel-hardening,
	Greg Kroah-Hartman, the arch/x86 maintainers, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton, Alan Cox,
	David Miller

On Thu, Jan 18, 2018 at 07:31:56PM +0000, Al Viro wrote:

> * SO_RCVTIMEO/SO_SNDTIMEO handling in compat [sg]etsockopt()
> * passing SIOC{ADD,DEL}TUNNEL down (ipmr_del_tunnel(),ipmr_new_tunnel(),
>   addrconf_set_dstaddr())
> * SIOCGSTAMP/SIOCGSTAMPNS in compat ioctls
> * SIOCADDRT/SIOCDELRT in compat ioctls
> * kernel_[gs]etsockopt()
> * ipv6_renew_options_kern()
> 
> I don't know if all of that stuff can be realistically done without set_fs().
> kernel_setsockopt(), in particular, is unpleasant...

Speaking of weird indirect calls: in net/packet/af_packet.c:packet_ioctl() we
have this:
#ifdef CONFIG_INET
        case SIOCADDRT:
        case SIOCDELRT:
        case SIOCDARP:
        case SIOCGARP:
        case SIOCSARP:
        case SIOCGIFADDR:
        case SIOCSIFADDR:
        case SIOCGIFBRDADDR:
        case SIOCSIFBRDADDR:
        case SIOCGIFNETMASK:
        case SIOCSIFNETMASK:
        case SIOCGIFDSTADDR:
        case SIOCSIFDSTADDR:
        case SIOCSIFFLAGS:
                return inet_dgram_ops.ioctl(sock, cmd, arg);
#endif

That's inet_ioctl(sock, cmd, arg) disguised by an indirect call.
AFAICS, that line dates back to 2.1.89; back then inet_dgram_ops had
been exported and inet_ioctl() had been static.  When SCTP went in,
they exported inet_ioctl() rather than playing that kind of
game.

Is there anything subtle I'm missing here that would make it
wrong to replace that with an explicit call of inet_ioctl()?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [kernel-hardening] Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
  2018-01-18 19:31                       ` Al Viro
  2018-01-18 20:33                         ` Al Viro
@ 2018-01-19  3:27                         ` Al Viro
  1 sibling, 0 replies; 39+ messages in thread
From: Al Viro @ 2018-01-19  3:27 UTC (permalink / raw)
  To: Network Development
  Cc: Dan Williams, Linux Kernel Mailing List, linux-arch, Andi Kleen,
	Kees Cook, kernel-hardening, Greg Kroah-Hartman,
	the arch/x86 maintainers, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton, Alan Cox, David Miller,
	Linus Torvalds, Ralf Baechle

On Thu, Jan 18, 2018 at 07:31:56PM +0000, Al Viro wrote:

> * SIOCADDRT/SIOCDELRT in compat ioctls

To bring back a question I'd asked back in October - what do
we do about SIOC...RT compat?

To recap:
	* AF_INET sockets expect struct rtentry; it differs
between 32bit and 64bit, so routing_ioctl() in net/socket.c
is called from compat_sock_ioctl_trans() and does the right
thing.  All proto_ops instances with .family = PF_INET (and
only they) have inet_ioctl() as ->ioctl(), and end up with
ip_rt_ioctl() called for native ones.  Three of those have
->compat_ioctl() set to inet_compat_ioctl(), the rest have
it NULL.  In any case, inet_compat_ioctl() ignores those,
leaving them to compat_sock_ioctl_trans() to pick up.
	* for AF_INET6 the situation is similar, except that
they use struct in6_rtmsg.  Compat is also dealt with in
routing_ioctl().  inet6_ioctl() for all such proto_ops
(and only those), ipv6_route_ioctl() is what ends up
handling the native ones.  No ->compat_ioctl() in any
of those.
	* AF_PACKET sockets expect struct rtentry and
actually bounce the native calls to inet_ioctl().  No
->compat_ioctl() there, but routing_ioctl() in net/socket.c
does the right thing.
	* AF_APPLETALK sockets expect struct rtentry.
Native handled in atrtr_ioctl(); there is ->compat_ioctl(),
but it ignores those ioctls, so we go through the conversion
in net/socket.c.  Also happens to work correctly.

	* ax25, ipx, netrom, rose and x25 use structures
of their own, and those structures have identical layouts on
32bit and 64bit.  x25 has ->compat_ioctl() that does the
right thing (bounces to native), the rest either have
->compat_ioctl() ignoring those ioctls (ipx) or do not
have ->compat_ioctl() at all.  That ends up with generic
code picking those and buggering them up - routing_ioctl()
assumes that we want either in6_rtmsg (ipv6) or rtentry
(everything else).  Unfortunately, in the case of these
protocols we should just leave the suckers alone.
	Back then Ralf verified that the bug exists
and said he'd put together a fix.  Looks like that fix
has fallen through the cracks, though.

	* all other protocols fail those; usually with
ENOTTY, except for AF_QIPCRTR, which fails with EINVAL.
Either way, compat is not an issue.
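
	For reference, the layout problem in a nutshell -
struct rtentry as in include/uapi/linux/route.h, with the
ABI-dependent members marked:

struct rtentry {
	unsigned long	rt_pad1;	/* 4 vs 8 bytes */
	struct sockaddr	rt_dst;		/* target address */
	struct sockaddr	rt_gateway;	/* gateway addr (RT_GATEWAY) */
	struct sockaddr	rt_genmask;	/* target network mask (IP) */
	unsigned short	rt_flags;
	short		rt_pad2;
	unsigned long	rt_pad3;	/* 4 vs 8 bytes */
	void		*rt_pad4;	/* 4 vs 8 bytes */
	short		rt_metric;	/* +1 for binary compatibility! */
	char __user	*rt_dev;	/* 4 vs 8 bytes */
	unsigned long	rt_mtu;		/* 4 vs 8 bytes */
	unsigned long	rt_window;	/* 4 vs 8 bytes */
	unsigned short	rt_irtt;	/* Initial RTT */
};

A 32bit process hands us the variant with 4-byte longs and
pointers (net/socket.c keeps a matching rtentry32 with u32 in
those slots for routing_ioctl() to use), hence the field-by-field
conversion.  The ax25/ipx/netrom/rose/x25 structures presumably
get away without one because they avoid pointer- and long-sized
members.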

	Note that handling of SIOCADDRT on e.g. raw ipv4
sockets from a 32bit process is convoluted as hell.  The
call chain is
	compat_sys_ioctl()
		compat_sock_ioctl()
			inet_compat_ioctl()
				compat_raw_ioctl()
					=> -ENOIOCTLCMD, possibly
					    by way of ipmr_compat_ioctl()
			compat_sock_ioctl_trans()
				routing_ioctl() [conversion done here]
					sock_do_ioctl()
						inet_ioctl()
							ip_rt_ioctl()
A lot of those are method calls, BTW, and the overhead on those has
just grown...

Does anybody have objections to the following?

1) Somewhere in net/core (or net/compat.c, for that matter) add
int compat_get_rtentry(struct rtentry *r, struct rtentry32 __user *p);

2) In inet_compat_ioctl() recognize SIOC{ADD,DEL}RT and do
(a sketch follows the list):
		err = compat_get_rtentry(&r, (void __user *)arg);
		if (!err)
			err = ip_rt_ioctl(...);
		return err;

3) Add inet_compat_ioctl() as ->compat_ioctl in all PF_INET proto_ops.

4) Lift copyin from atrtr_ioctl() to atalk_ioctl(), teach
atalk_compat_ioctl() about these ioctls (using compat_get_rtentry()
and atrtr_ioctl(), that is).

5) Add ->compat_ioctl() to AF_PACKET, let it just call inet_compat_ioctl()
for those two.

6) Lift copyin from ipv6_route_ioctl() to inet6_ioctl().  
Add inet6_compat_ioctl() that would recognize those two, do compat copyin
and call ipv6_route_ioctl().  Make it ->compat_ioctl for all PF_INET6
proto_ops.

7) Tell compat_sock_ioctl_trans() to move these two into the "just call
sock_do_ioctl()" group of cases.  Or, with Ralf's fix, just remove these
two cases from compat_sock_ioctl_trans() completely.  Either way,
routing_ioctl() becomes dead code and can be removed.
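
	To make (1) and (2) concrete, a rough sketch - treat the
names and placement as placeholders; rtentry32 is assumed to match
the definition net/socket.c currently has for routing_ioctl(), and
ip_rt_ioctl() is assumed to take a kernel-space struct rtentry *
(i.e. with the copyin lifted out, as (6) does for the ipv6 side):

/* net/compat.c; wants <linux/compat.h> and <linux/route.h> */
int compat_get_rtentry(struct rtentry *r, struct rtentry32 __user *p)
{
	compat_uptr_t rtdev;

	memset(r, 0, sizeof(*r));
	/* rt_dst/rt_gateway/rt_genmask are adjacent in both layouts */
	if (copy_from_user(&r->rt_dst, &p->rt_dst,
			   3 * sizeof(struct sockaddr)) ||
	    get_user(r->rt_flags, &p->rt_flags) ||
	    get_user(r->rt_metric, &p->rt_metric) ||
	    get_user(r->rt_mtu, &p->rt_mtu) ||
	    get_user(r->rt_window, &p->rt_window) ||
	    get_user(r->rt_irtt, &p->rt_irtt) ||
	    get_user(rtdev, &p->rt_dev))
		return -EFAULT;
	r->rt_dev = compat_ptr(rtdev);
	return 0;
}

/* ... and in a switch (cmd) added to inet_compat_ioctl(), per (2);
 * sk is sock->sk, as already set up there:
 */
	case SIOCADDRT:
	case SIOCDELRT: {
		struct rtentry r;
		int err;

		err = compat_get_rtentry(&r, (void __user *)arg);
		if (!err)
			err = ip_rt_ioctl(sock_net(sk), cmd, &r);
		return err;
	}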


Thread overview: 39+ messages
2018-01-13 18:17 [kernel-hardening] [PATCH v3 0/9] core, x86: prevent bounds-check bypass via speculative execution Dan Williams
2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 1/9] Documentation: document array_ptr Dan Williams
2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 2/9] arm64: implement ifence_array_ptr() Dan Williams
2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 3/9] arm: " Dan Williams
2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 4/9] x86: implement ifence() Dan Williams
2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 5/9] x86: implement ifence_array_ptr() and array_ptr_mask() Dan Williams
2018-01-13 18:17 ` [kernel-hardening] [PATCH v3 6/9] asm/nospec: mask speculative execution flows Dan Williams
2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 7/9] x86: introduce __uaccess_begin_nospec and ASM_IFENCE Dan Williams
2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Dan Williams
2018-01-13 19:05   ` [kernel-hardening] " Linus Torvalds
2018-01-13 19:33     ` Linus Torvalds
2018-01-13 20:22       ` Eric W. Biederman
2018-01-16 22:23       ` Dan Williams
     [not found]         ` <CA+55aFxAFG5czVmCyhYMyHmXLNJ7pcXxWzusjZvLRh_qTGHj6Q@mail.gmail.com>
2018-01-16 22:41           ` Linus Torvalds
2018-01-17 14:17             ` Alan Cox
2018-01-17 18:52               ` Al Viro
2018-01-17 19:54                 ` Dan Williams
2018-01-17 20:05                   ` Al Viro
2018-01-17 20:14                     ` Dan Williams
2018-01-18  3:06                 ` [kernel-hardening] [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc Al Viro
2018-01-18  3:16                   ` [kernel-hardening] " Linus Torvalds
2018-01-18  4:43                     ` Al Viro
2018-01-18 16:29                       ` Christoph Hellwig
2018-01-18 17:10                         ` Al Viro
2018-01-18 19:31                       ` Al Viro
2018-01-18 20:33                         ` Al Viro
2018-01-19  3:27                         ` Al Viro
2018-01-17 19:26               ` [kernel-hardening] Re: [PATCH v3 8/9] x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths Linus Torvalds
2018-01-17 20:01                 ` Eric Dumazet
2018-01-18 16:38                 ` Christoph Hellwig
2018-01-18 16:49                   ` Linus Torvalds
2018-01-18 18:12                     ` Al Viro
2018-01-17  4:30         ` Dan Williams
2018-01-17  6:28           ` Al Viro
2018-01-17  6:50             ` Dan Williams
2018-01-17 10:07               ` [kernel-hardening] " David Laight
2018-01-17 18:12               ` [kernel-hardening] " Dan Williams
2018-01-17 19:16           ` Linus Torvalds
2018-01-13 18:18 ` [kernel-hardening] [PATCH v3 9/9] vfs, fdtable: prevent bounds-check bypass via speculative execution Dan Williams
