LKML Archive on lore.kernel.org
* [RFC v2] Introduce rare_write() infrastructure
@ 2017-03-29 18:15 Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 01/11] " Kees Cook
                   ` (11 more replies)
  0 siblings, 12 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

This is take 2 of an RFC series to demonstrate a possible infrastructure
for the "write rarely" memory storage type in the kernel (patch 1). The
intent is to further reduce the internal attack surface of the kernel
by making more variables read-only while "at rest". This is heavily
based on the "__read_only" portion of the KERNEXEC infrastructure from
PaX/grsecurity, though I tried to adjust it to be more in line with
the upstream discussions around the APIs.

Also included is the PaX/grsecurity constify plugin (patch 10) which will
automatically make all instances of certain structures read-only, to help
demonstrate more complex cases of "write rarely" targets. (The plugin in
this series is altered to only operate on marked structures, rather than
the full automatic constification.)

As part of the series I've included both x86 support (patch 4), exactly
as done in PaX, and ARM support (patches 5-7), similar to what is done in
grsecurity but without support for earlier ARM CPUs. Both are lightly
tested by me, though they need more work; ARM especially, as it is still
missing the correct domain marking for kernel modules.

I've added an lkdtm test (patch 2), which serves as a stand-alone example
of the new infrastructure.

Included are two example "conversions" to the rare_write()-style of
variable manipulation: a simple one, which switches the inet diag handler
table to write-rarely during register/unregister calls (patch 3), and
a more complex one: cgroup types (patch 11), which are made read-only via
the constify plugin. The latter uses rare-write linked lists (patch 9)
and multi-field updates. Both examples are refactorings of what already
appears in PaX/grsecurity.

The patches are:

	[PATCH 01/11] Introduce rare_write() infrastructure
	[PATCH 02/11] lkdtm: add test for rare_write() infrastructure
	[PATCH 03/11] net: switch sock_diag handlers to rare_write()
	[PATCH 04/11] x86: Implement __arch_rare_write_begin/end()
	[PATCH 05/11] ARM: mm: dump: Add domain to output
	[PATCH 06/11] ARM: domains: Extract common USER domain init
	[PATCH 07/11] ARM: mm: set DOMAIN_WR_RARE for rodata
	[PATCH 08/11] ARM: Implement __arch_rare_write_begin/end()
	[PATCH 09/11] list: add rare_write() list helpers
	[PATCH 10/11] gcc-plugins: Add constify plugin
	[PATCH 11/11] cgroups: force all struct cftype const

-Kees

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC v2][PATCH 01/11] Introduce rare_write() infrastructure
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-29 18:23   ` Kees Cook
  2017-04-07  8:09   ` Ho-Eun Ryu
  2017-03-29 18:15 ` [RFC v2][PATCH 02/11] lkdtm: add test for " Kees Cook
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

Several types of data storage exist in the kernel: read-write data (.data,
.bss), read-only data (.rodata), and RO-after-init. This introduces the
infrastructure for another type: write-rarely, which is intended for data
that is either only rarely modified or especially security-sensitive. The
goal is to further reduce the internal attack surface of the kernel by
making this storage read-only when "at rest". This makes it much harder
to be subverted by attackers who have a kernel-write flaw, since they
cannot directly change these memory contents.

This work is heavily based on PaX and grsecurity's pax_{open,close}_kernel
API, its __read_only annotations, its constify plugin, and the work done
to identify sensitive structures that should be moved from .data into
.rodata. This builds the initial infrastructure to support these kinds
of changes, though the API and naming have been adjusted in places for
clarity and maintainability.

Variables declared with the __wr_rare annotation will be moved to the
.rodata section if an architecture supports CONFIG_HAVE_ARCH_WRITE_RARE.
To change these variables, either a single rare_write() macro can be used,
or multiple __rare_write() calls can be batched inside a matching pair of
rare_write_begin() and rare_write_end() calls. These macros
are expanded into the arch-specific functions that perform the actions
needed to write to otherwise read-only memory.

As detailed in the Kconfig help, the arch-specific helpers have several
requirements to make them sensible/safe for use by the kernel: they must
not allow non-current CPUs to write the memory area, they must not be
preemptible (to avoid accidentally leaving memory writable), and they must
be inline to avoid making them desirable ROP targets for attackers.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig             | 25 +++++++++++++++++++++++++
 include/linux/compiler.h | 32 ++++++++++++++++++++++++++++++++
 include/linux/preempt.h  |  6 ++++--
 3 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index cd211a14a88f..5ebf62500b99 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -847,4 +847,29 @@ config STRICT_MODULE_RWX
 config ARCH_WANT_RELAX_ORDER
 	bool
 
+config HAVE_ARCH_RARE_WRITE
+	def_bool n
+	help
+	  An arch should select this option if it has defined the functions
+	  __arch_rare_write_begin() and __arch_rare_write_end() to
+	  respectively enable and disable writing to read-only memory. The
+	  routines must meet the following requirements:
+	  - read-only memory writing must only be available on the current
+	    CPU (to make sure other CPUs can't race to make changes too).
+	  - the routines must be declared inline (to discourage ROP use).
+	  - the routines must not be preemptible (likely they will call
+	    preempt_disable() and preempt_enable_no_resched() respectively).
+	  - the routines must validate expected state (e.g. when enabling
+	    writes, BUG() if writes are already enabled).
+
+config HAVE_ARCH_RARE_WRITE_MEMCPY
+	def_bool n
+	depends on HAVE_ARCH_RARE_WRITE
+	help
+	  An arch should select this option if a special accessor is needed
+	  to write to otherwise read-only memory, defined by the function
+	  __arch_rare_write_memcpy(). Without this, the write-rarely
+	  infrastructure will just attempt to write directly to the memory
+	  using a const-ignoring assignment.
+
 source "kernel/gcov/Kconfig"
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index f8110051188f..274bd03cfe9e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -336,6 +336,38 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 	__u.__val;					\
 })
 
+/*
+ * Build "write rarely" infrastructure for flipping memory r/w
+ * on a per-CPU basis.
+ */
+#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
+# define __wr_rare
+# define __wr_rare_type
+# define __rare_write(__var, __val)	(__var = (__val))
+# define rare_write_begin()		do { } while (0)
+# define rare_write_end()		do { } while (0)
+#else
+# define __wr_rare			__ro_after_init
+# define __wr_rare_type			const
+# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
+#  define __rare_write_n(dst, src, len)	({			\
+		BUILD_BUG_ON(!__builtin_constant_p(len));	\
+		__arch_rare_write_memcpy((dst), (src), (len));	\
+	})
+#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
+# else
+#  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
+# endif
+# define rare_write_begin()	__arch_rare_write_begin()
+# define rare_write_end()	__arch_rare_write_end()
+#endif
+#define rare_write(__var, __val) ({			\
+	rare_write_begin();				\
+	__rare_write(__var, __val);			\
+	rare_write_end();				\
+	__var;						\
+})
+
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index cae461224948..4fc97aaa22ea 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -258,10 +258,12 @@ do { \
 /*
  * Modules have no business playing preemption tricks.
  */
-#undef sched_preempt_enable_no_resched
-#undef preempt_enable_no_resched
 #undef preempt_enable_no_resched_notrace
 #undef preempt_check_resched
+#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
+#undef sched_preempt_enable_no_resched
+#undef preempt_enable_no_resched
+#endif
 #endif
 
 #define preempt_set_need_resched() \
-- 
2.7.4

* [RFC v2][PATCH 02/11] lkdtm: add test for rare_write() infrastructure
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 01/11] " Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-30  9:34   ` [kernel-hardening] " Ian Campbell
  2017-03-29 18:15 ` [RFC v2][PATCH 03/11] net: switch sock_diag handlers to rare_write() Kees Cook
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

This adds the WRITE_RARE_WRITE test to validate variables marked with
__wr_rare.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/misc/lkdtm.h       |  1 +
 drivers/misc/lkdtm_core.c  |  1 +
 drivers/misc/lkdtm_perms.c | 19 ++++++++++++++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 67d27be60405..d1fd5aefa235 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -39,6 +39,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RARE_WRITE(void);
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index b9a4cd4a9b68..ac8a55947189 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -219,6 +219,7 @@ struct crashtype crashtypes[] = {
 	CRASHTYPE(ACCESS_USERSPACE),
 	CRASHTYPE(WRITE_RO),
 	CRASHTYPE(WRITE_RO_AFTER_INIT),
+	CRASHTYPE(WRITE_RARE_WRITE),
 	CRASHTYPE(WRITE_KERN),
 	CRASHTYPE(REFCOUNT_SATURATE_INC),
 	CRASHTYPE(REFCOUNT_SATURATE_ADD),
diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
index c7635a79341f..8fbadfa4cc34 100644
--- a/drivers/misc/lkdtm_perms.c
+++ b/drivers/misc/lkdtm_perms.c
@@ -20,12 +20,15 @@
 /* This is non-const, so it will end up in the .data section. */
 static u8 data_area[EXEC_SIZE];
 
-/* This is cost, so it will end up in the .rodata section. */
+/* This is const, so it will end up in the .rodata section. */
 static const unsigned long rodata = 0xAA55AA55;
 
 /* This is marked __ro_after_init, so it should ultimately be .rodata. */
 static unsigned long ro_after_init __ro_after_init = 0x55AA5500;
 
+/* This is marked __wr_rare, so it should ultimately be .rodata. */
+static unsigned long wr_rare __wr_rare = 0xAA66AA66;
+
 /*
  * This just returns to the caller. It is designed to be copied into
  * non-executable memory regions.
@@ -103,6 +106,20 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
 	*ptr ^= 0xabcd1234;
 }
 
+void lkdtm_WRITE_RARE_WRITE(void)
+{
+	/* Explicitly cast away "const" for the test. */
+	unsigned long *ptr = (unsigned long *)&wr_rare;
+
+	pr_info("attempting good rare write at %p\n", ptr);
+	rare_write(*ptr, 0x11335577);
+	if (wr_rare != 0x11335577)
+		pr_warn("Yikes: wr_rare did not actually change!\n");
+
+	pr_info("attempting bad rare write at %p\n", ptr);
+	*ptr ^= 0xbcd12345;
+}
+
 void lkdtm_WRITE_KERN(void)
 {
 	size_t size;
-- 
2.7.4

* [RFC v2][PATCH 03/11] net: switch sock_diag handlers to rare_write()
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 01/11] " Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 02/11] lkdtm: add test for " Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/end() Kees Cook
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

This is a simple example of a register/unregister case for __wr_rare
markings, which only needs a single rare_write() call to make updates.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 net/core/sock_diag.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 6b10573cc9fa..67253026106f 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -14,7 +14,7 @@
 #include <linux/inet_diag.h>
 #include <linux/sock_diag.h>
 
-static const struct sock_diag_handler *sock_diag_handlers[AF_MAX];
+static const struct sock_diag_handler *sock_diag_handlers[AF_MAX] __wr_rare;
 static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
@@ -194,7 +194,7 @@ int sock_diag_register(const struct sock_diag_handler *hndl)
 	if (sock_diag_handlers[hndl->family])
 		err = -EBUSY;
 	else
-		sock_diag_handlers[hndl->family] = hndl;
+		rare_write(sock_diag_handlers[hndl->family], hndl);
 	mutex_unlock(&sock_diag_table_mutex);
 
 	return err;
@@ -210,7 +210,7 @@ void sock_diag_unregister(const struct sock_diag_handler *hnld)
 
 	mutex_lock(&sock_diag_table_mutex);
 	BUG_ON(sock_diag_handlers[family] != hnld);
-	sock_diag_handlers[family] = NULL;
+	rare_write(sock_diag_handlers[family], NULL);
 	mutex_unlock(&sock_diag_table_mutex);
 }
 EXPORT_SYMBOL_GPL(sock_diag_unregister);
-- 
2.7.4

* [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/end()
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (2 preceding siblings ...)
  2017-03-29 18:15 ` [RFC v2][PATCH 03/11] net: switch sock_diag handlers to rare_write() Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-29 22:38   ` Andy Lutomirski
  2017-04-07  9:37   ` Peter Zijlstra
  2017-03-29 18:15 ` [RFC v2][PATCH 05/11] ARM: mm: dump: Add domain to output Kees Cook
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

Based on PaX's x86 pax_{open,close}_kernel() implementation, this
allows HAVE_ARCH_RARE_WRITE to work on x86.

There is missing work to sort out some header file issues where preempt.h
is missing, though it can't be included in pgtable.h unconditionally...
some other solution will be needed, perhaps an entirely separate header
file for rare_write()-related defines...

This patch is also missing paravirt support.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/pgtable.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..2d1d707aa036 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -106,6 +106,7 @@ config X86
 	select HAVE_ARCH_KMEMCHECK
 	select HAVE_ARCH_MMAP_RND_BITS		if MMU
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
+	select HAVE_ARCH_RARE_WRITE
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..2e6bf661bb84 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -91,6 +91,37 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #endif	/* CONFIG_PARAVIRT */
 
+/* TODO: Bad hack to deal with preempt macros being missing sometimes. */
+#ifndef preempt_disable
+#include <linux/preempt.h>
+#endif
+
+static __always_inline unsigned long __arch_rare_write_begin(void)
+{
+	unsigned long cr0;
+
+	preempt_disable();
+	barrier();
+	cr0 = read_cr0() ^ X86_CR0_WP;
+	BUG_ON(cr0 & X86_CR0_WP);
+	write_cr0(cr0);
+	barrier();
+	return cr0 ^ X86_CR0_WP;
+}
+
+static __always_inline unsigned long __arch_rare_write_end(void)
+{
+	unsigned long cr0;
+
+	barrier();
+	cr0 = read_cr0() ^ X86_CR0_WP;
+	BUG_ON(!(cr0 & X86_CR0_WP));
+	write_cr0(cr0);
+	barrier();
+	preempt_enable_no_resched();
+	return cr0 ^ X86_CR0_WP;
+}
+
 /*
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
-- 
2.7.4

* [RFC v2][PATCH 05/11] ARM: mm: dump: Add domain to output
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (3 preceding siblings ...)
  2017-03-29 18:15 ` [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/end() Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 06/11] ARM: domains: Extract common USER domain init Kees Cook
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

This adds the memory domain (on non-LPAE) to the PMD and PTE dumps. This
isn't in the regular PMD bits because I couldn't find a clean way to
retain some of the PMD bits when falling through to report PTEs, so this
is special-cased for now.

New output example:

  ---[ Modules ]---
  0x7f000000-0x7f001000       4K KERNEL      ro x  SHD MEM/CACHED/WBWA
  0x7f001000-0x7f002000       4K KERNEL      ro NX SHD MEM/CACHED/WBWA
  0x7f002000-0x7f004000       8K KERNEL      RW NX SHD MEM/CACHED/WBWA
  ---[ Kernel Mapping ]---
  0x80000000-0x80100000       1M KERNEL      RW NX SHD
  0x80100000-0x80800000       7M KERNEL      ro x  SHD
  0x80800000-0x80b00000       3M KERNEL      ro NX SHD
  0x80b00000-0xa0000000     501M KERNEL      RW NX SHD
  ...
  ---[ Vectors ]---
  0xffff0000-0xffff1000       4K VECTORS USR ro x  SHD MEM/CACHED/WBWA
  0xffff1000-0xffff2000       4K VECTORS     ro x  SHD MEM/CACHED/WBWA

Signed-off-by: Kees Cook <keescook@chromium.org>
---
This patch is already queued in the ARM tree, but I'm including it here too
since a following patch updates the list of domain names from this patch...
---
 arch/arm/mm/dump.c | 54 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 10 deletions(-)

diff --git a/arch/arm/mm/dump.c b/arch/arm/mm/dump.c
index 21192d6eda40..35ff45470dbf 100644
--- a/arch/arm/mm/dump.c
+++ b/arch/arm/mm/dump.c
@@ -17,6 +17,7 @@
 #include <linux/mm.h>
 #include <linux/seq_file.h>
 
+#include <asm/domain.h>
 #include <asm/fixmap.h>
 #include <asm/memory.h>
 #include <asm/pgtable.h>
@@ -43,6 +44,7 @@ struct pg_state {
 	unsigned long start_address;
 	unsigned level;
 	u64 current_prot;
+	const char *current_domain;
 };
 
 struct prot_bits {
@@ -216,7 +218,8 @@ static void dump_prot(struct pg_state *st, const struct prot_bits *bits, size_t
 	}
 }
 
-static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u64 val)
+static void note_page(struct pg_state *st, unsigned long addr,
+		      unsigned int level, u64 val, const char *domain)
 {
 	static const char units[] = "KMGTPE";
 	u64 prot = val & pg_level[level].mask;
@@ -224,8 +227,10 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u
 	if (!st->level) {
 		st->level = level;
 		st->current_prot = prot;
+		st->current_domain = domain;
 		seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
 	} else if (prot != st->current_prot || level != st->level ||
+		   domain != st->current_domain ||
 		   addr >= st->marker[1].start_address) {
 		const char *unit = units;
 		unsigned long delta;
@@ -240,6 +245,8 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u
 				unit++;
 			}
 			seq_printf(st->seq, "%9lu%c", delta, *unit);
+			if (st->current_domain)
+				seq_printf(st->seq, " %s", st->current_domain);
 			if (pg_level[st->level].bits)
 				dump_prot(st, pg_level[st->level].bits, pg_level[st->level].num);
 			seq_printf(st->seq, "\n");
@@ -251,11 +258,13 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u
 		}
 		st->start_address = addr;
 		st->current_prot = prot;
+		st->current_domain = domain;
 		st->level = level;
 	}
 }
 
-static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start)
+static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start,
+		     const char *domain)
 {
 	pte_t *pte = pte_offset_kernel(pmd, 0);
 	unsigned long addr;
@@ -263,25 +272,50 @@ static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start)
 
 	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
 		addr = start + i * PAGE_SIZE;
-		note_page(st, addr, 4, pte_val(*pte));
+		note_page(st, addr, 4, pte_val(*pte), domain);
 	}
 }
 
+static const char *get_domain_name(pmd_t *pmd)
+{
+#ifndef CONFIG_ARM_LPAE
+	switch (pmd_val(*pmd) & PMD_DOMAIN_MASK) {
+	case PMD_DOMAIN(DOMAIN_KERNEL):
+		return "KERNEL ";
+	case PMD_DOMAIN(DOMAIN_USER):
+		return "USER   ";
+	case PMD_DOMAIN(DOMAIN_IO):
+		return "IO     ";
+	case PMD_DOMAIN(DOMAIN_VECTORS):
+		return "VECTORS";
+	default:
+		return "unknown";
+	}
+#endif
+	return NULL;
+}
+
 static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start)
 {
 	pmd_t *pmd = pmd_offset(pud, 0);
 	unsigned long addr;
 	unsigned i;
+	const char *domain;
 
 	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
 		addr = start + i * PMD_SIZE;
+		domain = get_domain_name(pmd);
 		if (pmd_none(*pmd) || pmd_large(*pmd) || !pmd_present(*pmd))
-			note_page(st, addr, 3, pmd_val(*pmd));
+			note_page(st, addr, 3, pmd_val(*pmd), domain);
 		else
-			walk_pte(st, pmd, addr);
+			walk_pte(st, pmd, addr, domain);
 
-		if (SECTION_SIZE < PMD_SIZE && pmd_large(pmd[1]))
-			note_page(st, addr + SECTION_SIZE, 3, pmd_val(pmd[1]));
+		if (SECTION_SIZE < PMD_SIZE && pmd_large(pmd[1])) {
+			addr += SECTION_SIZE;
+			pmd++;
+			domain = get_domain_name(pmd);
+			note_page(st, addr, 3, pmd_val(*pmd), domain);
+		}
 	}
 }
 
@@ -296,7 +330,7 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start)
 		if (!pud_none(*pud)) {
 			walk_pmd(st, pud, addr);
 		} else {
-			note_page(st, addr, 2, pud_val(*pud));
+			note_page(st, addr, 2, pud_val(*pud), NULL);
 		}
 	}
 }
@@ -317,11 +351,11 @@ static void walk_pgd(struct seq_file *m)
 		if (!pgd_none(*pgd)) {
 			walk_pud(&st, pgd, addr);
 		} else {
-			note_page(&st, addr, 1, pgd_val(*pgd));
+			note_page(&st, addr, 1, pgd_val(*pgd), NULL);
 		}
 	}
 
-	note_page(&st, 0, 0, 0);
+	note_page(&st, 0, 0, 0, NULL);
 }
 
 static int ptdump_show(struct seq_file *m, void *v)
-- 
2.7.4

* [RFC v2][PATCH 06/11] ARM: domains: Extract common USER domain init
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (4 preceding siblings ...)
  2017-03-29 18:15 ` [RFC v2][PATCH 05/11] ARM: mm: dump: Add domain to output Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-29 18:15 ` [RFC v2][PATCH 07/11] ARM: mm: set DOMAIN_WR_RARE for rodata Kees Cook
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

Everything but the USER domain is the same whether CONFIG_CPU_SW_DOMAIN_PAN
is enabled or not. This extracts the difference into a shared DACR_INIT
macro, making future changes (like adding the WR_RARE domain) easier.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/arm/include/asm/domain.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h
index 99d9f630d6b6..8b33bd7f6bf9 100644
--- a/arch/arm/include/asm/domain.h
+++ b/arch/arm/include/asm/domain.h
@@ -59,18 +59,18 @@
 #define domain_val(dom,type)	((type) << (2 * (dom)))
 
 #ifdef CONFIG_CPU_SW_DOMAIN_PAN
-#define DACR_INIT \
-	(domain_val(DOMAIN_USER, DOMAIN_NOACCESS) | \
-	 domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \
-	 domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \
-	 domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT))
+#define __DACR_INIT_USER \
+	domain_val(DOMAIN_USER, DOMAIN_NOACCESS)
 #else
+#define __DACR_INIT_USER \
+	domain_val(DOMAIN_USER, DOMAIN_CLIENT)
+#endif
+
 #define DACR_INIT \
-	(domain_val(DOMAIN_USER, DOMAIN_CLIENT) | \
+	(__DACR_INIT_USER | \
 	 domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \
 	 domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \
 	 domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT))
-#endif
 
 #define __DACR_DEFAULT \
 	domain_val(DOMAIN_KERNEL, DOMAIN_CLIENT) | \
-- 
2.7.4

* [RFC v2][PATCH 07/11] ARM: mm: set DOMAIN_WR_RARE for rodata
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (5 preceding siblings ...)
  2017-03-29 18:15 ` [RFC v2][PATCH 06/11] ARM: domains: Extract common USER domain init Kees Cook
@ 2017-03-29 18:15 ` Kees Cook
  2017-03-29 18:16 ` [RFC v2][PATCH 08/11] ARM: Implement __arch_rare_write_begin/end() Kees Cook
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

This creates DOMAIN_WR_RARE for the kernel's .rodata section, separate
from DOMAIN_KERNEL to avoid predictive fetching in device memory during
a DOMAIN_MANAGER transition.

TODO: handle kernel module vmalloc memory, which needs to be marked as
DOMAIN_WR_RARE too, for module .rodata sections.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/arm/include/asm/domain.h | 3 +++
 arch/arm/mm/dump.c            | 2 ++
 arch/arm/mm/init.c            | 7 ++++---
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h
index 8b33bd7f6bf9..b5ca80ac823c 100644
--- a/arch/arm/include/asm/domain.h
+++ b/arch/arm/include/asm/domain.h
@@ -43,6 +43,7 @@
 #define DOMAIN_IO	0
 #endif
 #define DOMAIN_VECTORS	3
+#define DOMAIN_WR_RARE	4
 
 /*
  * Domain types
@@ -69,11 +70,13 @@
 #define DACR_INIT \
 	(__DACR_INIT_USER | \
 	 domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \
+	 domain_val(DOMAIN_WR_RARE, DOMAIN_CLIENT) | \
 	 domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \
 	 domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT))
 
 #define __DACR_DEFAULT \
 	domain_val(DOMAIN_KERNEL, DOMAIN_CLIENT) | \
+	domain_val(DOMAIN_WR_RARE, DOMAIN_CLIENT) | \
 	domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \
 	domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT)
 
diff --git a/arch/arm/mm/dump.c b/arch/arm/mm/dump.c
index 35ff45470dbf..b1aa9a17e0c3 100644
--- a/arch/arm/mm/dump.c
+++ b/arch/arm/mm/dump.c
@@ -288,6 +288,8 @@ static const char *get_domain_name(pmd_t *pmd)
 		return "IO     ";
 	case PMD_DOMAIN(DOMAIN_VECTORS):
 		return "VECTORS";
+	case PMD_DOMAIN(DOMAIN_WR_RARE):
+		return "WR_RARE";
 	default:
 		return "unknown";
 	}
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 1d8558ff9827..d54a74b5718b 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -642,9 +642,10 @@ static struct section_perm ro_perms[] = {
 		.mask   = ~L_PMD_SECT_RDONLY,
 		.prot   = L_PMD_SECT_RDONLY,
 #else
-		.mask   = ~(PMD_SECT_APX | PMD_SECT_AP_WRITE),
-		.prot   = PMD_SECT_APX | PMD_SECT_AP_WRITE,
-		.clear  = PMD_SECT_AP_WRITE,
+		.mask   = ~(PMD_SECT_APX | PMD_SECT_AP_WRITE | PMD_DOMAIN_MASK),
+		.prot   = PMD_SECT_APX | PMD_SECT_AP_WRITE | \
+			  PMD_DOMAIN(DOMAIN_WR_RARE),
+		.clear  = PMD_SECT_AP_WRITE | PMD_DOMAIN(DOMAIN_KERNEL),
 #endif
 	},
 };
-- 
2.7.4

* [RFC v2][PATCH 08/11] ARM: Implement __arch_rare_write_begin/end()
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (6 preceding siblings ...)
  2017-03-29 18:15 ` [RFC v2][PATCH 07/11] ARM: mm: set DOMAIN_WR_RARE for rodata Kees Cook
@ 2017-03-29 18:16 ` Kees Cook
  2017-04-07  9:36   ` Peter Zijlstra
  2017-03-29 18:16 ` [RFC v2][PATCH 09/11] list: add rare_write() list helpers Kees Cook
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:16 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
allows HAVE_ARCH_RARE_WRITE to work on ARM.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/arm/Kconfig               |  1 +
 arch/arm/include/asm/domain.h  |  3 ++-
 arch/arm/include/asm/pgtable.h | 27 +++++++++++++++++++++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 0d4e71b42c77..57b8aeaf501c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -45,6 +45,7 @@ config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_RARE_WRITE if MMU && !ARM_LPAE && !CPU_USE_DOMAINS
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h
index b5ca80ac823c..b3fb5c0a2efd 100644
--- a/arch/arm/include/asm/domain.h
+++ b/arch/arm/include/asm/domain.h
@@ -54,6 +54,7 @@
 #define DOMAIN_MANAGER	3
 #else
 #define DOMAIN_MANAGER	1
+#define DOMAIN_FORCE_MANAGER 3
 #endif
 
 #define domain_mask(dom)	((3) << (2 * (dom)))
@@ -118,7 +119,7 @@ static inline void set_domain(unsigned val)
 }
 #endif
 
-#ifdef CONFIG_CPU_USE_DOMAINS
+#if defined(CONFIG_CPU_USE_DOMAINS) || defined(CONFIG_HAVE_ARCH_RARE_WRITE)
 #define modify_domain(dom,type)					\
 	do {							\
 		unsigned int domain = get_domain();		\
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 1c462381c225..104923ea9eb5 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -57,6 +57,33 @@ extern void __pgd_error(const char *file, int line, pgd_t);
 #define pmd_ERROR(pmd)		__pmd_error(__FILE__, __LINE__, pmd)
 #define pgd_ERROR(pgd)		__pgd_error(__FILE__, __LINE__, pgd)
 
+#ifdef CONFIG_HAVE_ARCH_RARE_WRITE
+#include <asm/domain.h>
+#include <linux/preempt.h>
+
+static inline int test_domain(int domain, int domaintype)
+{
+	return (get_domain() & domain_val(domain, 3)) ==
+		domain_val(domain, domaintype);
+}
+
+static inline unsigned long __arch_rare_write_begin(void)
+{
+	preempt_disable();
+	BUG_ON(test_domain(DOMAIN_WR_RARE, DOMAIN_FORCE_MANAGER));
+	modify_domain(DOMAIN_WR_RARE, DOMAIN_FORCE_MANAGER);
+	return 0;
+}
+
+static inline unsigned long __arch_rare_write_end(void)
+{
+	BUG_ON(test_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT));
+	modify_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT);
+	preempt_enable_no_resched();
+	return 0;
+}
+#endif
+
 /*
  * This is the lowest virtual address we can permit any user space
  * mapping to be mapped at.  This is particularly important for
-- 
2.7.4

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC v2][PATCH 09/11] list: add rare_write() list helpers
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (7 preceding siblings ...)
  2017-03-29 18:16 ` [RFC v2][PATCH 08/11] ARM: Implement __arch_rare_write_begin/end() Kees Cook
@ 2017-03-29 18:16 ` Kees Cook
  2017-03-29 18:16 ` [RFC v2][PATCH 10/11] gcc-plugins: Add constify plugin Kees Cook
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:16 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

Some structures that are intended to be made write-rarely are linked
together in lists, so linked-list primitives that route their updates
through rare_write() are needed.

As found in PaX, this adds list management helpers for doing updates to
rarely-changed lists.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/list.h | 17 +++++++++++++++++
 lib/Makefile         |  2 +-
 lib/list_debug.c     | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index ae537fa46216..50fdd5b737aa 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -126,6 +126,23 @@ static inline void list_del(struct list_head *entry)
 	entry->prev = LIST_POISON2;
 }
 
+extern void __rare_list_add(struct list_head *new,
+			    struct list_head *prev,
+			    struct list_head *next);
+
+static inline void
+rare_list_add(__wr_rare_type struct list_head *new, struct list_head *head)
+{
+	__rare_list_add((struct list_head *)new, head, head->next);
+}
+static inline void
+rare_list_add_tail(__wr_rare_type struct list_head *new, struct list_head *head)
+{
+	__rare_list_add((struct list_head *)new, head->prev, head);
+}
+
+extern void rare_list_del(__wr_rare_type struct list_head *entry);
+
 /**
  * list_replace - replace old entry by new one
  * @old : the element to be replaced
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a8725..cd64fd8f7a21 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -83,7 +83,7 @@ obj-$(CONFIG_BTREE) += btree.o
 obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
 obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
-obj-$(CONFIG_DEBUG_LIST) += list_debug.o
+obj-y += list_debug.o
 obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
 
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
diff --git a/lib/list_debug.c b/lib/list_debug.c
index a34db8d27667..1add73f9479a 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -10,7 +10,9 @@
 #include <linux/bug.h>
 #include <linux/kernel.h>
 #include <linux/rculist.h>
+#include <linux/mm.h>
 
+#ifdef CONFIG_DEBUG_LIST
 /*
  * Check that the data structures for the list manipulations are reasonably
  * valid. Failures here indicate memory corruption (and possibly an exploit
@@ -60,3 +62,38 @@ bool __list_del_entry_valid(struct list_head *entry)
 
 }
 EXPORT_SYMBOL(__list_del_entry_valid);
+
+#endif /* CONFIG_DEBUG_LIST */
+
+void __rare_list_add(struct list_head *new, struct list_head *prev,
+		     struct list_head *next)
+{
+	if (!__list_add_valid(new, prev, next))
+		return;
+
+	rare_write_begin();
+	__rare_write(next->prev, new);
+	__rare_write(new->next, next);
+	__rare_write(new->prev, prev);
+	__rare_write(prev->next, new);
+	rare_write_end();
+}
+EXPORT_SYMBOL(__rare_list_add);
+
+void rare_list_del(__wr_rare_type struct list_head *entry_const)
+{
+	struct list_head *entry = (struct list_head *)entry_const;
+	struct list_head *prev = entry->prev;
+	struct list_head *next = entry->next;
+
+	if (!__list_del_entry_valid(entry))
+		return;
+
+	rare_write_begin();
+	__rare_write(next->prev, prev);
+	__rare_write(prev->next, next);
+	__rare_write(entry->next, LIST_POISON1);
+	__rare_write(entry->prev, LIST_POISON2);
+	rare_write_end();
+}
+EXPORT_SYMBOL(rare_list_del);
-- 
2.7.4


* [RFC v2][PATCH 10/11] gcc-plugins: Add constify plugin
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (8 preceding siblings ...)
  2017-03-29 18:16 ` [RFC v2][PATCH 09/11] list: add rare_write() list helpers Kees Cook
@ 2017-03-29 18:16 ` Kees Cook
  2017-03-29 18:16 ` [RFC v2][PATCH 11/11] cgroups: force all struct cftype const Kees Cook
  2017-03-29 19:00 ` [RFC v2] Introduce rare_write() infrastructure Russell King - ARM Linux
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:16 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

This is a port of the PaX/grsecurity constify plugin. However, it has
the automatic function-pointer struct detection temporarily disabled. As
a result, this will only recognize the __do_const annotation, which
makes all instances of a marked structure read-only. The rare_write()
infrastructure can be used to make changes to such variables.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig                          |  13 +
 include/linux/compiler-gcc.h          |   5 +
 include/linux/compiler.h              |   8 +
 scripts/Makefile.gcc-plugins          |   2 +
 scripts/gcc-plugins/constify_plugin.c | 585 ++++++++++++++++++++++++++++++++++
 5 files changed, 613 insertions(+)
 create mode 100644 scripts/gcc-plugins/constify_plugin.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 5ebf62500b99..79447d6a40bc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -436,6 +436,19 @@ config GCC_PLUGIN_STRUCTLEAK_VERBOSE
 	  initialized. Since not all existing initializers are detected
 	  by the plugin, this can produce false positive warnings.
 
+config GCC_PLUGIN_CONSTIFY
+	bool "Make function pointer structures and marked variables read-only"
+	depends on GCC_PLUGINS && HAVE_ARCH_RARE_WRITE && !UML
+	help
+	  By saying Y here the compiler will automatically constify a class
+	  of types that contain only function pointers, as well as any
+	  manually annotated structures. This reduces the kernel's attack
+	  surface and also produces a better memory layout.
+
+	  This plugin was ported from grsecurity/PaX. More information at:
+	   * https://grsecurity.net/
+	   * https://pax.grsecurity.net/
+
 config HAVE_CC_STACKPROTECTOR
 	bool
 	help
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 0efef9cf014f..cee2bf7c32a4 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -191,6 +191,11 @@
 
 #if GCC_VERSION >= 40500
 
+#ifdef CONSTIFY_PLUGIN
+#define __no_const __attribute__((no_const))
+#define __do_const __attribute__((do_const))
+#endif
+
 #ifndef __CHECKER__
 #ifdef LATENT_ENTROPY_PLUGIN
 #define __latent_entropy __attribute__((latent_entropy))
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 274bd03cfe9e..2def0f12a71c 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -476,6 +476,14 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 # define __latent_entropy
 #endif
 
+#ifndef __do_const
+# define __do_const
+#endif
+
+#ifndef __no_const
+# define __no_const
+#endif
+
 /*
  * Tell gcc if a function is cold. The compiler will assume any path
  * directly leading to the call is unlikely.
diff --git a/scripts/Makefile.gcc-plugins b/scripts/Makefile.gcc-plugins
index 82335533620e..8d264c0bb758 100644
--- a/scripts/Makefile.gcc-plugins
+++ b/scripts/Makefile.gcc-plugins
@@ -5,6 +5,8 @@ ifdef CONFIG_GCC_PLUGINS
   SANCOV_PLUGIN := -fplugin=$(objtree)/scripts/gcc-plugins/sancov_plugin.so
 
   gcc-plugin-$(CONFIG_GCC_PLUGIN_CYC_COMPLEXITY)	+= cyc_complexity_plugin.so
+  gcc-plugin-$(CONFIG_GCC_PLUGIN_CONSTIFY)	+= constify_plugin.so
+  gcc-plugin-cflags-$(CONFIG_GCC_PLUGIN_CONSTIFY)	+= -DCONSTIFY_PLUGIN
 
   gcc-plugin-$(CONFIG_GCC_PLUGIN_LATENT_ENTROPY)	+= latent_entropy_plugin.so
   gcc-plugin-cflags-$(CONFIG_GCC_PLUGIN_LATENT_ENTROPY)	+= -DLATENT_ENTROPY_PLUGIN
diff --git a/scripts/gcc-plugins/constify_plugin.c b/scripts/gcc-plugins/constify_plugin.c
new file mode 100644
index 000000000000..e7d6f3140e87
--- /dev/null
+++ b/scripts/gcc-plugins/constify_plugin.c
@@ -0,0 +1,585 @@
+/*
+ * Copyright 2011 by Emese Revfy <re.emese@gmail.com>
+ * Copyright 2011-2016 by PaX Team <pageexec@freemail.hu>
+ * Licensed under the GPL v2, or (at your option) v3
+ *
+ * This gcc plugin constifies all structures which contain only function pointers or are explicitly marked for constification.
+ *
+ * Homepage:
+ * http://www.grsecurity.net/~ephox/const_plugin/
+ *
+ * Usage:
+ * $ gcc -I`gcc -print-file-name=plugin`/include -fPIC -shared -O2 -o constify_plugin.so constify_plugin.c
+ * $ gcc -fplugin=constify_plugin.so test.c -O2
+ */
+
+#include "gcc-common.h"
+
+// unused C type flag in all versions 4.5-6
+#define TYPE_CONSTIFY_VISITED(TYPE) TYPE_LANG_FLAG_4(TYPE)
+
+__visible int plugin_is_GPL_compatible;
+
+static bool enabled = true;
+
+static struct plugin_info const_plugin_info = {
+	.version	= "201607241840vanilla",
+	.help		= "disable\tturn off constification\n",
+};
+
+static struct {
+	const char *name;
+	const char *asm_op;
+} const_sections[] = {
+	{".init.rodata",     "\t.section\t.init.rodata,\"a\""},
+	{".ref.rodata",      "\t.section\t.ref.rodata,\"a\""},
+	{".devinit.rodata",  "\t.section\t.devinit.rodata,\"a\""},
+	{".devexit.rodata",  "\t.section\t.devexit.rodata,\"a\""},
+	{".cpuinit.rodata",  "\t.section\t.cpuinit.rodata,\"a\""},
+	{".cpuexit.rodata",  "\t.section\t.cpuexit.rodata,\"a\""},
+	{".meminit.rodata",  "\t.section\t.meminit.rodata,\"a\""},
+	{".memexit.rodata",  "\t.section\t.memexit.rodata,\"a\""},
+	{".data..read_only", "\t.section\t.data..read_only,\"a\""},
+};
+
+typedef struct {
+	bool has_fptr_field;
+	bool has_writable_field;
+	bool has_do_const_field;
+	bool has_no_const_field;
+} constify_info;
+
+static const_tree get_field_type(const_tree field)
+{
+	return strip_array_types(TREE_TYPE(field));
+}
+
+static bool is_fptr(const_tree field)
+{
+	/* XXX: disable automatic constification. */
+	return false;
+
+	const_tree ptr = get_field_type(field);
+
+	if (TREE_CODE(ptr) != POINTER_TYPE)
+		return false;
+
+	return TREE_CODE(TREE_TYPE(ptr)) == FUNCTION_TYPE;
+}
+
+/*
+ * determine whether the given structure type meets the requirements for automatic constification,
+ * including the constification attributes on nested structure types
+ */
+static void constifiable(const_tree node, constify_info *cinfo)
+{
+	const_tree field;
+
+	gcc_assert(TREE_CODE(node) == RECORD_TYPE || TREE_CODE(node) == UNION_TYPE);
+
+	// e.g., pointer to structure fields while still constructing the structure type
+	if (TYPE_FIELDS(node) == NULL_TREE)
+		return;
+
+	for (field = TYPE_FIELDS(node); field; field = TREE_CHAIN(field)) {
+		const_tree type = get_field_type(field);
+		enum tree_code code = TREE_CODE(type);
+
+		if (node == type)
+			continue;
+
+		if (is_fptr(field))
+			cinfo->has_fptr_field = true;
+		else if (code == RECORD_TYPE || code == UNION_TYPE) {
+			if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type)))
+				cinfo->has_do_const_field = true;
+			else if (lookup_attribute("no_const", TYPE_ATTRIBUTES(type)))
+				cinfo->has_no_const_field = true;
+			else
+				constifiable(type, cinfo);
+		} else if (!TREE_READONLY(field))
+			cinfo->has_writable_field = true;
+	}
+}
+
+static bool constified(const_tree node)
+{
+	constify_info cinfo = {
+		.has_fptr_field = false,
+		.has_writable_field = false,
+		.has_do_const_field = false,
+		.has_no_const_field = false
+	};
+
+	gcc_assert(TREE_CODE(node) == RECORD_TYPE || TREE_CODE(node) == UNION_TYPE);
+
+	if (lookup_attribute("no_const", TYPE_ATTRIBUTES(node))) {
+//		gcc_assert(!TYPE_READONLY(node));
+		return false;
+	}
+
+	if (lookup_attribute("do_const", TYPE_ATTRIBUTES(node))) {
+		gcc_assert(TYPE_READONLY(node));
+		return true;
+	}
+
+	constifiable(node, &cinfo);
+	if ((!cinfo.has_fptr_field || cinfo.has_writable_field || cinfo.has_no_const_field) && !cinfo.has_do_const_field)
+		return false;
+
+	return TYPE_READONLY(node);
+}
+
+static void deconstify_tree(tree node);
+
+static void deconstify_type(tree type)
+{
+	tree field;
+
+	gcc_assert(TREE_CODE(type) == RECORD_TYPE || TREE_CODE(type) == UNION_TYPE);
+
+	for (field = TYPE_FIELDS(type); field; field = TREE_CHAIN(field)) {
+		const_tree fieldtype = get_field_type(field);
+
+		// special case handling of simple ptr-to-same-array-type members
+		if (TREE_CODE(TREE_TYPE(field)) == POINTER_TYPE) {
+			tree ptrtype = TREE_TYPE(TREE_TYPE(field));
+
+			if (TREE_TYPE(TREE_TYPE(field)) == type)
+				continue;
+			if (TREE_CODE(ptrtype) != RECORD_TYPE && TREE_CODE(ptrtype) != UNION_TYPE)
+				continue;
+			if (!constified(ptrtype))
+				continue;
+			if (TYPE_MAIN_VARIANT(ptrtype) == TYPE_MAIN_VARIANT(type))
+				TREE_TYPE(field) = build_pointer_type(build_qualified_type(type, TYPE_QUALS(ptrtype) & ~TYPE_QUAL_CONST));
+			continue;
+		}
+		if (TREE_CODE(fieldtype) != RECORD_TYPE && TREE_CODE(fieldtype) != UNION_TYPE)
+			continue;
+		if (!constified(fieldtype))
+			continue;
+
+		deconstify_tree(field);
+		TREE_READONLY(field) = 0;
+	}
+	TYPE_READONLY(type) = 0;
+	C_TYPE_FIELDS_READONLY(type) = 0;
+	if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) {
+		TYPE_ATTRIBUTES(type) = copy_list(TYPE_ATTRIBUTES(type));
+		TYPE_ATTRIBUTES(type) = remove_attribute("do_const", TYPE_ATTRIBUTES(type));
+	}
+}
+
+static void deconstify_tree(tree node)
+{
+	tree old_type, new_type, field;
+
+	old_type = TREE_TYPE(node);
+	while (TREE_CODE(old_type) == ARRAY_TYPE && TREE_CODE(TREE_TYPE(old_type)) != ARRAY_TYPE) {
+		node = TREE_TYPE(node) = copy_node(old_type);
+		old_type = TREE_TYPE(old_type);
+	}
+
+	gcc_assert(TREE_CODE(old_type) == RECORD_TYPE || TREE_CODE(old_type) == UNION_TYPE);
+	gcc_assert(TYPE_READONLY(old_type) && (TYPE_QUALS(old_type) & TYPE_QUAL_CONST));
+
+	new_type = build_qualified_type(old_type, TYPE_QUALS(old_type) & ~TYPE_QUAL_CONST);
+	TYPE_FIELDS(new_type) = copy_list(TYPE_FIELDS(new_type));
+	for (field = TYPE_FIELDS(new_type); field; field = TREE_CHAIN(field))
+		DECL_FIELD_CONTEXT(field) = new_type;
+
+	deconstify_type(new_type);
+
+	TREE_TYPE(node) = new_type;
+}
+
+static tree handle_no_const_attribute(tree *node, tree name, tree args, int flags, bool *no_add_attrs)
+{
+	tree type;
+	constify_info cinfo = {
+		.has_fptr_field = false,
+		.has_writable_field = false,
+		.has_do_const_field = false,
+		.has_no_const_field = false
+	};
+
+	*no_add_attrs = true;
+	if (TREE_CODE(*node) == FUNCTION_DECL) {
+		error("%qE attribute does not apply to functions (%qF)", name, *node);
+		return NULL_TREE;
+	}
+
+	if (TREE_CODE(*node) == PARM_DECL) {
+		error("%qE attribute does not apply to function parameters (%qD)", name, *node);
+		return NULL_TREE;
+	}
+
+	if (TREE_CODE(*node) == VAR_DECL) {
+		error("%qE attribute does not apply to variables (%qD)", name, *node);
+		return NULL_TREE;
+	}
+
+	if (TYPE_P(*node)) {
+		type = *node;
+	} else {
+		if (TREE_CODE(*node) != TYPE_DECL) {
+			error("%qE attribute does not apply to %qD (%qT)", name, *node, TREE_TYPE(*node));
+			return NULL_TREE;
+		}
+		type = TREE_TYPE(*node);
+	}
+
+	if (TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE) {
+		error("%qE attribute used on %qT applies to struct and union types only", name, type);
+		return NULL_TREE;
+	}
+
+	if (lookup_attribute(IDENTIFIER_POINTER(name), TYPE_ATTRIBUTES(type))) {
+		error("%qE attribute is already applied to the type %qT", name, type);
+		return NULL_TREE;
+	}
+
+	if (TYPE_P(*node)) {
+		if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type)))
+			error("%qE attribute used on type %qT is incompatible with 'do_const'", name, type);
+		else
+			*no_add_attrs = false;
+		return NULL_TREE;
+	}
+
+	constifiable(type, &cinfo);
+	if ((cinfo.has_fptr_field && !cinfo.has_writable_field && !cinfo.has_no_const_field) || lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) {
+		if (enabled) {
+			if TYPE_P(*node)
+				deconstify_type(*node);
+			else
+				deconstify_tree(*node);
+		}
+		if (TYPE_P(*node))
+			TYPE_CONSTIFY_VISITED(*node) = 1;
+		else
+			TYPE_CONSTIFY_VISITED(TREE_TYPE(*node)) = 1;
+		return NULL_TREE;
+	}
+
+	if (enabled && TYPE_FIELDS(type))
+		error("%qE attribute used on type %qT that is not constified", name, type);
+	return NULL_TREE;
+}
+
+static void constify_type(tree type)
+{
+	gcc_assert(type == TYPE_MAIN_VARIANT(type));
+	TYPE_READONLY(type) = 1;
+	C_TYPE_FIELDS_READONLY(type) = 1;
+	TYPE_CONSTIFY_VISITED(type) = 1;
+//	TYPE_ATTRIBUTES(type) = copy_list(TYPE_ATTRIBUTES(type));
+//	TYPE_ATTRIBUTES(type) = tree_cons(get_identifier("do_const"), NULL_TREE, TYPE_ATTRIBUTES(type));
+}
+
+static tree handle_do_const_attribute(tree *node, tree name, tree args, int flags, bool *no_add_attrs)
+{
+	*no_add_attrs = true;
+	if (!TYPE_P(*node)) {
+		error("%qE attribute applies to types only (%qD)", name, *node);
+		return NULL_TREE;
+	}
+
+	if (TREE_CODE(*node) != RECORD_TYPE && TREE_CODE(*node) != UNION_TYPE) {
+		error("%qE attribute used on %qT applies to struct and union types only", name, *node);
+		return NULL_TREE;
+	}
+
+	if (lookup_attribute(IDENTIFIER_POINTER(name), TYPE_ATTRIBUTES(*node))) {
+		error("%qE attribute used on %qT is already applied to the type", name, *node);
+		return NULL_TREE;
+	}
+
+	if (lookup_attribute("no_const", TYPE_ATTRIBUTES(*node))) {
+		error("%qE attribute used on %qT is incompatible with 'no_const'", name, *node);
+		return NULL_TREE;
+	}
+
+	*no_add_attrs = false;
+	return NULL_TREE;
+}
+
+static struct attribute_spec no_const_attr = {
+	.name			= "no_const",
+	.min_length		= 0,
+	.max_length		= 0,
+	.decl_required		= false,
+	.type_required		= false,
+	.function_type_required	= false,
+	.handler		= handle_no_const_attribute,
+#if BUILDING_GCC_VERSION >= 4007
+	.affects_type_identity	= true
+#endif
+};
+
+static struct attribute_spec do_const_attr = {
+	.name			= "do_const",
+	.min_length		= 0,
+	.max_length		= 0,
+	.decl_required		= false,
+	.type_required		= false,
+	.function_type_required	= false,
+	.handler		= handle_do_const_attribute,
+#if BUILDING_GCC_VERSION >= 4007
+	.affects_type_identity	= true
+#endif
+};
+
+static void register_attributes(void *event_data, void *data)
+{
+	register_attribute(&no_const_attr);
+	register_attribute(&do_const_attr);
+}
+
+static void finish_type(void *event_data, void *data)
+{
+	tree type = (tree)event_data;
+	constify_info cinfo = {
+		.has_fptr_field = false,
+		.has_writable_field = false,
+		.has_do_const_field = false,
+		.has_no_const_field = false
+	};
+
+	if (type == NULL_TREE || type == error_mark_node)
+		return;
+
+#if BUILDING_GCC_VERSION >= 5000
+	if (TREE_CODE(type) == ENUMERAL_TYPE)
+		return;
+#endif
+
+	if (TYPE_FIELDS(type) == NULL_TREE || TYPE_CONSTIFY_VISITED(type))
+		return;
+
+	constifiable(type, &cinfo);
+
+	if (lookup_attribute("no_const", TYPE_ATTRIBUTES(type))) {
+		if ((cinfo.has_fptr_field && !cinfo.has_writable_field && !cinfo.has_no_const_field) || cinfo.has_do_const_field) {
+			deconstify_type(type);
+			TYPE_CONSTIFY_VISITED(type) = 1;
+		} else
+			error("'no_const' attribute used on type %qT that is not constified", type);
+		return;
+	}
+
+	if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) {
+		if (!cinfo.has_writable_field && !cinfo.has_no_const_field) {
+			error("'do_const' attribute used on type %qT that is%sconstified", type, cinfo.has_fptr_field ? " " : " not ");
+			return;
+		}
+		constify_type(type);
+		return;
+	}
+
+	if (cinfo.has_fptr_field && !cinfo.has_writable_field && !cinfo.has_no_const_field) {
+		if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) {
+			error("'do_const' attribute used on type %qT that is constified", type);
+			return;
+		}
+		constify_type(type);
+		return;
+	}
+
+	deconstify_type(type);
+	TYPE_CONSTIFY_VISITED(type) = 1;
+}
+
+static bool is_constified_var(varpool_node_ptr node)
+{
+	tree var = NODE_DECL(node);
+	tree type = TREE_TYPE(var);
+
+	if (node->alias)
+		return false;
+
+	if (DECL_EXTERNAL(var))
+		return false;
+
+	// XXX handle more complex nesting of arrays/structs
+	if (TREE_CODE(type) == ARRAY_TYPE)
+		type = TREE_TYPE(type);
+
+	if (TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE)
+		return false;
+
+	if (!TYPE_READONLY(type) || !C_TYPE_FIELDS_READONLY(type))
+		return false;
+
+	if (!TYPE_CONSTIFY_VISITED(type))
+		return false;
+
+	return true;
+}
+
+static void check_section_mismatch(varpool_node_ptr node)
+{
+	tree var, section;
+	size_t i;
+	const char *name;
+
+	var = NODE_DECL(node);
+	name = get_decl_section_name(var);
+	section = lookup_attribute("section", DECL_ATTRIBUTES(var));
+	if (!section) {
+		if (name) {
+			fprintf(stderr, "DECL_SECTION [%s] ", name);
+			dump_varpool_node(stderr, node);
+			gcc_unreachable();
+		}
+		return;
+	} else
+		gcc_assert(name);
+
+//fprintf(stderr, "SECTIONAME: [%s] ", get_decl_section_name(var));
+//debug_tree(var);
+
+	gcc_assert(!TREE_CHAIN(section));
+	gcc_assert(TREE_VALUE(section));
+
+	section = TREE_VALUE(TREE_VALUE(section));
+	gcc_assert(!strcmp(TREE_STRING_POINTER(section), name));
+//debug_tree(section);
+
+	for (i = 0; i < ARRAY_SIZE(const_sections); i++)
+		if (!strcmp(const_sections[i].name, name))
+			return;
+
+	error_at(DECL_SOURCE_LOCATION(var), "constified variable %qD placed into writable section %E", var, section);
+}
+
+// this works around a gcc bug/feature where uninitialized globals
+// are moved into the .bss section regardless of any constification
+// see gcc/varasm.c:bss_initializer_p()
+static void fix_initializer(varpool_node_ptr node)
+{
+	tree var = NODE_DECL(node);
+	tree type = TREE_TYPE(var);
+
+	if (DECL_INITIAL(var))
+		return;
+
+	DECL_INITIAL(var) = build_constructor(type, NULL);
+//	inform(DECL_SOURCE_LOCATION(var), "constified variable %qE moved into .rodata", var);
+}
+
+static void check_global_variables(void *event_data, void *data)
+{
+	varpool_node_ptr node;
+
+	FOR_EACH_VARIABLE(node) {
+		if (!is_constified_var(node))
+			continue;
+
+		check_section_mismatch(node);
+		fix_initializer(node);
+	}
+}
+
+static unsigned int check_local_variables_execute(void)
+{
+	unsigned int ret = 0;
+	tree var;
+
+	unsigned int i;
+
+	FOR_EACH_LOCAL_DECL(cfun, i, var) {
+		tree type = TREE_TYPE(var);
+
+		gcc_assert(DECL_P(var));
+		if (is_global_var(var))
+			continue;
+
+		if (TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE)
+			continue;
+
+		if (!TYPE_READONLY(type) || !C_TYPE_FIELDS_READONLY(type))
+			continue;
+
+		if (!TYPE_CONSTIFY_VISITED(type))
+			continue;
+
+		error_at(DECL_SOURCE_LOCATION(var), "constified variable %qE cannot be local", var);
+		ret = 1;
+	}
+	return ret;
+}
+
+#define PASS_NAME check_local_variables
+#define NO_GATE
+#include "gcc-generate-gimple-pass.h"
+
+static unsigned int (*old_section_type_flags)(tree decl, const char *name, int reloc);
+
+static unsigned int constify_section_type_flags(tree decl, const char *name, int reloc)
+{
+	size_t i;
+
+	for (i = 0; i < ARRAY_SIZE(const_sections); i++)
+		if (!strcmp(const_sections[i].name, name))
+			return 0;
+
+	return old_section_type_flags(decl, name, reloc);
+}
+
+static void constify_start_unit(void *gcc_data, void *user_data)
+{
+//	size_t i;
+
+//	for (i = 0; i < ARRAY_SIZE(const_sections); i++)
+//		const_sections[i].section = get_unnamed_section(0, output_section_asm_op, const_sections[i].asm_op);
+//		const_sections[i].section = get_section(const_sections[i].name, 0, NULL);
+
+	old_section_type_flags = targetm.section_type_flags;
+	targetm.section_type_flags = constify_section_type_flags;
+}
+
+__visible int plugin_init(struct plugin_name_args *plugin_info, struct plugin_gcc_version *version)
+{
+	const char * const plugin_name = plugin_info->base_name;
+	const int argc = plugin_info->argc;
+	const struct plugin_argument * const argv = plugin_info->argv;
+	int i;
+
+	struct register_pass_info check_local_variables_pass_info;
+
+	check_local_variables_pass_info.pass				= make_check_local_variables_pass();
+	check_local_variables_pass_info.reference_pass_name		= "ssa";
+	check_local_variables_pass_info.ref_pass_instance_number	= 1;
+	check_local_variables_pass_info.pos_op				= PASS_POS_INSERT_BEFORE;
+
+	if (!plugin_default_version_check(version, &gcc_version)) {
+		error(G_("incompatible gcc/plugin versions"));
+		return 1;
+	}
+
+	for (i = 0; i < argc; ++i) {
+		if (!(strcmp(argv[i].key, "disable"))) {
+			enabled = false;
+			continue;
+		}
+		error(G_("unknown option '-fplugin-arg-%s-%s'"), plugin_name, argv[i].key);
+	}
+
+	if (strncmp(lang_hooks.name, "GNU C", 5) && !strncmp(lang_hooks.name, "GNU C+", 6)) {
+		inform(UNKNOWN_LOCATION, G_("%s supports C only, not %s"), plugin_name, lang_hooks.name);
+		enabled = false;
+	}
+
+	register_callback(plugin_name, PLUGIN_INFO, NULL, &const_plugin_info);
+	if (enabled) {
+		register_callback(plugin_name, PLUGIN_ALL_IPA_PASSES_START, check_global_variables, NULL);
+		register_callback(plugin_name, PLUGIN_FINISH_TYPE, finish_type, NULL);
+		register_callback(plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL, &check_local_variables_pass_info);
+		register_callback(plugin_name, PLUGIN_START_UNIT, constify_start_unit, NULL);
+	}
+	register_callback(plugin_name, PLUGIN_ATTRIBUTES, register_attributes, NULL);
+
+	return 0;
+}
-- 
2.7.4


* [RFC v2][PATCH 11/11] cgroups: force all struct cftype const
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (9 preceding siblings ...)
  2017-03-29 18:16 ` [RFC v2][PATCH 10/11] gcc-plugins: Add constify plugin Kees Cook
@ 2017-03-29 18:16 ` Kees Cook
  2017-03-29 19:00 ` [RFC v2] Introduce rare_write() infrastructure Russell King - ARM Linux
  11 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:16 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel

As found in PaX, mark struct cftype with __do_const and add helpers
to deal with rare writes. This is a more complex example of a write-rarely
structure, which needs to use list helpers and blocks of begin/end pairs
to perform the needed updates.

With this change and the constify plugin enabled, the before/after
section byte sizes show:

before:
    rodata: 0x2cc2f0
    data:   0x130d00

after:
    rodata: 0x2cf2f0 (+74478)
    data:   0x12e5c0 (-65710)

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/cgroup-defs.h |  2 +-
 kernel/cgroup/cgroup.c      | 35 +++++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6a3f850cabab..67563a80d01f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -434,7 +434,7 @@ struct cftype {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lock_class_key	lockdep_key;
 #endif
-};
+} __do_const;
 
 /*
  * Control Group subsystem type.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 48851327a15e..94188df45f96 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3058,11 +3058,11 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 	int ret;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-	key = &cft->lockdep_key;
+	key = (struct lock_class_key *)&cft->lockdep_key;
 #endif
 	kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
-				  cgroup_file_mode(cft), 0, cft->kf_ops, cft,
-				  NULL, key);
+				  cgroup_file_mode(cft), 0, cft->kf_ops,
+				  (void *)cft, NULL, key);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -3165,11 +3165,16 @@ static void cgroup_exit_cftypes(struct cftype *cfts)
 		/* free copy for custom atomic_write_len, see init_cftypes() */
 		if (cft->max_write_len && cft->max_write_len != PAGE_SIZE)
 			kfree(cft->kf_ops);
-		cft->kf_ops = NULL;
-		cft->ss = NULL;
+
+		rare_write_begin();
+		__rare_write(cft->kf_ops, NULL);
+		__rare_write(cft->ss, NULL);
 
 		/* revert flags set by cgroup core while adding @cfts */
-		cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL);
+		__rare_write(cft->flags,
+			     cft->flags & ~(__CFTYPE_ONLY_ON_DFL |
+					    __CFTYPE_NOT_ON_DFL));
+		rare_write_end();
 	}
 }
 
@@ -3200,8 +3205,10 @@ static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 			kf_ops->atomic_write_len = cft->max_write_len;
 		}
 
-		cft->kf_ops = kf_ops;
-		cft->ss = ss;
+		rare_write_begin();
+		__rare_write(cft->kf_ops, kf_ops);
+		__rare_write(cft->ss, ss);
+		rare_write_end();
 	}
 
 	return 0;
@@ -3214,7 +3221,7 @@ static int cgroup_rm_cftypes_locked(struct cftype *cfts)
 	if (!cfts || !cfts[0].ss)
 		return -ENOENT;
 
-	list_del(&cfts->node);
+	rare_list_del(&cfts->node);
 	cgroup_apply_cftypes(cfts, false);
 	cgroup_exit_cftypes(cfts);
 	return 0;
@@ -3271,7 +3278,7 @@ static int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 
 	mutex_lock(&cgroup_mutex);
 
-	list_add_tail(&cfts->node, &ss->cfts);
+	rare_list_add_tail(&cfts->node, &ss->cfts);
 	ret = cgroup_apply_cftypes(cfts, true);
 	if (ret)
 		cgroup_rm_cftypes_locked(cfts);
@@ -3292,8 +3299,10 @@ int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 {
 	struct cftype *cft;
 
+	rare_write_begin();
 	for (cft = cfts; cft && cft->name[0] != '\0'; cft++)
-		cft->flags |= __CFTYPE_ONLY_ON_DFL;
+		__rare_write(cft->flags, cft->flags | __CFTYPE_ONLY_ON_DFL);
+	rare_write_end();
 	return cgroup_add_cftypes(ss, cfts);
 }
 
@@ -3309,8 +3318,10 @@ int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 {
 	struct cftype *cft;
 
+	rare_write_begin();
 	for (cft = cfts; cft && cft->name[0] != '\0'; cft++)
-		cft->flags |= __CFTYPE_NOT_ON_DFL;
+		__rare_write(cft->flags, cft->flags | __CFTYPE_NOT_ON_DFL);
+	rare_write_end();
 	return cgroup_add_cftypes(ss, cfts);
 }
 
-- 
2.7.4


* Re: [RFC v2][PATCH 01/11] Introduce rare_write() infrastructure
  2017-03-29 18:15 ` [RFC v2][PATCH 01/11] " Kees Cook
@ 2017-03-29 18:23   ` Kees Cook
  2017-03-30  7:44     ` Ho-Eun Ryu
  2017-04-07  8:09   ` Ho-Eun Ryu
  1 sibling, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-03-29 18:23 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, x86, LKML, linux-arm-kernel

On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
> +/*
> + * Build "write rarely" infrastructure for flipping memory r/w
> + * on a per-CPU basis.
> + */
> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
> +# define __wr_rare
> +# define __wr_rare_type
> +# define __rare_write(__var, __val)    (__var = (__val))
> +# define rare_write_begin()            do { } while (0)
> +# define rare_write_end()              do { } while (0)
> +#else
> +# define __wr_rare                     __ro_after_init
> +# define __wr_rare_type                        const
> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
> +#  define __rare_write_n(dst, src, len)        ({                      \
> +               BUILD_BUG(!builtin_const(len));                 \
> +               __arch_rare_write_memcpy((dst), (src), (len));  \
> +       })
> +#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
> +# else
> +#  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
> +# endif
> +# define rare_write_begin()    __arch_rare_write_begin()
> +# define rare_write_end()      __arch_rare_write_end()
> +#endif
> +#define rare_write(__var, __val) ({                    \
> +       rare_write_begin();                             \
> +       __rare_write(__var, __val);                     \
> +       rare_write_end();                               \
> +       __var;                                          \
> +})
> +

Of course, only after sending this do I realize that the MEMCPY case
will need to be further adjusted, since it currently can't take
literals. I guess something like this needs to be done:

#define __rare_write(var, val) ({ \
    typeof(var) __src = (val);     \
    __rare_write_n(&(var), &(__src), sizeof(var)); \
})

-Kees

-- 
Kees Cook
Pixel Security


* Re: [RFC v2] Introduce rare_write() infrastructure
  2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
                   ` (10 preceding siblings ...)
  2017-03-29 18:16 ` [RFC v2][PATCH 11/11] cgroups: force all struct cftype const Kees Cook
@ 2017-03-29 19:00 ` Russell King - ARM Linux
  2017-03-29 19:14   ` Kees Cook
  11 siblings, 1 reply; 63+ messages in thread
From: Russell King - ARM Linux @ 2017-03-29 19:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, x86, linux-kernel, linux-arm-kernel

On Wed, Mar 29, 2017 at 11:15:52AM -0700, Kees Cook wrote:
> This is take 2 of an RFC series to demonstrate a possible infrastructure
> for the "write rarely" memory storage type in the kernel (patch 1). The
> intent is to further reduce the internal attack surface of the kernel
> by making more variables read-only while "at rest". This is heavily
> based on the "__read_only" portion of the KERNEXEC infrastructure from
> PaX/grsecurity, though I tried to adjust it to be more in line with
> the upstream discussions around the APIs.

How much data are we talking about here?

> As part of the series I've included both x86 support (patch 4), exactly
> as done in PaX, and ARM support (patches 5-7), similar to what is done in
> grsecurity but without support for earlier ARM CPUs. Both are lightly
> tested by me, though they need a bit more work, especially ARM as it is
> missing the correct domain marking for kernel modules.

The implementation as it stands on ARM is going to gobble up
multiples of 1MB of RAM as you need it to be section aligned due to
using domains, so if we're talking about a small amount of data,
this is very inefficient.  That also only works for non-LPAE as LPAE
is unable to use that method.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* Re: [RFC v2] Introduce rare_write() infrastructure
  2017-03-29 19:00 ` [RFC v2] Introduce rare_write() infrastructure Russell King - ARM Linux
@ 2017-03-29 19:14   ` Kees Cook
  0 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-29 19:14 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, x86, LKML, linux-arm-kernel

On Wed, Mar 29, 2017 at 12:00 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Wed, Mar 29, 2017 at 11:15:52AM -0700, Kees Cook wrote:
>> This is take 2 of an RFC series to demonstrate a possible infrastructure
>> for the "write rarely" memory storage type in the kernel (patch 1). The
>> intent is to further reduce the internal attack surface of the kernel
>> by making more variables read-only while "at rest". This is heavily
>> based on the "__read_only" portion of the KERNEXEC infrastructure from
>> PaX/grsecurity, though I tried to adjust it to be more in line with
>> the upstream discussions around the APIs.
>
> How much data are we talking about here?

The goal is to put as much sensitive stuff from .data as possible into
.rodata, which is mostly structures with function pointers, etc. I
haven't measured the "final" results, since there's still a lot of
work to do to get all the other annotations into upstream.

>> As part of the series I've included both x86 support (patch 4), exactly
>> as done in PaX, and ARM support (patches 5-7), similar to what is done in
>> grsecurity but without support for earlier ARM CPUs. Both are lightly
>> tested by me, though they need a bit more work, especially ARM as it is
>> missing the correct domain marking for kernel modules.
>
> The implementation as it stands on ARM is going to gobble up
> multiples of 1MB of RAM as you need it to be section aligned due to
> using domains, so if we're talking about a small amount of data,
> this is very inefficient.  That also only works for non-LPAE as LPAE
> is unable to use that method.

AIUI, we're just flipping the domain from DOMAIN_KERNEL to
DOMAIN_WR_RARE on the .rodata section (which is already 1MB aligned),
and (in the future if I or someone else can figure out how) on the
kernel module vmap area (which also shouldn't be changing alignment at
all).

-Kees

-- 
Kees Cook
Pixel Security


* Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-03-29 18:15 ` [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap() Kees Cook
@ 2017-03-29 22:38   ` Andy Lutomirski
  2017-03-30  1:41     ` Kees Cook
  2017-04-07  9:37   ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Andy Lutomirski @ 2017-03-29 22:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel

On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
> allows HAVE_ARCH_RARE_WRITE to work on x86.
>

> +
> +static __always_inline unsigned long __arch_rare_write_begin(void)
> +{
> +       unsigned long cr0;
> +
> +       preempt_disable();

This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
as would local_irq_disable().  There's no way that just disabling
preemption is enough.

(Also, how does this interact with perf nmis?)

--Andy


* Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-03-29 22:38   ` Andy Lutomirski
@ 2017-03-30  1:41     ` Kees Cook
  2017-04-05 23:57       ` Andy Lutomirski
  0 siblings, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-03-30  1:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel

On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>
>
>> +
>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>> +{
>> +       unsigned long cr0;
>> +
>> +       preempt_disable();
>
> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
> as would local_irq_disable().  There's no way that just disabling
> preemption is enough.
>
> (Also, how does this interact with perf nmis?)

Do you mean preempt_disable() isn't strong enough here? I'm open to
suggestions. The goal would be to make sure everything between _begin and
_end gets executed without interruption...

-Kees

-- 
Kees Cook
Pixel Security


* Re: [RFC v2][PATCH 01/11] Introduce rare_write() infrastructure
  2017-03-29 18:23   ` Kees Cook
@ 2017-03-30  7:44     ` Ho-Eun Ryu
  2017-03-30 17:02       ` Kees Cook
  0 siblings, 1 reply; 63+ messages in thread
From: Ho-Eun Ryu @ 2017-03-30  7:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, PaX Team,
	Emese Revfy, Russell King, x86, LKML, linux-arm-kernel


> On 30 Mar 2017, at 3:23 AM, Kees Cook <keescook@chromium.org> wrote:
> 
> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>> +/*
>> + * Build "write rarely" infrastructure for flipping memory r/w
>> + * on a per-CPU basis.
>> + */
>> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
>> +# define __wr_rare
>> +# define __wr_rare_type
>> +# define __rare_write(__var, __val)    (__var = (__val))
>> +# define rare_write_begin()            do { } while (0)
>> +# define rare_write_end()              do { } while (0)
>> +#else
>> +# define __wr_rare                     __ro_after_init
>> +# define __wr_rare_type                        const
>> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
>> +#  define __rare_write_n(dst, src, len)        ({                      \
>> +               BUILD_BUG(!builtin_const(len));                 \
>> +               __arch_rare_write_memcpy((dst), (src), (len));  \
>> +       })
>> +#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
>> +# else
>> +#  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
>> +# endif
>> +# define rare_write_begin()    __arch_rare_write_begin()
>> +# define rare_write_end()      __arch_rare_write_end()
>> +#endif
>> +#define rare_write(__var, __val) ({                    \
>> +       rare_write_begin();                             \
>> +       __rare_write(__var, __val);                     \
>> +       rare_write_end();                               \
>> +       __var;                                          \
>> +})
>> +
> 
> Of course, only after sending this do I realize that the MEMCPY case
> will need to be further adjusted, since it currently can't take
> literals. I guess something like this needs to be done:
> 
> #define __rare_write(var, val) ({ \
>    typeof(var) __src = (val);     \
>    __rare_write_n(&(var), &(__src), sizeof(var)); \
> })
> 

Right, and it has a problem with BUILD_BUG, which causes a compilation error when CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY is true:
BUILD_BUG is defined in <linux/bug.h>, but <linux/bug.h> includes <linux/compiler.h>, so compiler.h can't use it.

Please see the following.

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 3334fa9..3fa50e1 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -350,11 +350,11 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 # define __wr_rare                     __ro_after_init
 # define __wr_rare_type                        const
 # ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
-#  define __rare_write_n(dst, src, len)        ({                      \
-               BUILD_BUG(!builtin_const(len));                         \
-               __arch_rare_write_memcpy((dst), (src), (len));          \
+#  define __rare_write_n(var, val, len)        ({                      \
+               typeof(val) __val = val;                                \
+               __arch_rare_write_memcpy(&(var), &(__val), (len));      \
        })
-#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
+#  define __rare_write(var, val)  __rare_write_n((var), (val), sizeof(var))
 # else
 #  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
 # endif


> -Kees
> 
> -- 
> Kees Cook
> Pixel Security


* Re: [kernel-hardening] [RFC v2][PATCH 02/11] lkdtm: add test for rare_write() infrastructure
  2017-03-29 18:15 ` [RFC v2][PATCH 02/11] lkdtm: add test for " Kees Cook
@ 2017-03-30  9:34   ` Ian Campbell
  2017-03-30 16:16     ` Kees Cook
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Campbell @ 2017-03-30  9:34 UTC (permalink / raw)
  To: Kees Cook, kernel-hardening
  Cc: Mark Rutland, Andy Lutomirski, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, x86, linux-kernel, linux-arm-kernel

On Wed, 2017-03-29 at 11:15 -0700, Kees Cook wrote:
> diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
> index c7635a79341f..8fbadfa4cc34 100644
> --- a/drivers/misc/lkdtm_perms.c
> +++ b/drivers/misc/lkdtm_perms.c
> [...]
> +/* This is marked __wr_rare, so it should ultimately be .rodata. */
> +static unsigned long wr_rare __wr_rare = 0xAA66AA66;
> [...]
> +void lkdtm_WRITE_RARE_WRITE(void)
> +{
> +	/* Explicitly cast away "const" for the test. */

wr_rare isn't actually declared const above though? I don't think
__wr_rare includes a const, apologies if I missed it.

OOI, if wr_rare _were_ const then can the compiler optimise a pair
of reads spanning the rare_write? i.e. adding const to the declaration
above to get:

    static const unsigned long wr_rare __wr_rare = 0xAA66AA66;
x = wr_rare;
rare_write(wr_rare, 0xf000baaa);
y = wr_rare;

Is it possible that x == y == 0xaa66aa66 because gcc realises the x and
y came from the same const location? Have I missed a clobber somewhere
(I can't actually find a definition of __arch_rare_write_memcpy in this
series, so maybe it's in there and I missed it), or is such code expected to always cast
away the const first?

I suppose such constructs are rare in practice in the sorts of places
where rare_write is appropriate, but with aggressive inlining it could
occur as an unexpected trap for the unwary perhaps.

Ian.


* Re: [kernel-hardening] [RFC v2][PATCH 02/11] lkdtm: add test for rare_write() infrastructure
  2017-03-30  9:34   ` [kernel-hardening] " Ian Campbell
@ 2017-03-30 16:16     ` Kees Cook
  0 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-30 16:16 UTC (permalink / raw)
  To: Ian Campbell
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, x86, LKML, linux-arm-kernel

On Thu, Mar 30, 2017 at 2:34 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
> On Wed, 2017-03-29 at 11:15 -0700, Kees Cook wrote:
>> diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
>> index c7635a79341f..8fbadfa4cc34 100644
>> --- a/drivers/misc/lkdtm_perms.c
>> +++ b/drivers/misc/lkdtm_perms.c
>> [...]
>> +/* This is marked __wr_rare, so it should ultimately be .rodata. */
>> +static unsigned long wr_rare __wr_rare = 0xAA66AA66;
>> [...]
>> +void lkdtm_WRITE_RARE_WRITE(void)
>> +{
>> +     /* Explicitly cast away "const" for the test. */
>
> wr_rare isn't actually declared const above though? I don't think
> __wr_rare includes a const, apologies if I missed it.

Yeah, good point. I think this was a left-over from an earlier version
where I'd forgotten about that detail.

> OOI, if wr_rare _were_ const then can the compiler optimise a pair
> of reads spanning the rare_write? i.e. adding const to the declaration
> above to get:
>
>     static const unsigned long wr_rare __wr_rare = 0xAA66AA66;
> x = wr_rare;
> rare_write(wr_rare, 0xf000baaa);
> y = wr_rare;
>
> Is it possible that x == y == 0xaa66aa66 because gcc realises the x and
> y came from the same const location? Have I missed a clobber somewhere
> (I can't actually find a definition of __arch_rare_write_memcpy in this
> series, so maybe it's in there and I missed it), or is such code expected to always cast
> away the const first?
>
> I suppose such constructs are rare in practice in the sorts of places
> where rare_write is appropriate, but with aggressive inlining it could
> occur as an unexpected trap for the unwary perhaps.

Right, __wr_rare is actually marked as .data..ro_after_init, which gcc
effectively ignores (thinking it's part of .data), but the linker
script later movies this section into the read-only portion with
.rodata. As a result, the compiler treats it as writable, but the
storage location is actually read-only.

(And, AIUI, the constify plugin makes things read-only in a similar
way, though I think it's more subtle but still avoids the
const-optimization dangers.)

-Kees

-- 
Kees Cook
Pixel Security


* Re: [RFC v2][PATCH 01/11] Introduce rare_write() infrastructure
  2017-03-30  7:44     ` Ho-Eun Ryu
@ 2017-03-30 17:02       ` Kees Cook
  0 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-03-30 17:02 UTC (permalink / raw)
  To: Ho-Eun Ryu
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, PaX Team,
	Emese Revfy, Russell King, x86, LKML, linux-arm-kernel

On Thu, Mar 30, 2017 at 12:44 AM, Ho-Eun Ryu <hoeun.ryu@gmail.com> wrote:
>
>> On 30 Mar 2017, at 3:23 AM, Kees Cook <keescook@chromium.org> wrote:
>>
>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>> +/*
>>> + * Build "write rarely" infrastructure for flipping memory r/w
>>> + * on a per-CPU basis.
>>> + */
>>> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
>>> +# define __wr_rare
>>> +# define __wr_rare_type
>>> +# define __rare_write(__var, __val)    (__var = (__val))
>>> +# define rare_write_begin()            do { } while (0)
>>> +# define rare_write_end()              do { } while (0)
>>> +#else
>>> +# define __wr_rare                     __ro_after_init
>>> +# define __wr_rare_type                        const
>>> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
>>> +#  define __rare_write_n(dst, src, len)        ({                      \
>>> +               BUILD_BUG(!builtin_const(len));                 \
>>> +               __arch_rare_write_memcpy((dst), (src), (len));  \
>>> +       })
>>> +#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
>>> +# else
>>> +#  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
>>> +# endif
>>> +# define rare_write_begin()    __arch_rare_write_begin()
>>> +# define rare_write_end()      __arch_rare_write_end()
>>> +#endif
>>> +#define rare_write(__var, __val) ({                    \
>>> +       rare_write_begin();                             \
>>> +       __rare_write(__var, __val);                     \
>>> +       rare_write_end();                               \
>>> +       __var;                                          \
>>> +})
>>> +
>>
>> Of course, only after sending this do I realize that the MEMCPY case
>> will need to be further adjusted, since it currently can't take
>> literals. I guess something like this needs to be done:
>>
>> #define __rare_write(var, val) ({ \
>>    typeof(var) __src = (val);     \
>>    __rare_write_n(&(var), &(__src), sizeof(var)); \
>> })
>>
>
> Right, and it has a problem with BUILD_BUG, which causes a compilation error when CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY is true:
> BUILD_BUG is defined in <linux/bug.h>, but <linux/bug.h> includes <linux/compiler.h>, so compiler.h can't use it.
>
> Please see the following.
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 3334fa9..3fa50e1 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -350,11 +350,11 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  # define __wr_rare                     __ro_after_init
>  # define __wr_rare_type                        const
>  # ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
> -#  define __rare_write_n(dst, src, len)        ({                      \
> -               BUILD_BUG(!builtin_const(len));                         \
> -               __arch_rare_write_memcpy((dst), (src), (len));   \
> +#  define __rare_write_n(var, val, len)        ({                                 \
> +               typeof(val) __val = val;                                                    \
> +               __arch_rare_write_memcpy(&(var), &(__val), (len));      \
>         })
> -#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
> +#  define __rare_write(var, val)  __rare_write_n((var), (val), sizeof(var))
>  # else
> #  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
>  # endif

Thanks for the catch, I'll update this (I'll use compiletime_assert,
which is defined in compiler.h).

-Kees

-- 
Kees Cook
Pixel Security


* Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-03-30  1:41     ` Kees Cook
@ 2017-04-05 23:57       ` Andy Lutomirski
  2017-04-06  0:14         ` Kees Cook
  0 siblings, 1 reply; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-05 23:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel

On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>
>>
>>> +
>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>> +{
>>> +       unsigned long cr0;
>>> +
>>> +       preempt_disable();
>>
>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>> as would local_irq_disable().  There's no way that just disabling
>> preemption is enough.
>>
>> (Also, how does this interact with perf nmis?)
>
> Do you mean preempt_disable() isn't strong enough here? I'm open to
> suggestions. The goal would be to make sure nothing between _begin and
> _end would get executed without interruption...
>

Sorry for the very slow response.

preempt_disable() isn't strong enough to prevent interrupts, and an
interrupt here would run with WP off, causing unknown havoc.  I tend
to think that the caller should be responsible for turning off
interrupts.

--Andy


* Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-05 23:57       ` Andy Lutomirski
@ 2017-04-06  0:14         ` Kees Cook
  2017-04-06 15:59           ` Andy Lutomirski
  0 siblings, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-04-06  0:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel

On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>
>>>
>>>> +
>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>> +{
>>>> +       unsigned long cr0;
>>>> +
>>>> +       preempt_disable();
>>>
>>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>> as would local_irq_disable().  There's no way that just disabling
>>> preemption is enough.
>>>
>>> (Also, how does this interact with perf nmis?)
>>
>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>> suggestions. The goal would be to make sure everything between _begin and
>> _end gets executed without interruption...
>>
>
> Sorry for the very slow response.
>
> preempt_disable() isn't strong enough to prevent interrupts, and an
> interrupt here would run with WP off, causing unknown havoc.  I tend
> to think that the caller should be responsible for turning off
> interrupts.

So, something like:

Top-level functions:

static __always_inline void rare_write_begin(void)
{
    preempt_disable();
    local_irq_disable();
    barrier();
    __arch_rare_write_begin();
    barrier();
}

static __always_inline void rare_write_end(void)
{
    barrier();
    __arch_rare_write_end();
    barrier();
    local_irq_enable();
    preempt_enable_no_resched();
}

x86-specific helpers:

static __always_inline unsigned long __arch_rare_write_begin(void)
{
       unsigned long cr0;

       cr0 = read_cr0() ^ X86_CR0_WP;
       BUG_ON(cr0 & X86_CR0_WP);
       write_cr0(cr0);
       return cr0 ^ X86_CR0_WP;
}

static __always_inline unsigned long __arch_rare_write_end(void)
{
       unsigned long cr0;

       cr0 = read_cr0() ^ X86_CR0_WP;
       BUG_ON(!(cr0 & X86_CR0_WP));
       write_cr0(cr0);
       return cr0 ^ X86_CR0_WP;
}

I can give it a spin...

-Kees

-- 
Kees Cook
Pixel Security


* Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-06  0:14         ` Kees Cook
@ 2017-04-06 15:59           ` Andy Lutomirski
  2017-04-07  8:34             ` [kernel-hardening] " Mathias Krause
  0 siblings, 1 reply; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-06 15:59 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, kernel-hardening, Mark Rutland, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel

On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>>
>>>>
>>>>> +
>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>>> +{
>>>>> +       unsigned long cr0;
>>>>> +
>>>>> +       preempt_disable();
>>>>
>>>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>>> as would local_irq_disable().  There's no way that just disabling
>>>> preemption is enough.
>>>>
>>>> (Also, how does this interact with perf nmis?)
>>>
>>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>>> suggestions. The goal would be to make sure everything between _begin and
>>> _end gets executed without interruption...
>>>
>>
>> Sorry for the very slow response.
>>
>> preempt_disable() isn't strong enough to prevent interrupts, and an
>> interrupt here would run with WP off, causing unknown havoc.  I tend
>> to think that the caller should be responsible for turning off
>> interrupts.
>
> So, something like:
>
> Top-level functions:
>
> static __always_inline void rare_write_begin(void)
> {
>     preempt_disable();
>     local_irq_disable();
>     barrier();
>     __arch_rare_write_begin();
>     barrier();
> }

Looks good, except you don't need preempt_disable().
local_irq_disable() also disables preemption.  You might need to use
local_irq_save(), though, depending on whether any callers already
have IRQs off.

--Andy


* Re: [RFC v2][PATCH 01/11] Introduce rare_write() infrastructure
  2017-03-29 18:15 ` [RFC v2][PATCH 01/11] " Kees Cook
  2017-03-29 18:23   ` Kees Cook
@ 2017-04-07  8:09   ` Ho-Eun Ryu
  2017-04-07 20:38     ` Kees Cook
  1 sibling, 1 reply; 63+ messages in thread
From: Ho-Eun Ryu @ 2017-04-07  8:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, PaX Team,
	Emese Revfy, Russell King, x86, linux-kernel, linux-arm-kernel


> On 30 Mar 2017, at 3:15 AM, Kees Cook <keescook@chromium.org> wrote:
> 
> Several types of data storage exist in the kernel: read-write data (.data,
> .bss), read-only data (.rodata), and RO-after-init. This introduces the
> infrastructure for another type: write-rarely, which is intended for data
> that is either only rarely modified or especially security-sensitive. The
> goal is to further reduce the internal attack surface of the kernel by
> making this storage read-only when "at rest". This makes it much harder
> to be subverted by attackers who have a kernel-write flaw, since they
> cannot directly change these memory contents.
> 
> This work is heavily based on PaX and grsecurity's pax_{open,close}_kernel
> API, its __read_only annotations, its constify plugin, and the work done
> to identify sensitive structures that should be moved from .data into
> .rodata. This builds the initial infrastructure to support these kinds
> of changes, though the API and naming has been adjusted in places for
> clarity and maintainability.
> 
> Variables declared with the __wr_rare annotation will be moved to the
> .rodata section if an architecture supports CONFIG_HAVE_ARCH_WRITE_RARE.
> To change these variables, either a single rare_write() macro can be used,
> or multiple uses of __rare_write(), wrapped in a matching pair of
> rare_write_begin() and rare_write_end() macros can be used. These macros
> are expanded into the arch-specific functions that perform the actions
> needed to write to otherwise read-only memory.
> 
> As detailed in the Kconfig help, the arch-specific helpers have several
> requirements to make them sensible/safe for use by the kernel: they must
> not allow non-current CPUs to write the memory area, they must run
> non-preemptible to avoid accidentally leaving memory writable, and must
> be inline to avoid making them desirable ROP targets for attackers.
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
> arch/Kconfig             | 25 +++++++++++++++++++++++++
> include/linux/compiler.h | 32 ++++++++++++++++++++++++++++++++
> include/linux/preempt.h  |  6 ++++--
> 3 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index cd211a14a88f..5ebf62500b99 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -847,4 +847,29 @@ config STRICT_MODULE_RWX
> config ARCH_WANT_RELAX_ORDER
> 	bool
> 
> +config HAVE_ARCH_RARE_WRITE
> +	def_bool n
> +	help
> +	  An arch should select this option if it has defined the functions
> +	  __arch_rare_write_begin() and __arch_rare_write_end() to
> +	  respectively enable and disable writing to read-only memory. The
> +	  routines must meet the following requirements:
> +	  - read-only memory writing must only be available on the current
> +	    CPU (to make sure other CPUs can't race to make changes too).
> +	  - the routines must be declared inline (to discourage ROP use).
> +	  - the routines must not be preemptible (likely they will call
> +	    preempt_disable() and preempt_enable_no_resched() respectively).
> +	  - the routines must validate expected state (e.g. when enabling
> +	    writes, BUG() if writes are already be enabled).
> +
> +config HAVE_ARCH_RARE_WRITE_MEMCPY
> +	def_bool n
> +	depends on HAVE_ARCH_RARE_WRITE
> +	help
> +	  An arch should select this option if a special accessor is needed
> +	  to write to otherwise read-only memory, defined by the function
> +	  __arch_rare_write_memcpy(). Without this, the write-rarely
> +	  infrastructure will just attempt to write directly to the memory
> +	  using a const-ignoring assignment.
> +
> source "kernel/gcov/Kconfig"
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index f8110051188f..274bd03cfe9e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -336,6 +336,38 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> 	__u.__val;					\
> })
> 
> +/*
> + * Build "write rarely" infrastructure for flipping memory r/w
> + * on a per-CPU basis.
> + */
> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
> +# define __wr_rare
> +# define __wr_rare_type
> +# define __rare_write(__var, __val)	(__var = (__val))
> +# define rare_write_begin()		do { } while (0)
> +# define rare_write_end()		do { } while (0)
> +#else
> +# define __wr_rare			__ro_after_init
> +# define __wr_rare_type			const
> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
> +#  define __rare_write_n(dst, src, len)	({			\
> +		BUILD_BUG_ON(!__builtin_constant_p(len));	\
> +		__arch_rare_write_memcpy((dst), (src), (len));	\
> +	})
> +#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
> +# else
> +#  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
> +# endif
> +# define rare_write_begin()	__arch_rare_write_begin()
> +# define rare_write_end()	__arch_rare_write_end()
> +#endif
> +#define rare_write(__var, __val) ({			\
> +	rare_write_begin();				\
> +	__rare_write(__var, __val);			\
> +	rare_write_end();				\
> +	__var;						\
> +})
> +

How about having a separate header file, splitting the section annotations from the actual APIs?

include/linux/compiler.h:
    __wr_rare
    __wr_rare_type

include/linux/rare_write.h:
    __rare_write_n()
    __rare_write()
    rare_write_begin()
    rare_write_end()

OR moving all of them to include/linux/rare_write.h.

I’m writing the arm64 port of the rare_write feature and I’ve gotten stuck on some header problems for the next version of the patch.
I need some other MMU-related APIs (mostly defined in `arch/arm64/include/asm/mmu_context.h`) to implement those helpers,
but I cannot include that header in `include/linux/compiler.h` prior to the definition of the rare_write macros (it causes huge compilation errors).
As you know, `linux/compiler.h` is mostly a base for other headers, not a user of them.

I have to define the `__arch_rare_write_[begin/end/memcpy]()` functions as `static inline` in a header file somewhere in `arch/arm64/include/asm/`
to avoid making them ROP targets, and those helpers need the other MMU-related APIs.
But the helpers and the MMU-related APIs cannot be defined prior to the definition of the rare_write macros in `linux/compiler.h`.

And I think I'll need `include/linux/module.h` for module-related APIs like `is_module_address()` to support module address conversion
in `__arch_rare_write_memcpy()`, which will make the situation worse.

* make `include/linux/rare_write.h`
* the header defines the rare_write APIs
* the header includes `asm/rare_write.h` for the arch-specific helpers
* users of the rare_write feature should include `linux/rare_write.h`

OR please suggest other solutions.
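For illustration only, the proposed split might look roughly like this (a sketch, not a posted patch; the asm header would carry the static inline arch helpers so they stay inlinable):

```c
/* include/linux/rare_write.h -- sketch of the proposed layout */
#ifndef _LINUX_RARE_WRITE_H
#define _LINUX_RARE_WRITE_H

#include <linux/compiler.h>	/* __wr_rare / __wr_rare_type could stay here */
#include <linux/preempt.h>

#ifdef CONFIG_HAVE_ARCH_RARE_WRITE
# include <asm/rare_write.h>	/* static inline __arch_rare_write_*() */
# define rare_write_begin()	__arch_rare_write_begin()
# define rare_write_end()	__arch_rare_write_end()
#else
# define rare_write_begin()	do { } while (0)
# define rare_write_end()	do { } while (0)
#endif

#endif /* _LINUX_RARE_WRITE_H */
```

Users of the feature would then include `<linux/rare_write.h>` directly, leaving `linux/compiler.h` free of any MMU dependencies.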

> #endif /* __KERNEL__ */
> 
> #endif /* __ASSEMBLY__ */
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index cae461224948..4fc97aaa22ea 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -258,10 +258,12 @@ do { \
> /*
>  * Modules have no business playing preemption tricks.
>  */
> -#undef sched_preempt_enable_no_resched
> -#undef preempt_enable_no_resched
> #undef preempt_enable_no_resched_notrace
> #undef preempt_check_resched
> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
> +#undef sched_preempt_enable_no_resched
> +#undef preempt_enable_no_resched
> +#endif
> #endif
> 
> #define preempt_set_need_resched() \
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-06 15:59           ` Andy Lutomirski
@ 2017-04-07  8:34             ` Mathias Krause
  2017-04-07  9:46               ` Thomas Gleixner
  0 siblings, 1 reply; 63+ messages in thread
From: Mathias Krause @ 2017-04-07  8:34 UTC (permalink / raw)
  To: Andy Lutomirski, Kees Cook
  Cc: Andy Lutomirski, kernel-hardening, Mark Rutland, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel

On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>>>
>>>>>> +
>>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>>>> +{
>>>>>> +       unsigned long cr0;
>>>>>> +
>>>>>> +       preempt_disable();
>>>>>
>>>>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>>>> as would local_irq_disable().  There's no way that just disabling
>>>>> preemption is enough.
>>>>>
>>>>> (Also, how does this interact with perf nmis?)
>>>>
>>>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>>>> suggestions. The goal would be to make sure nothing between _begin and
>>>> _end would get executed without interruption...
>>>>
>>>
>>> Sorry for the very slow response.
>>>
>>> preempt_disable() isn't strong enough to prevent interrupts, and an
>>> interrupt here would run with WP off, causing unknown havoc.  I tend
>>> to think that the caller should be responsible for turning off
>>> interrupts.
>>
>> So, something like:
>>
>> Top-level functions:
>>
>> static __always_inline rare_write_begin(void)
>> {
>>     preempt_disable();
>>     local_irq_disable();
>>     barrier();
>>     __arch_rare_write_begin();
>>     barrier();
>> }
>
> Looks good, except you don't need preempt_disable().
> local_irq_disable() also disables preemption.  You might need to use
> local_irq_save(), though, depending on whether any callers already
> have IRQs off.

Well, doesn't look good to me. NMIs will still be able to interrupt
this code and will run with CR0.WP = 0.

Shouldn't you instead question yourself why PaX can do it "just" with
preempt_disable() instead?!

Cheers,
Mathias

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC v2][PATCH 08/11] ARM: Implement __arch_rare_write_begin/end()
  2017-03-29 18:16 ` [RFC v2][PATCH 08/11] ARM: Implement __arch_rare_write_begin/end() Kees Cook
@ 2017-04-07  9:36   ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2017-04-07  9:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, x86, linux-kernel,
	linux-arm-kernel

On Wed, Mar 29, 2017 at 11:16:00AM -0700, Kees Cook wrote:
> +static inline unsigned long __arch_rare_write_end(void)
> +{
> +	BUG_ON(test_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT));
> +	modify_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT);
> +	preempt_enable_no_resched();

NAK

> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-03-29 18:15 ` [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap() Kees Cook
  2017-03-29 22:38   ` Andy Lutomirski
@ 2017-04-07  9:37   ` Peter Zijlstra
  1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2017-04-07  9:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, Hoeun Ryu,
	PaX Team, Emese Revfy, Russell King, x86, linux-kernel,
	linux-arm-kernel

On Wed, Mar 29, 2017 at 11:15:56AM -0700, Kees Cook wrote:
> +static __always_inline unsigned long __arch_rare_write_end(void)
> +{
> +	unsigned long cr0;
> +
> +	barrier();
> +	cr0 = read_cr0() ^ X86_CR0_WP;
> +	BUG_ON(!(cr0 & X86_CR0_WP));
> +	write_cr0(cr0);
> +	barrier();
> +	preempt_enable_no_resched();

NAK

> +	return cr0 ^ X86_CR0_WP;
> +}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07  8:34             ` [kernel-hardening] " Mathias Krause
@ 2017-04-07  9:46               ` Thomas Gleixner
  2017-04-07 10:51                 ` Mathias Krause
  2017-04-07 19:52                 ` PaX Team
  0 siblings, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2017-04-07  9:46 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> >> static __always_inline rare_write_begin(void)
> >> {
> >>     preempt_disable();
> >>     local_irq_disable();
> >>     barrier();
> >>     __arch_rare_write_begin();
> >>     barrier();
> >> }
> >
> > Looks good, except you don't need preempt_disable().
> > local_irq_disable() also disables preemption.  You might need to use
> > local_irq_save(), though, depending on whether any callers already
> > have IRQs off.
> 
> Well, doesn't look good to me. NMIs will still be able to interrupt
> this code and will run with CR0.WP = 0.
> 
> Shouldn't you instead question yourself why PaX can do it "just" with
> preempt_disable() instead?!

That's silly. Just because PaX does it, doesn't mean it's correct. To be
honest, playing games with the CR0.WP bit is outright stupid to begin with.

Whether protected by preempt_disable or local_irq_disable, to make that
work it needs CR0 handling in the exception entry/exit at the lowest
level. And that's just a nightmare maintainence wise as it's prone to be
broken over time. Aside of that it's pointless overhead for the normal case.

The proper solution is:

write_rare(ptr, val)
{
	mp = map_shadow_rw(ptr);
	*mp = val;
	unmap_shadow_rw(mp);
}

map_shadow_rw() is essentially the same thing as we do in the highmem case
where the kernel creates a shadow mapping of the user space pages via
kmap_atomic().

It's valid (at least on x86) to have a shadow map with the same page
attributes but write enabled. That does not require any fixups of CR0 and
just works.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07  9:46               ` Thomas Gleixner
@ 2017-04-07 10:51                 ` Mathias Krause
  2017-04-07 13:14                   ` Thomas Gleixner
  2017-04-07 14:45                   ` Peter Zijlstra
  2017-04-07 19:52                 ` PaX Team
  1 sibling, 2 replies; 63+ messages in thread
From: Mathias Krause @ 2017-04-07 10:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> >> static __always_inline rare_write_begin(void)
>> >> {
>> >>     preempt_disable();
>> >>     local_irq_disable();
>> >>     barrier();
>> >>     __arch_rare_write_begin();
>> >>     barrier();
>> >> }
>> >
>> > Looks good, except you don't need preempt_disable().
>> > local_irq_disable() also disables preemption.  You might need to use
>> > local_irq_save(), though, depending on whether any callers already
>> > have IRQs off.
>>
>> Well, doesn't look good to me. NMIs will still be able to interrupt
>> this code and will run with CR0.WP = 0.
>>
>> Shouldn't you instead question yourself why PaX can do it "just" with
>> preempt_disable() instead?!
>
> That's silly. Just because PaX does it, doesn't mean it's correct. To be
> honest, playing games with the CR0.WP bit is outright stupid to begin with.

Why that? It allows fast and CPU local modifications of r/o memory.
OTOH, an approach that needs to fiddle with page table entries
requires global synchronization to keep the individual TLB states in
sync. Hmm.. Not that fast, I'd say.

> Whether protected by preempt_disable or local_irq_disable, to make that
> work it needs CR0 handling in the exception entry/exit at the lowest
> level. And that's just a nightmare maintainence wise as it's prone to be
> broken over time.

It seems to be working fine for more than a decade now in PaX. So it
can't be such a big maintenance nightmare ;)

>                    Aside of that it's pointless overhead for the normal case.

>
> The proper solution is:
>
> write_rare(ptr, val)
> {
>         mp = map_shadow_rw(ptr);
>         *mp = val;
>         unmap_shadow_rw(mp);
> }
>
> map_shadow_rw() is essentially the same thing as we do in the highmem case
> where the kernel creates a shadow mapping of the user space pages via
> kmap_atomic().

The "proper solution" seems to be much slower compared to just
toggling CR0.WP (which is costly in itself, already) because of the
TLB invalidation / synchronisation involved.

> It's valid (at least on x86) to have a shadow map with the same page
> attributes but write enabled. That does not require any fixups of CR0 and
> just works.

"Just works", sure -- but it's not as tightly focused as the PaX
solution which is CPU local, while your proposed solution is globally
visible.


Cheers,
Mathias

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 10:51                 ` Mathias Krause
@ 2017-04-07 13:14                   ` Thomas Gleixner
  2017-04-07 13:30                     ` Mathias Krause
  2017-04-07 14:45                   ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2017-04-07 13:14 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Whether protected by preempt_disable or local_irq_disable, to make that
> > work it needs CR0 handling in the exception entry/exit at the lowest
> > level. And that's just a nightmare maintainence wise as it's prone to be
> > broken over time.
> 
> It seems to be working fine for more than a decade now in PaX. So it
> can't be such a big maintenance nightmare ;)

I really do not care whether PaX wants to chase and verify that over and
over. I certainly don't want to take the chance to leak CR0.WP ever and I
very much care about extra stuff to check in the entry/exit path.

> The "proper solution" seems to be much slower compared to just
> toggling CR0.WP (which is costly in itself, already) because of the
> TLB invalidation / synchronisation involved.

Why the heck should we care about rare writes being performant?

> > It's valid (at least on x86) to have a shadow map with the same page
> > attributes but write enabled. That does not require any fixups of CR0 and
> > just works.
> 
> "Just works", sure -- but it's not as tightly focused as the PaX
> solution which is CPU local, while your proposed solution is globally
> visible.

Making the world and some more writeable hardly qualifies as tightly
focussed. Making the mapping concept CPU local is not rocket science
either. The question is whether it's worth the trouble.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 13:14                   ` Thomas Gleixner
@ 2017-04-07 13:30                     ` Mathias Krause
  2017-04-07 16:14                       ` Andy Lutomirski
  2017-04-07 19:25                       ` Thomas Gleixner
  0 siblings, 2 replies; 63+ messages in thread
From: Mathias Krause @ 2017-04-07 13:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Whether protected by preempt_disable or local_irq_disable, to make that
>> > work it needs CR0 handling in the exception entry/exit at the lowest
>> > level. And that's just a nightmare maintainence wise as it's prone to be
>> > broken over time.
>>
>> It seems to be working fine for more than a decade now in PaX. So it
>> can't be such a big maintenance nightmare ;)
>
> I really do not care whether PaX wants to chase and verify that over and
> over. I certainly don't want to take the chance to leak CR0.WP ever and I
> very much care about extra stuff to check in the entry/exit path.

Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
somewhere sensible should make those "leaks" visible fast -- and their
exploitation impossible, i.e. fail hard.

>> The "proper solution" seems to be much slower compared to just
>> toggling CR0.WP (which is costly in itself, already) because of the
>> TLB invalidation / synchronisation involved.
>
> Why the heck should we care about rare writes being performant?

As soon as they stop being rare and people start extending the r/o
protection to critical data structures accessed often. Then
performance matters.

>> > It's valid (at least on x86) to have a shadow map with the same page
>> > attributes but write enabled. That does not require any fixups of CR0 and
>> > just works.
>>
>> "Just works", sure -- but it's not as tightly focused as the PaX
>> solution which is CPU local, while your proposed solution is globally
>> visible.
>
> Making the world and some more writeable hardly qualifies as tightly
> focussed. Making the mapping concept CPU local is not rocket science
> either. The question is whethers it's worth the trouble.

No, the question is if the value of the concept is well understood and
if people can see what could be done with such a strong primitive.
Apparently not...


Cheers,
Mathias

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 10:51                 ` Mathias Krause
  2017-04-07 13:14                   ` Thomas Gleixner
@ 2017-04-07 14:45                   ` Peter Zijlstra
  2017-04-10 10:29                     ` Mark Rutland
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2017-04-07 14:45 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Thomas Gleixner, Andy Lutomirski, Kees Cook, Andy Lutomirski,
	kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel

On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote:
> Why that? It allows fast and CPU local modifications of r/o memory.
> OTOH, an approach that needs to fiddle with page table entries
> requires global synchronization to keep the individual TLB states in
> sync. Hmm.. Not that fast, I'd say.

The fixmaps used for kmap_atomic are per-cpu, no global sync required.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 13:30                     ` Mathias Krause
@ 2017-04-07 16:14                       ` Andy Lutomirski
  2017-04-07 16:22                         ` Mark Rutland
                                           ` (3 more replies)
  2017-04-07 19:25                       ` Thomas Gleixner
  1 sibling, 4 replies; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-07 16:14 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Thomas Gleixner, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Fri, 7 Apr 2017, Mathias Krause wrote:
>>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> > Whether protected by preempt_disable or local_irq_disable, to make that
>>> > work it needs CR0 handling in the exception entry/exit at the lowest
>>> > level. And that's just a nightmare maintainence wise as it's prone to be
>>> > broken over time.
>>>
>>> It seems to be working fine for more than a decade now in PaX. So it
>>> can't be such a big maintenance nightmare ;)
>>
>> I really do not care whether PaX wants to chase and verify that over and
>> over. I certainly don't want to take the chance to leak CR0.WP ever and I
>> very much care about extra stuff to check in the entry/exit path.
>
> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> somewhere sensible should make those "leaks" visible fast -- and their
> exploitation impossible, i.e. fail hard.

The leaks surely exist and now we'll just add an exploitable BUG.

>>> > It's valid (at least on x86) to have a shadow map with the same page
>>> > attributes but write enabled. That does not require any fixups of CR0 and
>>> > just works.
>>>
>>> "Just works", sure -- but it's not as tightly focused as the PaX
>>> solution which is CPU local, while your proposed solution is globally
>>> visible.
>>
>> Making the world and some more writeable hardly qualifies as tightly
>> focussed. Making the mapping concept CPU local is not rocket science
>> either. The question is whether it's worth the trouble.
>
> No, the question is if the value of the concept is well understood and
> if people can see what could be done with such a strong primitive.
> Apparently not...

I think we're approaching this all wrong, actually.  The fact that x86
has this CR0.WP thing is arguably a historical accident, and the fact
that PaX uses it doesn't mean that PaX is doing it the best way for
upstream Linux.

Why don't we start at the other end and do a generic non-arch-specific
implementation: set up an mm_struct that contains an RW alias of the
relevant parts of rodata and use use_mm to access it.  (That is,
get_fs() to back up the old fs, set_fs(USER_DS),
use_mm(&rare_write_mm), do the write using copy_to_user, undo
everything.)

Then someone who cares about performance can benchmark the CR0.WP
approach against it and try to argue that it's a good idea.  This
benchmark should wait until I'm done with my PCID work, because PCID
is going to make use_mm() a whole heck of a lot faster.
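In rough pseudocode, that generic implementation might look like the following (every name here is hypothetical; `rare_write_mm` and `rodata_rw_alias()` would need real setup code behind them):

```c
/* pseudocode sketch of a generic, use_mm()-based rare write */
static void rare_write_generic(void *ptr, const void *val, size_t len)
{
	mm_segment_t old_fs = get_fs();
	void __user *alias = rodata_rw_alias(ptr);	/* hypothetical */

	set_fs(USER_DS);
	use_mm(&rare_write_mm);		/* mm with an RW alias of rodata */
	if (copy_to_user(alias, val, len))
		BUG();
	unuse_mm(&rare_write_mm);
	set_fs(old_fs);
}
```

No arch-specific begin/end hooks are needed in this model; the write window exists only while the alias mm is in use.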

--Andy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 16:14                       ` Andy Lutomirski
@ 2017-04-07 16:22                         ` Mark Rutland
  2017-04-07 19:58                         ` PaX Team
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Mark Rutland @ 2017-04-07 16:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathias Krause, Thomas Gleixner, Kees Cook, kernel-hardening,
	Hoeun Ryu, PaX Team, Emese Revfy, Russell King, X86 ML,
	linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, Apr 07, 2017 at 09:14:29AM -0700, Andy Lutomirski wrote:
> I think we're approaching this all wrong, actually.  The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.
> 
> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it.  (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)

FWIW, I completely agree with this approach.

That's largely the approach arm64 would have to take regardless (as per
Hoeun's patches), and having a consistent common implementation would be
desireable.

There are a couple of other complications to handle (e.g. a perf
interrupt coming in and trying to read from the mapping), but I think we
can handle that generically.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 13:30                     ` Mathias Krause
  2017-04-07 16:14                       ` Andy Lutomirski
@ 2017-04-07 19:25                       ` Thomas Gleixner
  1 sibling, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2017-04-07 19:25 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 7 Apr 2017, Mathias Krause wrote:
> >> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Whether protected by preempt_disable or local_irq_disable, to make that
> >> > work it needs CR0 handling in the exception entry/exit at the lowest
> >> > level. And that's just a nightmare maintainence wise as it's prone to be
> >> > broken over time.
> >>
> >> It seems to be working fine for more than a decade now in PaX. So it
> >> can't be such a big maintenance nightmare ;)
> >
> > I really do not care whether PaX wants to chase and verify that over and
> > over. I certainly don't want to take the chance to leak CR0.WP ever and I
> > very much care about extra stuff to check in the entry/exit path.
> 
> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> somewhere sensible should make those "leaks" visible fast -- and their
> exploitation impossible, i.e. fail hard.

Sure, you trade leaking WP for a potentially exploitable BUG().

> >> The "proper solution" seems to be much slower compared to just
> >> toggling CR0.WP (which is costly in itself, already) because of the
> >> TLB invalidation / synchronisation involved.
> >
> > Why the heck should we care about rare writes being performant?
> 
> As soon as they stop being rare and people start extending the r/o
> protection to critical data structures accessed often. Then
> performance matters.

Emphasis on "Then". I'm not seeing it, because no matter what you do it's
going to be slow. Aside of that increasing the usage will also increase the
chance to leak stuff. In that case I rather leak a single page mapping
temporarily than taking the chance to leak WP.

> >> > It's valid (at least on x86) to have a shadow map with the same page
> >> > attributes but write enabled. That does not require any fixups of CR0 and
> >> > just works.
> >>
> >> "Just works", sure -- but it's not as tightly focused as the PaX
> >> solution which is CPU local, while your proposed solution is globally
> >> visible.
> >
> > Making the world and some more writeable hardly qualifies as tightly
> > focussed. Making the mapping concept CPU local is not rocket science
> > either. The question is whether it's worth the trouble.
> 
> No, the question is if the value of the concept is well understood and
> if people can see what could be done with such a strong primitive.
> Apparently not...

Oh, well. We can stop that discussion right here, if all you can provide
is a killer phrase.

I'm well aware what can be done with a strong primitive and I certainly
understand the concept, but I'm not naive enough to believe that lifting
one of the strong protections the kernel has by globally disabling WP is
anything which should be even considered. That bit is a horrible
misconception and should be fused to 1.

Aside of that, if you had taken the time to figure out how kmap_atomic
stuff works then you would have noticed that it does not require cross CPU
pagetable syncs and that the mapping place can be randomized to a certain
degree. So this has neither global impact, nor does it become immediately
globally visible.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07  9:46               ` Thomas Gleixner
  2017-04-07 10:51                 ` Mathias Krause
@ 2017-04-07 19:52                 ` PaX Team
  2017-04-10  8:26                   ` Thomas Gleixner
  1 sibling, 1 reply; 63+ messages in thread
From: PaX Team @ 2017-04-07 19:52 UTC (permalink / raw)
  To: Mathias Krause, Thomas Gleixner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, Emese Revfy, Russell King, X86 ML,
	linux-kernel, linux-arm-kernel, Peter Zijlstra

On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:

> On Fri, 7 Apr 2017, Mathias Krause wrote:
> > Well, doesn't look good to me. NMIs will still be able to interrupt
> > this code and will run with CR0.WP = 0.
> > 
> > Shouldn't you instead question yourself why PaX can do it "just" with
> > preempt_disable() instead?!
> 
> That's silly. Just because PaX does it, doesn't mean it's correct.

is that FUD or do you have actionable information to share?

> To be honest, playing games with the CR0.WP bit is outright stupid to begin with.

why is that? cr0.wp has existed since the i486 and its behaviour fits my
purposes quite well; it's the best security/performance tradeoff i know of.

> Whether protected by preempt_disable or local_irq_disable, to make that
> work it needs CR0 handling in the exception entry/exit at the lowest
> level.

correct.

> And that's just a nightmare maintainence wise as it's prone to be
> broken over time.

i've got 14 years of experience of maintaining it and i never saw it break.

> Aside of that it's pointless overhead for the normal case.

unless it's optional code as the whole feature already is.

> The proper solution is:
> 
> write_rare(ptr, val)
> {
>  mp = map_shadow_rw(ptr);
>  *mp = val;
>  unmap_shadow_rw(mp);
> }

this is not *the* proper solution, but only a naive one that suffers from
the exact same need that the cr0.wp approach does and has worse performance
impact. not exactly a win...

[continuing from your next mail in order to save round-trip time]

> I really do not care whether PaX wants to chase and verify that over and
> over.

verifying it is no different than verifying, say, swapgs use.

> I certainly don't want to take the chance to leak CR0.WP ever

why and where would cr0.wp leak?

> and I very much care about extra stuff to check in the entry/exit path.

your 'proper' solution needs (a lot more) extra stuff too.

> Why the heck should we care about rare writes being performant?

because you've been misled by the NIH crowd here that the PaX feature they
tried to (badly) extract from has anything to do with frequency of writes.
it does not. what it does do is provide an environment for variables that
are conceptually writable but for security reasons should be read-only most
of the time by most of the code (ditto for the grossly misunderstood and thus
misnamed ro-after-shit). now imagine locking down the page table hierarchy
with it...

> Making the world and some more writeable hardly qualifies as tightly
> focused.

you forgot to add 'for a window of a few insns' and that the map/unmap
approach does the same under an attacker controlled ptr.

> Making the mapping concept CPU local is not rocket science
> either. The question is whether it's worth the trouble.

it is for people who care about the integrity of the kernel, and all this
read-onlyness stuff implies that some do.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 16:14                       ` Andy Lutomirski
  2017-04-07 16:22                         ` Mark Rutland
@ 2017-04-07 19:58                         ` PaX Team
  2017-04-08  4:58                           ` Andy Lutomirski
  2017-04-07 20:44                         ` Thomas Gleixner
  2017-04-08  4:21                         ` Daniel Micay
  3 siblings, 1 reply; 63+ messages in thread
From: PaX Team @ 2017-04-07 19:58 UTC (permalink / raw)
  To: Mathias Krause, Andy Lutomirski
  Cc: Thomas Gleixner, Kees Cook, Andy Lutomirski, kernel-hardening,
	Mark Rutland, Hoeun Ryu, Emese Revfy, Russell King, X86 ML,
	linux-kernel, linux-arm-kernel, Peter Zijlstra

On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:

> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Fri, 7 Apr 2017, Mathias Krause wrote:
> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> > somewhere sensible should make those "leaks" visible fast -- and their
> > exploitation impossible, i.e. fail hard.
> 
> The leaks surely exist and now we'll just add an exploitable BUG.

can you please share those leaks that 'surely exist' and CC oss-security
while at it?

> I think we're approaching this all wrong, actually.  The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.
> 
> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it.  (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)
> 
> Then someone who cares about performance can benchmark the CR0.WP
> approach against it and try to argue that it's a good idea.  This
> benchmark should wait until I'm done with my PCID work, because PCID
> is going to make use_mm() a whole heck of a lot faster.

in my measurements switching PCID hovers around 230 cycles for snb-ivb
and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
of course a whole lot more impact for switching address spaces so it'll never
be fast enough to beat cr0.wp.

* Re: [RFC v2][PATCH 01/11] Introduce rare_write() infrastructure
  2017-04-07  8:09   ` Ho-Eun Ryu
@ 2017-04-07 20:38     ` Kees Cook
  0 siblings, 0 replies; 63+ messages in thread
From: Kees Cook @ 2017-04-07 20:38 UTC (permalink / raw)
  To: Ho-Eun Ryu
  Cc: kernel-hardening, Mark Rutland, Andy Lutomirski, PaX Team,
	Emese Revfy, Russell King, x86, LKML, linux-arm-kernel

On Fri, Apr 7, 2017 at 1:09 AM, Ho-Eun Ryu <hoeun.ryu@gmail.com> wrote:
>
>> On 30 Mar 2017, at 3:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>
>> Several types of data storage exist in the kernel: read-write data (.data,
>> .bss), read-only data (.rodata), and RO-after-init. This introduces the
>> infrastructure for another type: write-rarely, which is intended for data
>> that is either only rarely modified or especially security-sensitive. The
>> goal is to further reduce the internal attack surface of the kernel by
>> making this storage read-only when "at rest". This makes it much harder
>> to be subverted by attackers who have a kernel-write flaw, since they
>> cannot directly change these memory contents.
>>
>> This work is heavily based on PaX and grsecurity's pax_{open,close}_kernel
>> API, its __read_only annotations, its constify plugin, and the work done
>> to identify sensitive structures that should be moved from .data into
>> .rodata. This builds the initial infrastructure to support these kinds
>> of changes, though the API and naming has been adjusted in places for
>> clarity and maintainability.
>>
>> Variables declared with the __wr_rare annotation will be moved to the
>> .rodata section if an architecture supports CONFIG_HAVE_ARCH_WRITE_RARE.
>> To change these variables, either a single rare_write() macro can be used,
>> or multiple uses of __rare_write(), wrapped in a matching pair of
>> rare_write_begin() and rare_write_end() macros can be used. These macros
>> are expanded into the arch-specific functions that perform the actions
>> needed to write to otherwise read-only memory.
>>
>> As detailed in the Kconfig help, the arch-specific helpers have several
>> requirements to make them sensible/safe for use by the kernel: they must
>> not allow non-current CPUs to write the memory area, they must run
>> non-preemptible to avoid accidentally leaving memory writable, and must
>> be inline to avoid making them desirable ROP targets for attackers.
>>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> ---
>> arch/Kconfig             | 25 +++++++++++++++++++++++++
>> include/linux/compiler.h | 32 ++++++++++++++++++++++++++++++++
>> include/linux/preempt.h  |  6 ++++--
>> 3 files changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index cd211a14a88f..5ebf62500b99 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -847,4 +847,29 @@ config STRICT_MODULE_RWX
>> config ARCH_WANT_RELAX_ORDER
>>       bool
>>
>> +config HAVE_ARCH_RARE_WRITE
>> +     def_bool n
>> +     help
>> +       An arch should select this option if it has defined the functions
>> +       __arch_rare_write_begin() and __arch_rare_write_end() to
>> +       respectively enable and disable writing to read-only memory. The
>> +       routines must meet the following requirements:
>> +       - read-only memory writing must only be available on the current
>> +         CPU (to make sure other CPUs can't race to make changes too).
>> +       - the routines must be declared inline (to discourage ROP use).
>> +       - the routines must not be preemptible (likely they will call
>> +         preempt_disable() and preempt_enable_no_resched() respectively).
>> +       - the routines must validate expected state (e.g. when enabling
>> +         writes, BUG() if writes are already enabled).
>> +
>> +config HAVE_ARCH_RARE_WRITE_MEMCPY
>> +     def_bool n
>> +     depends on HAVE_ARCH_RARE_WRITE
>> +     help
>> +       An arch should select this option if a special accessor is needed
>> +       to write to otherwise read-only memory, defined by the function
>> +       __arch_rare_write_memcpy(). Without this, the write-rarely
>> +       infrastructure will just attempt to write directly to the memory
>> +       using a const-ignoring assignment.
>> +
>> source "kernel/gcov/Kconfig"
>> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
>> index f8110051188f..274bd03cfe9e 100644
>> --- a/include/linux/compiler.h
>> +++ b/include/linux/compiler.h
>> @@ -336,6 +336,38 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>>       __u.__val;                                      \
>> })
>>
>> +/*
>> + * Build "write rarely" infrastructure for flipping memory r/w
>> + * on a per-CPU basis.
>> + */
>> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
>> +# define __wr_rare
>> +# define __wr_rare_type
>> +# define __rare_write(__var, __val)  (__var = (__val))
>> +# define rare_write_begin()          do { } while (0)
>> +# define rare_write_end()            do { } while (0)
>> +#else
>> +# define __wr_rare                   __ro_after_init
>> +# define __wr_rare_type                      const
>> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
>> +#  define __rare_write_n(dst, src, len)      ({                      \
>> +             BUILD_BUG_ON(!__builtin_constant_p(len));       \
>> +             __arch_rare_write_memcpy((dst), (src), (len));  \
>> +     })
>> +#  define __rare_write(var, val)  __rare_write_n(&(var), &(val), sizeof(var))
>> +# else
>> +#  define __rare_write(var, val)  ((*(typeof((typeof(var))0) *)&(var)) = (val))
>> +# endif
>> +# define rare_write_begin()  __arch_rare_write_begin()
>> +# define rare_write_end()    __arch_rare_write_end()
>> +#endif
>> +#define rare_write(__var, __val) ({                  \
>> +     rare_write_begin();                             \
>> +     __rare_write(__var, __val);                     \
>> +     rare_write_end();                               \
>> +     __var;                                          \
>> +})
>> +
>
> How about we have a separate header file splitting section annotations and the actual APIs.
>
> include/linux/compiler.h:
>     __wr_rare
>     __wr_rare_type
>
> include/linux/rare_write.h:
>     __rare_write_n()
>     __rare_write()
>     rare_write_begin()
>     rare_write_end()

Yeah, that's actually exactly what I did for the current tree:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/write-rarely

-Kees

-- 
Kees Cook
Pixel Security

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 16:14                       ` Andy Lutomirski
  2017-04-07 16:22                         ` Mark Rutland
  2017-04-07 19:58                         ` PaX Team
@ 2017-04-07 20:44                         ` Thomas Gleixner
  2017-04-07 21:20                           ` Kees Cook
  2017-04-08  4:21                         ` Daniel Micay
  3 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2017-04-07 20:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathias Krause, Kees Cook, kernel-hardening, Mark Rutland,
	Hoeun Ryu, PaX Team, Emese Revfy, Russell King, X86 ML,
	linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, 7 Apr 2017, Andy Lutomirski wrote:
> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> I really do not care whether PaX wants to chase and verify that over and
> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I
> >> very much care about extra stuff to check in the entry/exit path.
> >
> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> > somewhere sensible should make those "leaks" visible fast -- and their
> > exploitation impossible, i.e. fail hard.
> 
> The leaks surely exist and now we'll just add an exploitable BUG.

:)

> I think we're approaching this all wrong, actually.  The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.

As I said before. It should be fused to 1.

> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it.  (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)

That works too, though I'm not sure that it's more efficient than
temporarily creating and undoing a single cpu local alias mapping similar
to the kmap_atomic mechanism in the highmem case.

Aside of that the alias mapping requires a full PGDIR entry unless you want
to get into the mess of keeping yet another identity mapping up to date.

Thanks,

	tglx

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 20:44                         ` Thomas Gleixner
@ 2017-04-07 21:20                           ` Kees Cook
  2017-04-08  4:12                             ` Daniel Micay
  0 siblings, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-04-07 21:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Mathias Krause, kernel-hardening, Mark Rutland,
	Hoeun Ryu, PaX Team, Emese Revfy, Russell King, X86 ML,
	linux-kernel, linux-arm-kernel, Peter Zijlstra

On Fri, Apr 7, 2017 at 1:44 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Andy Lutomirski wrote:
>> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
>> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> I really do not care whether PaX wants to chase and verify that over and
>> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I
>> >> very much care about extra stuff to check in the entry/exit path.
>> >
>> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> > somewhere sensible should make those "leaks" visible fast -- and their
>> > exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> :)
>
>> I think we're approaching this all wrong, actually.  The fact that x86
>> has this CR0.WP thing is arguably a historical accident, and the fact
>> that PaX uses it doesn't mean that PaX is doing it the best way for
>> upstream Linux.
>
> As I said before. It should be fused to 1.
>
>> Why don't we start at the other end and do a generic non-arch-specific
>> implementation: set up an mm_struct that contains an RW alias of the
>> relevant parts of rodata and use use_mm to access it.  (That is,
>> get_fs() to back up the old fs, set_fs(USER_DS),
>> use_mm(&rare_write_mm), do the write using copy_to_user, undo
>> everything.)
>
> That works too, though I'm not sure that it's more efficient than
> temporarily creating and undoing a single cpu local alias mapping similar
> to the kmap_atomic mechanism in the highmem case.
>
> Aside of that the alias mapping requires a full PGDIR entry unless you want
> to get into the mess of keeping yet another identity mapping up to date.

I probably chose the wrong name for this feature (write rarely).
That's _usually_ true, but "sensitive_write()" was getting rather
long. The things that we need to protect with this are certainly stuff
that doesn't get much writing, but some things are just plain
sensitive (like page tables) and we should still try to be as fast as
possible with them.

I'll try to include a third example in the next posting of the series
that makes this more obvious.

I'm all for a general case for the infrastructure (as Andy and Mark
has mentioned), but I don't want to get into the situation where
people start refusing to use it because it's "too slow" (for example,
see refcount_t vs net-dev right now).

-Kees

-- 
Kees Cook
Pixel Security

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 21:20                           ` Kees Cook
@ 2017-04-08  4:12                             ` Daniel Micay
  2017-04-08  4:13                               ` Daniel Micay
  0 siblings, 1 reply; 63+ messages in thread
From: Daniel Micay @ 2017-04-08  4:12 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thomas Gleixner, Andy Lutomirski, Mathias Krause,
	kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

> I probably chose the wrong name for this feature (write rarely).
> That's _usually_ true, but "sensitive_write()" was getting rather
> long. The things that we need to protect with this are certainly stuff
> that doesn't get much writing, but some things are just plain
> sensitive (like page tables) and we should still try to be as fast as
> possible with them.

Not too late to rename it. Scoped write? I think it makes change to
use a different API than PaX for portability too, but not a different
x86 implementation. It's quite important to limit the writes to the
calling thread and it needs to perform well to be introduced widely.

> I'm all for a general case for the infrastructure (as Andy and Mark
> has mentioned), but I don't want to get into the situation where
> people start refusing to use it because it's "too slow" (for example,
> see refcount_t vs net-dev right now).

Meanwhile, the PaX implementation has improved to avoid the issues
that were brought up while only introducing a single always-predicted
(due to code placement) branch on the overflow flag. That seems to
have gone unnoticed upstream, where there's now a much slower
implementation that's not more secure, and is blocked from
introduction in areas where it's most needed based on the performance.
Not to mention that it's opt-in... which is never going to work.

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08  4:12                             ` Daniel Micay
@ 2017-04-08  4:13                               ` Daniel Micay
  0 siblings, 0 replies; 63+ messages in thread
From: Daniel Micay @ 2017-04-08  4:13 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thomas Gleixner, Andy Lutomirski, Mathias Krause,
	kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

> Not too late to rename it. Scoped write? I think it makes change to

s/change/sense/

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 16:14                       ` Andy Lutomirski
                                           ` (2 preceding siblings ...)
  2017-04-07 20:44                         ` Thomas Gleixner
@ 2017-04-08  4:21                         ` Daniel Micay
  2017-04-08  5:07                           ` Andy Lutomirski
  3 siblings, 1 reply; 63+ messages in thread
From: Daniel Micay @ 2017-04-08  4:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathias Krause, Thomas Gleixner, Kees Cook, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> somewhere sensible should make those "leaks" visible fast -- and their
>> exploitation impossible, i.e. fail hard.
>
> The leaks surely exist and now we'll just add an exploitable BUG.

That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW
with a bunch of *known* DoS bugs dealt with in grsecurity and those
were known issues that were unfixed for no apparent reason other than
keeping egos intact. It looks like there are still some left...

In that case, there also wasn't a security/performance advantage.

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 19:58                         ` PaX Team
@ 2017-04-08  4:58                           ` Andy Lutomirski
  2017-04-09 12:47                             ` PaX Team
  0 siblings, 1 reply; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-08  4:58 UTC (permalink / raw)
  To: PaX Team
  Cc: Mathias Krause, Andy Lutomirski, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
>> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> > somewhere sensible should make those "leaks" visible fast -- and their
>> > exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> can you please share those leaks that 'surely exist' and CC oss-security
> while at it?

I meant in the patchset here, not in grsecurity.  grsecurity (on very,
very brief inspection) seems to read cr0 and fix it up in
pax_enter_kernel.

>>
>> Then someone who cares about performance can benchmark the CR0.WP
>> approach against it and try to argue that it's a good idea.  This
>> benchmark should wait until I'm done with my PCID work, because PCID
>> is going to make use_mm() a whole heck of a lot faster.
>
> in my measurements switching PCID hovers around 230 cycles for snb-ivb
> and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
> of course a whole lot more impact for switching address spaces so it'll never
> be fast enough to beat cr0.wp.
>

If I'm reading this right, you're saying that a non-flushing CR3 write
is about the same cost as a CR0.WP write.  If so, then why should CR0
be preferred over the (arch-neutral) CR3 approach?  And why would
switching address spaces obviously be much slower?  There'll be a very
small number of TLB fills needed for the actual protected access.

--Andy

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08  4:21                         ` Daniel Micay
@ 2017-04-08  5:07                           ` Andy Lutomirski
  2017-04-08  7:33                             ` Daniel Micay
  2017-04-09 20:24                             ` PaX Team
  0 siblings, 2 replies; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-08  5:07 UTC (permalink / raw)
  To: Daniel Micay
  Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Fri, Apr 7, 2017 at 9:21 PM, Daniel Micay <danielmicay@gmail.com> wrote:
>>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>>> somewhere sensible should make those "leaks" visible fast -- and their
>>> exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW
> with a bunch of *known* DoS bugs dealt with in grsecurity and those
> were known issues that were unfixed for no apparent reason other than
> keeping egos intact. It looks like there are still some left...
>
> In that case, there also wasn't a security/performance advantage.

This is wildly off topic, but I think it's worth answering anyway
because there's an important point here:

grsecurity and PaX are great projects.  They have a lot of good ideas,
and they're put together quite nicely.  The upstream kernel should
*not* do things differently from they way they are in grsecurity/PaX
just because it wants to be different.  Conversely, the upstream
kernel should not do things the same way as PaX just to be more like
PaX.

Keep in mind that the upstream kernel and grsecurity/PaX operate under
different constraints.  The upstream kernel tries to keep itself clean
and to make tree-wide updates rather than keeping compatibility stuff
around.  PaX and grsecurity presumably want to retain some degree of
simplicity when porting to newer upstream versions.

In the context of virtually mapped stacks / KSTACKOVERFLOW, this
naturally leads to different solutions.  The upstream kernel had a
bunch of buggy drivers that played badly with virtually mapped stacks.
grsecurity sensibly went for the approach where the buggy drivers kept
working.  The upstream kernel went for the approach of fixing the
drivers rather than keeping a compatibility workaround.  Different
constraints, different solutions.

The point being that, if someone sends a patch to the x86 entry code
that's justified by "be like PaX" or by "be different than PaX",
that's not okay.  It needs a real justification that stands on its
own.

In the case of rare writes or pax_open_kernel [1] or whatever we want
to call it, CR3 would work without arch-specific code, and CR0 would
not.  That's an argument for CR3 that would need to be countered by
something.  (Sure, avoiding leaks either way might need arch changes.
OTOH, a *randomized* CR3-based approach might not have as much of a
leak issue to begin with.)

[1] Contrary to popular belief, I don't sit around reading grsecurity
code or config options, so I really don't know what this thing is
called.

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08  5:07                           ` Andy Lutomirski
@ 2017-04-08  7:33                             ` Daniel Micay
  2017-04-08 15:20                               ` Andy Lutomirski
  2017-04-09 20:24                             ` PaX Team
  1 sibling, 1 reply; 63+ messages in thread
From: Daniel Micay @ 2017-04-08  7:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathias Krause, Thomas Gleixner, Kees Cook, kernel-hardening,
	Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

> grsecurity and PaX are great projects.  They have a lot of good ideas,
> and they're put together quite nicely.  The upstream kernel should
> *not* do things differently from they way they are in grsecurity/PaX
> just because it wants to be different.  Conversely, the upstream
> kernel should not do things the same way as PaX just to be more like
> PaX.
>
> Keep in mind that the upstream kernel and grsecurity/PaX operate under
> different constraints.  The upstream kernel tries to keep itself clean
> and to make tree-wide updates rather than keeping compatibility stuff
> around.  PaX and grsecurity presumably want to retain some degree of
> simplicity when porting to newer upstream versions.
>
> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
> naturally leads to different solutions.  The upstream kernel had a
> bunch of buggy drivers that played badly with virtually mapped stacks.
> grsecurity sensibly went for the approach where the buggy drivers kept
> working.  The upstream kernel went for the approach of fixing the
> drivers rather than keeping a compatibility workaround.  Different
> constraints, different solutions.

Sure, but nothing has landed yet, and the PaX
implementation is a known quantity with a lot of testing. The
submitted code is aimed at rare writes to globals, but this feature is
more than that and design decisions shouldn't be based on just the
short term. Kees is sensibly landing features by submitting them a
little bit at a time, but a negative side effect of that is too much
focus on just the initial proposed usage.

As a downstream that's going to be making heavy use of mainline
security features as a base due to PaX and grsecurity becoming
commercial only private patches (*because* of upstream and the
companies involved), I hope things go in the right direction i.e. not
weaker/slower implementations than PaX. For example, while USERCOPY
isn't entirely landed upstream (missing slab whitelisting and user
offset/size), it's the base for the full feature and is going to get
more testing. On the other hand, refcount_t and the slab/page
sanitization work is 100% useless if you're willing to incorporate the
changes to do it without needless performance loss and complexity. PAN
emulation on 64-bit ARM was fresh ground while on ARMv7 a weaker
implementation was used for no better reason than clashing egos. The
upstream virtual mapped stacks will probably be perfectly good, but I
just find it to be such a waste of time and effort to reinvent it and
retread the same ground in terms of finding the bugs.

I actually care a lot more about 64-bit ARM support than I do x86, but
using a portable API for pax_open_kernel (for the simple uses at
least) is separate from choosing the underlying implementation. There
might not be a great way to do it on the architectures I care about
but that doesn't need to hinder x86. It's really not that much code...
A weaker/slower implementation for x86 also encourages the same
elsewhere.

> In the case of rare writes or pax_open_kernel [1] or whatever we want
> to call it, CR3 would work without arch-specific code, and CR0 would
> not.  That's an argument for CR3 that would need to be countered by
> something.  (Sure, avoiding leaks either way might need arch changes.
> OTOH, a *randomized* CR3-based approach might not have as much of a
> leak issue to begin with.)

By randomized do you mean just ASLR? Even if the window of opportunity
to exploit it is small, it's really not the same and this has a lot
more use than just rare writes to small global variables.

I wouldn't consider stack ASLR to be a replacement for making them
readable/writable only by the current thread either (required for
race-free return CFI on x86, at least without not using actual
returns).

> [1] Contrary to popular belief, I don't sit around reading grsecurity
> code or config options, so I really don't know what this thing is
> called.

Lots of the features aren't actually named. Maybe this could be
considered part of KERNEXEC but I don't think it is.

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08  7:33                             ` Daniel Micay
@ 2017-04-08 15:20                               ` Andy Lutomirski
  2017-04-09 10:53                                 ` Ingo Molnar
  2017-04-10 10:22                                 ` Mark Rutland
  0 siblings, 2 replies; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-08 15:20 UTC (permalink / raw)
  To: Daniel Micay
  Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> The
> submitted code is aimed at rare writes to globals, but this feature is
> more than that and design decisions shouldn't be based on just the
> short term.

Then, if you disagree with a proposed design, *explain why* in a
standalone manner.  Say what future uses a different design would
have.

> I actually care a lot more about 64-bit ARM support than I do x86, but
> using a portable API for pax_open_kernel (for the simple uses at
> least) is separate from choosing the underlying implementation. There
> might not be a great way to do it on the architectures I care about
> but that doesn't need to hinder x86. It's really not that much code...
> A weaker/slower implementation for x86 also encourages the same
> elsewhere.

No one has explained how CR0.WP is weaker or slower than my proposal.
Here's what I'm proposing:

At boot, choose a random address A.  Create an mm_struct that has a
single VMA starting at A that represents the kernel's rarely-written
section.  Compute O = (A - VA of rarely-written section).  To do a
rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

This should work on any arch that has an MMU that allows this type of
aliasing and that doesn't have PA-based protections on the
rarely-written section.

It'll be considerably slower than CR0.WP on a current x86 kernel, but,
with PCID landed, it shouldn't be much slower.  It has the added
benefit that writes to non-rare-write data using the rare-write
primitive will fail.

--Andy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08 15:20                               ` Andy Lutomirski
@ 2017-04-09 10:53                                 ` Ingo Molnar
  2017-04-10 10:22                                 ` Mark Rutland
  1 sibling, 0 replies; 63+ messages in thread
From: Ingo Molnar @ 2017-04-09 10:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Micay, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, PaX Team, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra


* Andy Lutomirski <luto@kernel.org> wrote:

> On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> > The
> > submitted code is aimed at rare writes to globals, but this feature is
> > more than that and design decisions shouldn't be based on just the
> > short term.
> 
> Then, if you disagree with a proposed design, *explain why* in a
> standalone manner.  Say what future uses a different design would
> have.
> 
> > I actually care a lot more about 64-bit ARM support than I do x86, but
> > using a portable API for pax_open_kernel (for the simple uses at
> > least) is separate from choosing the underlying implementation. There
> > might not be a great way to do it on the architectures I care about
> > but that doesn't need to hinder x86. It's really not that much code...
> > A weaker/slower implementation for x86 also encourages the same
> > elsewhere.
> 
> No one has explained how CR0.WP is weaker or slower than my proposal.
> Here's what I'm proposing:
> 
> At boot, choose a random address A.  Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section.  Compute O = (A - VA of rarely-written section).  To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

BTW., note that this is basically a pagetable based protection key variant.

> It'll be considerably slower than CR0.WP on a current x86 kernel, but, with PCID 
> landed, it shouldn't be much slower.  It has the added benefit that writes to 
> non-rare-write data using the rare-write primitive will fail.

... which is a security advantage of the use_mm() based design you suggest.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08  4:58                           ` Andy Lutomirski
@ 2017-04-09 12:47                             ` PaX Team
  2017-04-10  0:10                               ` Andy Lutomirski
  0 siblings, 1 reply; 63+ messages in thread
From: PaX Team @ 2017-04-09 12:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathias Krause, Andy Lutomirski, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:

> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
> >> Then someone who cares about performance can benchmark the CR0.WP
> >> approach against it and try to argue that it's a good idea.  This
> >> benchmark should wait until I'm done with my PCID work, because PCID
> >> is going to make use_mm() a whole heck of a lot faster.
> >
> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
> > of course a whole lot more impact for switching address spaces so it'll never
> > be fast enough to beat cr0.wp.
> >
> 
> If I'm reading this right, you're saying that a non-flushing CR3 write
> is about the same cost as a CR0.WP write.  If so, then why should CR0
> be preferred over the (arch-neutral) CR3 approach?

cr3 (page table switching) isn't arch neutral at all ;). you probably meant
the higher level primitives except they're not enough to implement the scheme
as discussed before since the enter/exit paths are very much arch dependent.

on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
writes and nothing else, use_mm suffers not only from the cr3 writes but
also locking/atomic ops and cr4 writes on its path and the inevitable TLB
entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
a fast path in the microcode and reduce its overhead by an order of magnitude.
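
For reference, the x86 primitive being costed here boils down to roughly the following kernel-side sketch; the helper names and exact shape are assumptions based on the description in this thread, not the actual PaX code, and it is not runnable outside the kernel:

```c
/* Sketch only: a CPU-local window during which supervisor writes
 * ignore pte write protection. */
static inline unsigned long __arch_rare_write_begin(void)
{
	unsigned long cr0;

	preempt_disable();
	cr0 = read_cr0();
	write_cr0(cr0 & ~X86_CR0_WP);	/* open: CR0.WP = 0 */
	return cr0;
}

static inline void __arch_rare_write_end(unsigned long cr0)
{
	write_cr0(cr0 | X86_CR0_WP);	/* close: CR0.WP = 1 */
	preempt_enable();
}
```

The two cr0 writes are the whole cost being compared against the use_mm() path above.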

>  And why would switching address spaces obviously be much slower? 
> There'll be a very small number of TLB fills needed for the actual
> protected access. 

you'll be duplicating TLB entries in the alternative PCID for both code
and data, where they will accumulate (=take room away from the normal PCID
and expose unwanted memory for access) unless you also flush them when
switching back (which then will cost even more cycles). also i'm not sure
that processors implement all the 12 PCID bits so depending on how many PCIDs
you plan to use, you could be causing even more unnecessary TLB replacements.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08  5:07                           ` Andy Lutomirski
  2017-04-08  7:33                             ` Daniel Micay
@ 2017-04-09 20:24                             ` PaX Team
  2017-04-10  0:31                               ` Andy Lutomirski
  2017-04-10 20:13                               ` Kees Cook
  1 sibling, 2 replies; 63+ messages in thread
From: PaX Team @ 2017-04-09 20:24 UTC (permalink / raw)
  To: Daniel Micay, Andy Lutomirski
  Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
> grsecurity and PaX are great projects.  They have a lot of good ideas,
> and they're put together quite nicely.  The upstream kernel should
> *not* do things differently from they way they are in grsecurity/PaX
> just because it wants to be different.  Conversely, the upstream
> kernel should not do things the same way as PaX just to be more like
> PaX.

so far, so good.

> Keep in mind that the upstream kernel and grsecurity/PaX operate under
> different constraints.  The upstream kernel tries to keep itself clean

so do we.

> and to make tree-wide updates rather that keeping compatibility stuff
> around.

so do we (e.g., fptr fixes for RAP, non-refcount atomic users, etc).

>  PaX and grsecurity presumably want to retain some degree of
> simplicity when porting to newer upstream versions.

s/simplicity/minimality/ as the code itself can be complex but that'll be
of the minimal complexity we can come up with.

> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
> naturally leads to different solutions.  The upstream kernel had a
> bunch of buggy drivers that played badly with virtually mapped stacks.
> grsecurity sensibly went for the approach where the buggy drivers kept
> working.  The upstream kernel went for the approach of fixing the
> drivers rather than keeping a compatibility workaround.  Different
> constraints, different solutions.

except that's not what happened at all. spender's first version did just
a vmalloc for the kstack like the totally NIH'd version upstream does
now. while we always anticipated buggy dma users and thus had code that
would detect them so that we could fix them, we quickly figured that the
upstream kernel wasn't quite up to snuff as we had assumed and faced with
the amount of buggy code, we went for the current vmap approach which
kept users' systems working instead of breaking them.

you're trying to imply that upstream fixed the drivers but as the facts
show, that's not true. you simply unleashed your code on the world and
hoped(?) that enough suckers would try it out during the -rc window. as
we all know several releases and almost a year later, that was a losing
bet as you still keep fixing those drivers (and something tells me that
we haven't seen the end of it). this is simply irresponsible engineering
for no technical reason.

> In the case of rare writes or pax_open_kernel [1] or whatever we want
> to call it, CR3 would work without arch-specific code, and CR0 would
> not.  That's an argument for CR3 that would need to be countered by
> something.  (Sure, avoiding leaks either way might need arch changes.
> OTOH, a *randomized* CR3-based approach might not have as much of a
> leak issue to begin with.)

i have yet to see anyone explain what they mean by 'leak' here but if it
is what i think it is then the arch specific entry/exit changes are not
optional but mandatory. see below for randomization.

[merging in your other mail as it's the same topic]

> No one has explained how CR0.WP is weaker or slower than my proposal.

you misunderstood, Daniel was talking about your use_mm approach.

> Here's what I'm proposing:
> 
> At boot, choose a random address A.

what is the threat that a random address defends against?

>  Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section.  Compute O = (A - VA of rarely-written section).  To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

the problem is that the amount of __read_only data extends beyond vmlinux,
i.e., this approach won't scale. another problem is that it can't be used
inside use_mm and switch_mm themselves (no read-only task structs or percpu
pgd for you ;) and probably several other contexts.

last but not least, use_mm says this about itself:

    (Note: this routine is intended to be called only
    from a kernel thread context)

so using it will need some engineering (or the comment be fixed).

> This should work on any arch that has an MMU that allows this type of
> aliasing and that doesn't have PA-based protections on the rarely-written
> section.

you didn't address the arch-specific changes needed in the enter/exit paths.

> It'll be considerably slower than CR0.WP on a current x86 kernel, but,
> with PCID landed, it shouldn't be much slower.

based on my experience with UDEREF on amd64, unfortunately PCID isn't all
it's cracked up to be (IIRC, it maybe halved the UDEREF overhead instead of
basically eliminating it as i had anticipated, and that was on snb, ivb and
later do even worse).

> It has the added benefit that writes to non-rare-write data using the
> rare-write primitive will fail.

what is the threat model you're assuming for this feature? based on what i
have for PaX (arbitrary read/write access exploited for data-only attacks),
the above makes no sense to me...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-09 12:47                             ` PaX Team
@ 2017-04-10  0:10                               ` Andy Lutomirski
  2017-04-10 10:42                                 ` PaX Team
  0 siblings, 1 reply; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-10  0:10 UTC (permalink / raw)
  To: PaX Team
  Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
>> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>> >> Then someone who cares about performance can benchmark the CR0.WP
>> >> approach against it and try to argue that it's a good idea.  This
>> >> benchmark should wait until I'm done with my PCID work, because PCID
>> >> is going to make use_mm() a whole heck of a lot faster.
>> >
>> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
>> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
>> > of course a whole lot more impact for switching address spaces so it'll never
>> > be fast enough to beat cr0.wp.
>> >
>>
>> If I'm reading this right, you're saying that a non-flushing CR3 write
>> is about the same cost as a CR0.WP write.  If so, then why should CR0
>> be preferred over the (arch-neutral) CR3 approach?
>
> cr3 (page table switching) isn't arch neutral at all ;). you probably meant
> the higher level primitives except they're not enough to implement the scheme
> as discussed before since the enter/exit paths are very much arch dependent.

Yes.

>
> on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> writes and nothing else, use_mm suffers not only from the cr3 writes but
> also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> a fast path in the microcode and reduce its overhead by an order of magnitude.
>

If the CR4 writes happen for this use case, that's a bug.

>>  And why would switching address spaces obviously be much slower?
>> There'll be a very small number of TLB fills needed for the actual
>> protected access.
>
> you'll be duplicating TLB entries in the alternative PCID for both code
> and data, where they will accumulate (=take room away from the normal PCID
> and expose unwanted memory for access) unless you also flush them when
> switching back (which then will cost even more cycles). also i'm not sure
> that processors implement all the 12 PCID bits so depending on how many PCIDs
> you plan to use, you could be causing even more unnecessary TLB replacements.
>

Unless the CPU is rather dumber than I expect, the only duplicated
entries should be for the writable aliases of pages that are written.
The rest of the pages are global and should be shared for all PCIDs.

--Andy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-09 20:24                             ` PaX Team
@ 2017-04-10  0:31                               ` Andy Lutomirski
  2017-04-10 19:47                                 ` PaX Team
  2017-04-10 20:13                               ` Kees Cook
  1 sibling, 1 reply; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-10  0:31 UTC (permalink / raw)
  To: PaX Team
  Cc: Daniel Micay, Andy Lutomirski, Mathias Krause, Thomas Gleixner,
	Kees Cook, kernel-hardening, Mark Rutland, Hoeun Ryu,
	Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel, Peter Zijlstra

On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>
>> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
>> naturally leads to different solutions.  The upstream kernel had a
>> bunch of buggy drivers that played badly with virtually mapped stacks.
>> grsecurity sensibly went for the approach where the buggy drivers kept
>> working.  The upstream kernel went for the approach of fixing the
>> drivers rather than keeping a compatibility workaround.  Different
>> constraints, different solutions.
>
> except that's not what happened at all. spender's first version did just
> a vmalloc for the kstack like the totally NIH'd version upstream does
> now. while we always anticipated buggy dma users and thus had code that
> would detect them so that we could fix them, we quickly figured that the
> upstream kernel wasn't quite up to snuff as we had assumed and faced with
> the amount of buggy code, we went for the current vmap approach which
> kept users' systems working instead of breaking them.
>
> you're trying to imply that upstream fixed the drivers but as the facts
> show, that's not true. you simply unleashed your code on the world and
> hoped(?) that enough suckers would try it out during the -rc window. as
> we all know several releases and almost a year later, that was a losing
> bet as you still keep fixing those drivers (and something tells me that
> we haven't seen the end of it). this is simply irresponsible engineering
> for no technical reason.

I consider breaking buggy drivers (in a way that they either generally
work okay or that they break with a nice OOPS depending on config) to
be better than having a special case in what's supposed to be a fast
path to keep them working.  I did consider forcing the relevant debug
options on for a while just to help shake these bugs out of the woodwork
faster.

>
>> In the case of rare writes or pax_open_kernel [1] or whatever we want
>> to call it, CR3 would work without arch-specific code, and CR0 would
>> not.  That's an argument for CR3 that would need to be countered by
>> something.  (Sure, avoiding leaks either way might need arch changes.
>> OTOH, a *randomized* CR3-based approach might not have as much of a
>> leak issue to begin with.)
>
> i have yet to see anyone explain what they mean by 'leak' here but if it
> is what i think it is then the arch specific entry/exit changes are not
> optional but mandatory. see below for randomization.

By "leak" I mean that a bug or exploit causes unintended code to run
with CR0.WP or a special CR3 or a special PTE or whatever loaded.  PaX
hooks the entry code to avoid leaks.

>> At boot, choose a random address A.
>
> what is the threat that a random address defends against?

Makes it harder to exploit a case where the CR3 setting leaks.

>
>>  Create an mm_struct that has a
>> single VMA starting at A that represents the kernel's rarely-written
>> section.  Compute O = (A - VA of rarely-written section).  To do a
>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> the problem is that the amount of __read_only data extends beyond vmlinux,
> i.e., this approach won't scale. another problem is that it can't be used
> inside use_mm and switch_mm themselves (no read-only task structs or percpu
> pgd for you ;) and probably several other contexts.

Can you clarify these uses that extend beyond vmlinux?  I haven't
looked at the grsecurity patch extensively.  Are you talking about the
BPF JIT stuff?  If so, I think that should possibly be handled a bit
differently, since I think the normal
write-to-rare-write-vmlinux-sections primitive should preferably *not*
be usable to write to executable pages.  Using a real mm_struct for
this could help.

>
> last but not least, use_mm says this about itself:
>
>     (Note: this routine is intended to be called only
>     from a kernel thread context)
>
> so using it will need some engineering (or the comment be fixed).

Indeed.

>> It has the added benefit that writes to non-rare-write data using the
>> rare-write primitive will fail.
>
> what is the threat model you're assuming for this feature? based on what i
> have for PaX (arbitrary read/write access exploited for data-only attacks),
> the above makes no sense to me...
>

If I use the primitive to try to write a value to the wrong section
(write to kernel text, for example), IMO it would be nice to OOPS
instead of succeeding.

Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
function may be carefully audited by a friendly security expert
such as yourself.  It would be nice to harden the primitive to a
reasonable extent against minor misuses such as putting it in a
context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
ret.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 19:52                 ` PaX Team
@ 2017-04-10  8:26                   ` Thomas Gleixner
  2017-04-10 19:55                     ` PaX Team
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2017-04-10  8:26 UTC (permalink / raw)
  To: PaX Team
  Cc: Mathias Krause, Andy Lutomirski, Kees Cook, Andy Lutomirski,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Fri, 7 Apr 2017, PaX Team wrote:
> On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
> 
> > On Fri, 7 Apr 2017, Mathias Krause wrote:
> > > Well, doesn't look good to me. NMIs will still be able to interrupt
> > > this code and will run with CR0.WP = 0.
> > > 
> > > Shouldn't you instead question yourself why PaX can do it "just" with
> > > preempt_disable() instead?!
> > 
> > That's silly. Just because PaX does it, doesn't mean it's correct.
> 
> is that FUD or do you have actionable information to share?

That has absolutely nothing to do with FUD. I'm merely not accepting
argumentation which says: PaX can do it "just"....

That has exactly zero technical merit, and it's not asking too much to
provide precise technical arguments for why one implementation is better
than another.

> > To be honest, playing games with the CR0.WP bit is outright stupid to begin with.
> 
> why is that? cr0.wp exists since the i486 and its behaviour fits my
> purposes quite well, it's the best security/performance i know of.

"Works for me" has never been a good engineering principle.

> > Whether protected by preempt_disable or local_irq_disable, to make that
> > work it needs CR0 handling in the exception entry/exit at the lowest
> > level.
> 
> correct.
> 
> > And that's just a nightmare maintainence wise as it's prone to be
> > broken over time.
> 
> i've got 14 years of experience of maintaining it and i never saw it break.

It's a difference whether you maintain a special purpose patch set out of
tree for a subset of architectures - I certainly know what I'm talking
about - or keeping stuff sane in the upstream kernel.

> > I certainly don't want to take the chance to leak CR0.WP ever
>
> why and where would cr0.wp leak?

It's bound to happen due to some subtle mistake and up to the point where
you catch it (in the scheduler or entry/exit path) the world is
writeable. And that will be some almost never executed error path which can
be triggered by a carefully crafted attack. A very restricted writeable
region is definitely preferred over full world writeable then, right?

> > Why the heck should we care about rare writes being performant?
> 
> because you've been misled by the NIH crowd here that the PaX feature they
> tried to (badly) extract from has anything to do with frequency of writes.

It would be appreciated if you could keep your feud out of this. It's
enough to tell me that 'rare write' is a misleading term and why.

> > Making the world and some more writeable hardly qualifies as tightly
> > focused.
> 
> you forgot to add 'for a window of a few insns' and that the map/unmap

If it'd be guaranteed to be a few instructions, then I wouldn't be that
worried. The availability of make_world_writeable() as an unrestricted
usable function makes me nervous as hell. We've had long standing issues
where kmap_atomic() got leaked through a hard to spot almost never executed
error handling path. And the same is bound to happen with this, just with a
way worse outcome.

> approach does the same under an attacker controlled ptr.

Which attacker controlled pointer?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-08 15:20                               ` Andy Lutomirski
  2017-04-09 10:53                                 ` Ingo Molnar
@ 2017-04-10 10:22                                 ` Mark Rutland
  1 sibling, 0 replies; 63+ messages in thread
From: Mark Rutland @ 2017-04-10 10:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Micay, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Hoeun Ryu, PaX Team, Emese Revfy, Russell King,
	X86 ML, linux-kernel, linux-arm-kernel, Peter Zijlstra

On Sat, Apr 08, 2017 at 08:20:03AM -0700, Andy Lutomirski wrote:
> On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> > The
> > submitted code is aimed at rare writes to globals, but this feature is
> > more than that and design decisions shouldn't be based on just the
> > short term.
> 
> Then, if you disagree with a proposed design, *explain why* in a
> standalone manner.  Say what future uses a different design would
> have.
> 
> > I actually care a lot more about 64-bit ARM support than I do x86, but
> > using a portable API for pax_open_kernel (for the simple uses at
> > least) is separate from choosing the underlying implementation. There
> > might not be a great way to do it on the architectures I care about
> > but that doesn't need to hinder x86. It's really not that much code...
> > A weaker/slower implementation for x86 also encourages the same
> > elsewhere.
> 
> No one has explained how CR0.WP is weaker or slower than my proposal.
> Here's what I'm proposing:
> 
> At boot, choose a random address A.  Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section.  Compute O = (A - VA of rarely-written section).  To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
> 
> This should work on any arch that has an MMU that allows this type of
> aliasing and that doesn't have PA-based protections on the
> rarely-written section.

Modulo randomization, that sounds exactly like what I had envisaged [1],
so that makes sense to me.

Thanks,
Mark.

[1] http://www.openwall.com/lists/kernel-hardening/2016/11/18/3

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-07 14:45                   ` Peter Zijlstra
@ 2017-04-10 10:29                     ` Mark Rutland
  0 siblings, 0 replies; 63+ messages in thread
From: Mark Rutland @ 2017-04-10 10:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathias Krause, Thomas Gleixner, Andy Lutomirski, Kees Cook,
	Andy Lutomirski, kernel-hardening, Hoeun Ryu, PaX Team,
	Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel

On Fri, Apr 07, 2017 at 04:45:26PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote:
> > Why that? It allows fast and CPU local modifications of r/o memory.
> > OTOH, an approach that needs to fiddle with page table entries
> > requires global synchronization to keep the individual TLB states in
> > sync. Hmm.. Not that fast, I'd say.
> 
> The fixmaps used for kmap_atomic are per-cpu, no global sync required.

That might be fine for x86, but for some architectures fixmap slots and
kmap_atomic mappings happen to be visible to other CPUs even if they're
not required to be.

Using an mm solves that for all, though.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-10  0:10                               ` Andy Lutomirski
@ 2017-04-10 10:42                                 ` PaX Team
  2017-04-10 16:01                                   ` Andy Lutomirski
  0 siblings, 1 reply; 63+ messages in thread
From: PaX Team @ 2017-04-10 10:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On 9 Apr 2017 at 17:10, Andy Lutomirski wrote:

> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> > writes and nothing else, use_mm suffers not only from the cr3 writes but
> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> > a fast path in the microcode and reduce its overhead by an order of magnitude.
> >
> 
> If the CR4 writes happen for this use case, that's a bug.

that depends on how you plan to handle perf/rdpmc users and how many
alternative mm structs you plan to manage (one global, one per cpu,
one per mm struct, etc).

> > you'll be duplicating TLB entries in the alternative PCID for both code
> > and data, where they will accumulate (=take room away from the normal PCID
> > and expose unwanted memory for access) unless you also flush them when
> > switching back (which then will cost even more cycles). also i'm not sure
> > that processors implement all the 12 PCID bits so depending on how many PCIDs
> > you plan to use, you could be causing even more unnecessary TLB replacements.
> >
> 
> Unless the CPU is rather dumber than I expect, the only duplicated
> entries should be for the writable aliases of pages that are written.
> The rest of the pages are global and should be shared for all PCIDs.

well, 4.10.2.4 has language like this (4.10.3.2 implies similar):

   A logical processor may use a global TLB entry to translate a linear
   address, even if the TLB entry is associated with a PCID different
   from the current PCID.

that to me says that global page entries are associated with a PCID and
may (not) be used while in another PCID. in Intel-speak that's not 'dumb'
but "tricks up our sleeve that we don't really want to tell you about in
detail, except perhaps under a NDA".

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-10 10:42                                 ` PaX Team
@ 2017-04-10 16:01                                   ` Andy Lutomirski
  0 siblings, 0 replies; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-10 16:01 UTC (permalink / raw)
  To: PaX Team
  Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Mon, Apr 10, 2017 at 3:42 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 9 Apr 2017 at 17:10, Andy Lutomirski wrote:
>
>> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
>> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
>> > writes and nothing else, use_mm suffers not only from the cr3 writes but
>> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB
>> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
>> > a fast path in the microcode and reduce its overhead by an order of magnitude.
>> >
>>
>> If the CR4 writes happen for this use case, that's a bug.
>
> that depends on how you plan to handle perf/rdpmc users and how many
> alternative mm structs you plan to manage (one global, one per cpu,
> one per mm struct, etc).
>

I was thinking one global unless more are needed for some reason.

>> > you'll be duplicating TLB entries in the alternative PCID for both code
>> > and data, where they will accumulate (=take room away from the normal PCID
>> > and expose unwanted memory for access) unless you also flush them when
>> > switching back (which then will cost even more cycles). also i'm not sure
>> > that processors implement all the 12 PCID bits so depending on how many PCIDs
>> > you plan to use, you could be causing even more unnecessary TLB replacements.
>> >
>>
>> Unless the CPU is rather dumber than I expect, the only duplicated
>> entries should be for the writable aliases of pages that are written.
>> The rest of the pages are global and should be shared for all PCIDs.
>
> well, 4.10.2.4 has language like this (4.10.3.2 implies similar):
>
>    A logical processor may use a global TLB entry to translate a linear
>    address, even if the TLB entry is associated with a PCID different
>    from the current PCID.

I read this as: the CPU still semantically tags global TLB entries
with a PCID, but the CPU will (probably) use those TLB entries
even if the PCIDs don't match.  IIRC none of the TLB instructions have
any effect that makes the PCID associated with a global entry visible,
so the CPU could presumably omit the PCID tags entirely for global
entries.  E.g. I don't think there's any way to say "flush global
entries with a given PCID".

--Andy

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-10  0:31                               ` Andy Lutomirski
@ 2017-04-10 19:47                                 ` PaX Team
  2017-04-10 20:27                                   ` Andy Lutomirski
  0 siblings, 1 reply; 63+ messages in thread
From: PaX Team @ 2017-04-10 19:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Micay, Andy Lutomirski, Mathias Krause, Thomas Gleixner,
	Kees Cook, kernel-hardening, Mark Rutland, Hoeun Ryu,
	Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel, Peter Zijlstra

On 9 Apr 2017 at 17:31, Andy Lutomirski wrote:

> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
> >
> I consider breaking buggy drivers (in a way that they either generally
> work okay

do they work okay when the dma transfer goes to a buffer that crosses
physically non-contiguous page boundaries?

> or that they break with a nice OOPS depending on config) to
> be better than having a special case in what's supposed to be a fast
> path to keep them working.  I did consider forcing the relevant debug
> options on for a while just to help shake these bugs out the woodwork
> faster.

that's a false dichotomy, discovering buggy drivers is orthogonal to (not)
breaking users' systems as grsec shows. and how did you expect to 'shake
these bugs out' when your own suggestion at the time was for distros to
not enable this feature 'for a while'?

> > i have yet to see anyone explain what they mean by 'leak' here but if it
> > is what i think it is then the arch specific entry/exit changes are not
> > optional but mandatory. see below for randomization.
> 
> By "leak" I mean that a bug or exploit causes unintended code to run
> with CR0.WP or a special CR3 or a special PTE or whatever loaded.

how can a bug/exploit cause something like this?

>  PaX hooks the entry code to avoid leaks. 

PaX doesn't instrument enter/exit paths to prevent state leaks into interrupt
context (it's a useful side effect though); rather, it's needed for correctness
if the kernel can be interrupted at all while it's open (address space switching
will need to handle this too but you have yet to address it).

> >> At boot, choose a random address A.
> >
> > what is the threat that a random address defends against?
> 
> Makes it harder to exploit a case where the CR3 setting leaks.

if an attacker has the ability to cause this leak (details of which are subject
to the question i asked above) then why wouldn't he simply also make use of the
primitives to modify his target via the writable vma without ever having to know
the randomized address? i also wonder what exploit power you assume for this
attack and whether that is already enough to simply go after page tables, etc
instead of figuring out the alternative address space.

> > the problem is that the amount of __read_only data extends beyond vmlinux,
> > i.e., this approach won't scale. another problem is that it can't be used
> > inside use_mm and switch_mm themselves (no read-only task structs or percpu
> > pgd for you ;) and probably several other contexts.
> 
> Can you clarify these uses that extend beyond vmlinux?

one obvious candidate is modules. how do you want to handle them? then there's
a whole bunch of dynamically allocated data that is a candidate for __read_only
treatment.

> > what is the threat model you're assuming for this feature? based on what i
> > have for PaX (arbitrary read/write access exploited for data-only attacks),
> > the above makes no sense to me...
> 
> If I use the primitive to try to write a value to the wrong section
> (write to kernel text, for example), IMO it would be nice to OOPS
> instead of succeeding.

this doesn't tell me what power you're assuming the attacker has. is it
my generic arbitrary read-write ability or something more restricted and
thus less realistic? i.e., how does the attacker get to 'use the primitive'
and (presumably) also control the ptr/data?

as for your specific example, kernel text isn't 'non-rare-write data' that
you spoke of before, but that aside, what prevents an attacker from computing
his target ptr so that after your accessor rebases it, it'd point back to his
intended target instead? will you range-check (find_vma eventually?) each time?
how will you make all this code safe from races from another task? the more
checks you make, the more likely that something sensitive will spill to memory
and be a target itself in order to hijack the sensitive write.

> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
> function may be carefully audited by a friendly security expert
> such as yourself.  It would be nice to harden the primitive to a
> reasonable extent against minor misuses such as putting it in a
> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
> ret.

i don't understand what's there to audit. if you want to treat a given piece
of data as __read_only then you have no choice but to allow writes to it via
the open/close mechanism and the compiler can tell you just where those
writes are (and even do the instrumentation when you get tired of doing it
by hand).


* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-10  8:26                   ` Thomas Gleixner
@ 2017-04-10 19:55                     ` PaX Team
  0 siblings, 0 replies; 63+ messages in thread
From: PaX Team @ 2017-04-10 19:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Krause, Andy Lutomirski, Kees Cook, Andy Lutomirski,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On 10 Apr 2017 at 10:26, Thomas Gleixner wrote:

> On Fri, 7 Apr 2017, PaX Team wrote:
> > On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
> > > That's silly. Just because PaX does it, doesn't mean it's correct.
> > 
> > is that FUD or do you have actionable information to share?
> 
> That has absolutely nothing to do with FUD. I'm merely not accepting
> argumentations which say: PaX can do it "just"....

you implied that what PaX does may not be correct. if you can't back that
up with facts and technical arguments then it is FUD. your turn.

> That has exactly zero technical merit and it's not asking too much to
> provide precise technical arguments why one implementation is better than
> some other.

exactly. start with explaining what is not correct in PaX with "precise
technical arguments".

> > > To be honest, playing games with the CR0.WP bit is outright stupid to begin with.
> > 
> > why is that? cr0.wp exists since the i486 and its behaviour fits my
> > purposes quite well, it's the best security/performance i know of.
> 
> Works for me has never been a good engineering principle.

good thing i didn't say that. on the other hand you failed to provide "precise
technical arguments" for why "playing games with the CR0.WP bit is outright
stupid to begin with". do you have any to share and discuss?

> > > And that's just a nightmare maintainence wise as it's prone to be
> > > broken over time.
> > 
> > i've got 14 years of experience of maintaining it and i never saw it break.
> 
> It's a difference whether you maintain a special purpose patch set out of
> tree for a subset of architectures - I certainly know what I'm talking
> about - or keeping stuff sane in the upstream kernel.

there's no difference to me, i keep my stuff sane regardless. of course what
you do with your out-of-tree code is your business but don't extrapolate it
to mine. now besides argumentum ad verecundiam do you have "precise technical
arguments" as to why maintaining a cr0.wp based approach would be "a nightmare
maintainence wise as it's prone to be broken over time."?

> > > I certainly don't want to take the chance to leak CR0.WP ever
> >
> > why and where would cr0.wp leak?
> 
> It's bound to happen due to some subtle mistake

i don't see what subtle mistake you're thinking of here. can you give me
an example?

> and up to the point where you catch it (in the scheduler or entry/exit path)
> the world is writeable.

where such a leak is caught depends on what subtle mistake you're talking
about, so let's get back to this point once you answered that question.

> And that will be some almost never executed error path which can
> be triggered by a carefully crafted attack.

open/close calls have nothing to do with error paths or even conditional
execution, they're always executed as a sequence so this situation cannot
occur.

> A very restricted writeable region is definitely preferred over full
> world writeable then, right? 

it doesn't matter when the attacker has an arbitrary read/write primitive
which he can just use to modify that 'very restricted writeable region'
to whatever he needs to cover first. now if all the data managing this
region were also protected then it'd matter but that's never going to
happen in the upstream kernel.

> > > Making the world and some more writeable hardly qualifies as tightly
> > > focused.
> > 
> > you forgot to add 'for a window of a few insns' and that the map/unmap
> 
> If it'd be guaranteed to be a few instructions, then I wouldn't be that
> worried.

it is, by definition assignments to otherwise __read_only data are (have to
be) bracketed with open/close instrumentation.

> The availability of make_world_writeable() as an unrestricted
> usable function makes me nervous as hell.

i can't imagine the nightmares you must have lived through for the two decades
during which the kernel was all wide open... joking aside, we can address
these fears of yours once you explain just what kind of error situations you
have in mind and why they don't also apply to, say, text_poke().

> We've had long standing issues where kmap_atomic() got leaked through a
> hard to spot almost never executed error handling path. And the same is
> bound to happen with this, just with a way worse outcome. 

the lifetime and use of kmaps is very different so you'll have to explain
in more detail why the problems they had apply here as well.

> > approach does the same under an attacker controlled ptr.
> 
> Which attacker controlled pointer?

i meant the one passed to the map/unmap code, attacker control over it means
the whole world is effectively writable again (Andy's vma approach would restrict
this to just the interesting read-only data except the whole thing is irrelevant
until all participating data (vma, page tables, etc) are also protected).


* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-09 20:24                             ` PaX Team
  2017-04-10  0:31                               ` Andy Lutomirski
@ 2017-04-10 20:13                               ` Kees Cook
  2017-04-10 20:17                                 ` Andy Lutomirski
  1 sibling, 1 reply; 63+ messages in thread
From: Kees Cook @ 2017-04-10 20:13 UTC (permalink / raw)
  To: PaX Team
  Cc: Daniel Micay, Andy Lutomirski, Mathias Krause, Thomas Gleixner,
	kernel-hardening, Mark Rutland, Hoeun Ryu, Emese Revfy,
	Russell King, X86 ML, linux-kernel, linux-arm-kernel,
	Peter Zijlstra

On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
>> No one has explained how CR0.WP is weaker or slower than my proposal.
>
> you misunderstood, Daniel was talking about your use_mm approach.
>
>> Here's what I'm proposing:
>>
>> At boot, choose a random address A.
>
> what is the threat that a random address defends against?
>
>>  Create an mm_struct that has a
>> single VMA starting at A that represents the kernel's rarely-written
>> section.  Compute O = (A - VA of rarely-written section).  To do a
>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> the problem is that the amount of __read_only data extends beyond vmlinux,
> i.e., this approach won't scale. another problem is that it can't be used
> inside use_mm and switch_mm themselves (no read-only task structs or percpu
> pgd for you ;) and probably several other contexts.

These are the limitations that concern me: what will we NOT be able to
make read-only as a result of the use_mm() design choice? My RFC
series included a simple case and a constify case, but I did not
include things like making page tables read-only, etc.

I can accept not using cr0, since we need to design something that
works on arm64 too, which doesn't have anything like this (AFAIK), but
I'd like to make sure we don't paint ourselves into a corner.

-Kees

-- 
Kees Cook
Pixel Security


* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-10 20:13                               ` Kees Cook
@ 2017-04-10 20:17                                 ` Andy Lutomirski
  0 siblings, 0 replies; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-10 20:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: PaX Team, Daniel Micay, Andy Lutomirski, Mathias Krause,
	Thomas Gleixner, kernel-hardening, Mark Rutland, Hoeun Ryu,
	Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel, Peter Zijlstra

On Mon, Apr 10, 2017 at 1:13 PM, Kees Cook <keescook@chromium.org> wrote:
> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
>>> No one has explained how CR0.WP is weaker or slower than my proposal.
>>
>> you misunderstood, Daniel was talking about your use_mm approach.
>>
>>> Here's what I'm proposing:
>>>
>>> At boot, choose a random address A.
>>
>> what is the threat that a random address defends against?
>>
>>>  Create an mm_struct that has a
>>> single VMA starting at A that represents the kernel's rarely-written
>>> section.  Compute O = (A - VA of rarely-written section).  To do a
>>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>>
>> the problem is that the amount of __read_only data extends beyond vmlinux,
>> i.e., this approach won't scale. another problem is that it can't be used
>> inside use_mm and switch_mm themselves (no read-only task structs or percpu
>> pgd for you ;) and probably several other contexts.
>
> These are the limitations that concern me: what will we NOT be able to
> make read-only as a result of the use_mm() design choice? My RFC
> series included a simple case and a constify case, but I did not
> include things like making page tables read-only, etc.

If we make page tables read-only, we may need to have multiple levels
of rareness.  Page table writes aren't all that rare, and I can
imagine distros configuring the kernel so that static structs full of
function pointers are read-only (IMO that should be the default or
even mandatory), but page tables may be a different story.

That being said, CR3-twiddling to write to page tables could actually
work.  Hmm.


* Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
  2017-04-10 19:47                                 ` PaX Team
@ 2017-04-10 20:27                                   ` Andy Lutomirski
  0 siblings, 0 replies; 63+ messages in thread
From: Andy Lutomirski @ 2017-04-10 20:27 UTC (permalink / raw)
  To: PaX Team
  Cc: Andy Lutomirski, Daniel Micay, Mathias Krause, Thomas Gleixner,
	Kees Cook, kernel-hardening, Mark Rutland, Hoeun Ryu,
	Emese Revfy, Russell King, X86 ML, linux-kernel,
	linux-arm-kernel, Peter Zijlstra

On Mon, Apr 10, 2017 at 12:47 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 9 Apr 2017 at 17:31, Andy Lutomirski wrote:
>
>> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>> >
>> I consider breaking buggy drivers (in a way that they either generally
>> work okay
>
> do they work okay when the dma transfer goes to a buffer that crosses
> physically non-contiguous page boundaries?

Nope.  Like I said, I considered making the debugging mandatory.  I
may still send patches to do that.

>> By "leak" I mean that a bug or exploit causes unintended code to run
>> with CR0.WP or a special CR3 or a special PTE or whatever loaded.
>
> how can a bug/exploit cause something like this?

For example: a bug in entry logic, a bug in perf NMI handling, or even
a bug in *nested* perf NMI handling (egads!).  Or maybe some super
nasty interaction with suspend/resume.  These are all fairly unlikely
(except the nested perf case), but still.

As a concrete example, back before my big NMI improvement series, it
was possible for an NMI return to invoke espfix and/or take an IRET
fault.  This *shouldn't* happen on return to a context with CR0.WP
set, but it would be incredibly nasty if it did.  The code is
separated out now, so it should be okay...

>
>>  PaX hooks the entry code to avoid leaks.
>
> PaX doesn't instrument enter/exit paths to prevent state leaks into interrupt
> context (it's a useful sideeffect though), rather it's needed for correctness
> if the kernel can be interrupted at all while it's open (address space switching
> will need to handle this too but you have yet to address it).

I don't think we disagree here.  A leak would be a case of incorrectness.

>
>> >> At boot, choose a random address A.
>> >
>> > what is the threat that a random address defends against?
>>
>> Makes it harder to exploit a case where the CR3 setting leaks.
>
> if an attacker has the ability to cause this leak (details of which are subject
> to the question i asked above) then why wouldn't he simply also make use of the
> primitives to modify his target via the writable vma without ever having to know
> the randomized address? i also wonder what exploit power you assume for this
> attack and whether that is already enough to simply go after page tables, etc
> instead of figuring out the alternative address space.

I'm imagining the power to (a) cause some code path to execute while
the kernel is "open" and (b) the ability to use the buggy code path in
question to write to a fully- or partially-controlled address.  With
CR0.WP clear, this can write shellcode directly.  With CR3 pointing to
a page table that maps some parts of the kernel (but not text!) at a
randomized offset, you need to figure out the offset and find some
other target in the mapping that gets your exploit farther along.  You
can't write shellcode directly.

>
>> > the problem is that the amount of __read_only data extends beyond vmlinux,
>> > i.e., this approach won't scale. another problem is that it can't be used
>> > inside use_mm and switch_mm themselves (no read-only task structs or percpu
>> > pgd for you ;) and probably several other contexts.
>>
>> Can you clarify these uses that extend beyond vmlinux?
>
> one obvious candidate is modules. how do you want to handle them? then there's
> a whole bunch of dynamically allocated data that is a candidate for __read_only
> treatment.

Exactly the same way.  Map those regions at the same offset, maybe
even in the same VMA.  There's no reason that an artificial VMA used
for this purpose can't be many gigabytes long and have vm_ops that
only allow access to certain things.  But multiple VMAs would work,
too.

>
>> > what is the threat model you're assuming for this feature? based on what i
>> > have for PaX (arbitrary read/write access exploited for data-only attacks),
>> > the above makes no sense to me...
>>
>> If I use the primitive to try to write a value to the wrong section
>> (write to kernel text, for example), IMO it would be nice to OOPS
>> instead of succeeding.
>
> this doesn't tell me what power you're assuming the attacker has. is it
> my generic arbitrary read-write ability or something more restricted and
> thus less realistic? i.e., how does the attacker get to 'use the primitive'
> and (presumably) also control the ptr/data?
>
> as for your specific example, kernel text isn't 'non-rare-write data' that
> you spoke of before, but that aside, what prevents an attacker from computing
> his target ptr so that after your accessor rebases it, it'd point back to his
> intended target instead?

It's a restriction on what targets can be hit.  With CR0.WP, you can
hit anything that has a VA.  With CR3, you can hit only that which is
mapped.

> will you range-check (find_vma eventually?) each time?
> how will you make all this code safe from races from another task? the more
> checks you make, the more likely that something sensitive will spill to memory
> and be a target itself in order to hijack the sensitive write.

There's no code here making the checks at write time.  It's just page
table / VMA setup.

>
>> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
>> function may be carefully audited by a friendly security expert
>> such as yourself.  It would be nice to harden the primitive to a
>> reasonable extent against minor misuses such as putting it in a
>> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
>> ret.
>
> i don't understand what's there to audit. if you want to treat a given piece
> of data as __read_only then you have no choice but to allow writes to it via
> the open/close mechanism and the compiler can tell you just where those
> writes are (and even do the instrumentation when you get tired of doing it
> by hand).
>

I mean auditing all uses of pax_open_kernel() or any other function
that opens the kernel.  That function is, as used in PaX, terrifying.
PaX probably gets every user right, but I don't trust driver writers
with a function like pax_open_kernel() that's as powerful as PaX's.

Suppose you get driver code like this:

void foo(int (*func)()) {
  pax_open_kernel();
  *thingy = func();
  pax_close_kernel();
}

That would be a very, very juicy target for a ROP-like attack.  Just
get the kernel to call this function with func pointing to something
that does a memcpy or similar into executable space.  Boom, shellcode
execution.

If CR3 is used instead, exploiting this is considerably more complicated.


Thread overview: 63+ messages
2017-03-29 18:15 [RFC v2] Introduce rare_write() infrastructure Kees Cook
2017-03-29 18:15 ` [RFC v2][PATCH 01/11] " Kees Cook
2017-03-29 18:23   ` Kees Cook
2017-03-30  7:44     ` Ho-Eun Ryu
2017-03-30 17:02       ` Kees Cook
2017-04-07  8:09   ` Ho-Eun Ryu
2017-04-07 20:38     ` Kees Cook
2017-03-29 18:15 ` [RFC v2][PATCH 02/11] lkdtm: add test for " Kees Cook
2017-03-30  9:34   ` [kernel-hardening] " Ian Campbell
2017-03-30 16:16     ` Kees Cook
2017-03-29 18:15 ` [RFC v2][PATCH 03/11] net: switch sock_diag handlers to rare_write() Kees Cook
2017-03-29 18:15 ` [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap() Kees Cook
2017-03-29 22:38   ` Andy Lutomirski
2017-03-30  1:41     ` Kees Cook
2017-04-05 23:57       ` Andy Lutomirski
2017-04-06  0:14         ` Kees Cook
2017-04-06 15:59           ` Andy Lutomirski
2017-04-07  8:34             ` [kernel-hardening] " Mathias Krause
2017-04-07  9:46               ` Thomas Gleixner
2017-04-07 10:51                 ` Mathias Krause
2017-04-07 13:14                   ` Thomas Gleixner
2017-04-07 13:30                     ` Mathias Krause
2017-04-07 16:14                       ` Andy Lutomirski
2017-04-07 16:22                         ` Mark Rutland
2017-04-07 19:58                         ` PaX Team
2017-04-08  4:58                           ` Andy Lutomirski
2017-04-09 12:47                             ` PaX Team
2017-04-10  0:10                               ` Andy Lutomirski
2017-04-10 10:42                                 ` PaX Team
2017-04-10 16:01                                   ` Andy Lutomirski
2017-04-07 20:44                         ` Thomas Gleixner
2017-04-07 21:20                           ` Kees Cook
2017-04-08  4:12                             ` Daniel Micay
2017-04-08  4:13                               ` Daniel Micay
2017-04-08  4:21                         ` Daniel Micay
2017-04-08  5:07                           ` Andy Lutomirski
2017-04-08  7:33                             ` Daniel Micay
2017-04-08 15:20                               ` Andy Lutomirski
2017-04-09 10:53                                 ` Ingo Molnar
2017-04-10 10:22                                 ` Mark Rutland
2017-04-09 20:24                             ` PaX Team
2017-04-10  0:31                               ` Andy Lutomirski
2017-04-10 19:47                                 ` PaX Team
2017-04-10 20:27                                   ` Andy Lutomirski
2017-04-10 20:13                               ` Kees Cook
2017-04-10 20:17                                 ` Andy Lutomirski
2017-04-07 19:25                       ` Thomas Gleixner
2017-04-07 14:45                   ` Peter Zijlstra
2017-04-10 10:29                     ` Mark Rutland
2017-04-07 19:52                 ` PaX Team
2017-04-10  8:26                   ` Thomas Gleixner
2017-04-10 19:55                     ` PaX Team
2017-04-07  9:37   ` Peter Zijlstra
2017-03-29 18:15 ` [RFC v2][PATCH 05/11] ARM: mm: dump: Add domain to output Kees Cook
2017-03-29 18:15 ` [RFC v2][PATCH 06/11] ARM: domains: Extract common USER domain init Kees Cook
2017-03-29 18:15 ` [RFC v2][PATCH 07/11] ARM: mm: set DOMAIN_WR_RARE for rodata Kees Cook
2017-03-29 18:16 ` [RFC v2][PATCH 08/11] ARM: Implement __arch_rare_write_begin/end() Kees Cook
2017-04-07  9:36   ` Peter Zijlstra
2017-03-29 18:16 ` [RFC v2][PATCH 09/11] list: add rare_write() list helpers Kees Cook
2017-03-29 18:16 ` [RFC v2][PATCH 10/11] gcc-plugins: Add constify plugin Kees Cook
2017-03-29 18:16 ` [RFC v2][PATCH 11/11] cgroups: force all struct cftype const Kees Cook
2017-03-29 19:00 ` [RFC v2] Introduce rare_write() infrastructure Russell King - ARM Linux
2017-03-29 19:14   ` Kees Cook
