This is take 2 of an RFC series to demonstrate a possible infrastructure for the "write rarely" memory storage type in the kernel (patch 1). The intent is to further reduce the internal attack surface of the kernel by making more variables read-only while "at rest". This is heavily based on the "__read_only" portion of the KERNEXEC infrastructure from PaX/grsecurity, though I tried to adjust it to be more in line with the upstream discussions around the APIs. Also included is the PaX/grsecurity constify plugin (patch 10), which will automatically make all instances of certain structures read-only, to help demonstrate more complex cases of "write rarely" targets. (The plugin in this series is altered to only operate on marked structures, rather than performing the full automatic constification.)

As part of the series I've included both x86 support (patch 4), exactly as done in PaX, and ARM support (patches 5-8), similar to what is done in grsecurity but without support for earlier ARM CPUs. Both are lightly tested by me, though they need a bit more work, especially ARM, as it is missing the correct domain marking for kernel modules. I've added an lkdtm test (patch 2), which serves as a stand-alone example of the new infrastructure.

Included are two example "conversions" to the rare_write() style of variable manipulation: a simple one, which switches the sock_diag handler table to write-rarely during register/unregister calls (patch 3), and a more complex one, the cgroup types (patch 11), which are made read-only via the constify plugin. The latter uses rare-write linked lists (patch 9) and multi-field updates. Both examples are refactorings of what already appears in PaX/grsecurity.

The patches are:

[PATCH 01/11] Introduce rare_write() infrastructure
[PATCH 02/11] lkdtm: add test for rare_write() infrastructure
[PATCH 03/11] net: switch sock_diag handlers to rare_write()
[PATCH 04/11] x86: Implement __arch_rare_write_begin/end()
[PATCH 05/11] ARM: mm: dump: Add domain to output
[PATCH 06/11] ARM: domains: Extract common USER domain init
[PATCH 07/11] ARM: mm: set DOMAIN_WR_RARE for rodata
[PATCH 08/11] ARM: Implement __arch_rare_write_begin/end()
[PATCH 09/11] list: add rare_write() list helpers
[PATCH 10/11] gcc-plugins: Add constify plugin
[PATCH 11/11] cgroups: force all struct cftype const

-Kees
Several types of data storage exist in the kernel: read-write data (.data, .bss), read-only data (.rodata), and RO-after-init. This introduces the infrastructure for another type: write-rarely, which is intended for data that is either only rarely modified or especially security-sensitive. The goal is to further reduce the internal attack surface of the kernel by making this storage read-only when "at rest". This makes the data much harder for attackers with a kernel-write flaw to subvert, since they cannot directly change the memory contents. This work is heavily based on PaX and grsecurity's pax_{open,close}_kernel API, its __read_only annotations, its constify plugin, and the work done to identify sensitive structures that should be moved from .data into .rodata. This builds the initial infrastructure to support these kinds of changes, though the API and naming have been adjusted in places for clarity and maintainability. Variables declared with the __wr_rare annotation will be moved to the .rodata section if an architecture supports CONFIG_HAVE_ARCH_RARE_WRITE. To change these variables, either the single rare_write() macro can be used, or multiple __rare_write() calls can be wrapped in a matching pair of rare_write_begin() and rare_write_end() macros. These macros are expanded into the arch-specific functions that perform the actions needed to write to otherwise read-only memory. As detailed in the Kconfig help, the arch-specific helpers have several requirements to make them sensible/safe for use by the kernel: they must not allow non-current CPUs to write the memory area, they must run non-preemptible to avoid accidentally leaving memory writable, and must be inline to avoid making them desirable ROP targets for attackers. Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/Kconfig | 25 +++++++++++++++++++++++++ include/linux/compiler.h | 32 ++++++++++++++++++++++++++++++++ include/linux/preempt.h | 6 ++++-- 3 files changed, 61 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index cd211a14a88f..5ebf62500b99 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -847,4 +847,29 @@ config STRICT_MODULE_RWX config ARCH_WANT_RELAX_ORDER bool +config HAVE_ARCH_RARE_WRITE + def_bool n + help + An arch should select this option if it has defined the functions + __arch_rare_write_begin() and __arch_rare_write_end() to + respectively enable and disable writing to read-only memory. The + routines must meet the following requirements: + - read-only memory writing must only be available on the current + CPU (to make sure other CPUs can't race to make changes too). + - the routines must be declared inline (to discourage ROP use). + - the routines must not be preemptible (likely they will call + preempt_disable() and preempt_enable_no_resched() respectively). + - the routines must validate expected state (e.g. when enabling + writes, BUG() if writes are already enabled). + +config HAVE_ARCH_RARE_WRITE_MEMCPY + def_bool n + depends on HAVE_ARCH_RARE_WRITE + help + An arch should select this option if a special accessor is needed + to write to otherwise read-only memory, defined by the function + __arch_rare_write_memcpy(). Without this, the write-rarely + infrastructure will just attempt to write directly to the memory + using a const-ignoring assignment.
+ source "kernel/gcov/Kconfig" diff --git a/include/linux/compiler.h b/include/linux/compiler.h index f8110051188f..274bd03cfe9e 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -336,6 +336,38 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s __u.__val; \ }) +/* + * Build "write rarely" infrastructure for flipping memory r/w + * on a per-CPU basis. + */ +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE +# define __wr_rare +# define __wr_rare_type +# define __rare_write(__var, __val) (__var = (__val)) +# define rare_write_begin() do { } while (0) +# define rare_write_end() do { } while (0) +#else +# define __wr_rare __ro_after_init +# define __wr_rare_type const +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY +# define __rare_write_n(dst, src, len) ({ \ + BUILD_BUG(!builtin_const(len)); \ + __arch_rare_write_memcpy((dst), (src), (len)); \ + }) +# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var)) +# else +# define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val)) +# endif +# define rare_write_begin() __arch_rare_write_begin() +# define rare_write_end() __arch_rare_write_end() +#endif +#define rare_write(__var, __val) ({ \ + rare_write_begin(); \ + __rare_write(__var, __val); \ + rare_write_end(); \ + __var; \ +}) + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ diff --git a/include/linux/preempt.h b/include/linux/preempt.h index cae461224948..4fc97aaa22ea 100644 --- a/include/linux/preempt.h +++ b/include/linux/preempt.h @@ -258,10 +258,12 @@ do { \ /* * Modules have no business playing preemption tricks. */ -#undef sched_preempt_enable_no_resched -#undef preempt_enable_no_resched #undef preempt_enable_no_resched_notrace #undef preempt_check_resched +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE +#undef sched_preempt_enable_no_resched +#undef preempt_enable_no_resched +#endif #endif #define preempt_set_need_resched() \ -- 2.7.4
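As a review aid, here is a minimal usage sketch of the API introduced above (the variable and function names are hypothetical; only __wr_rare, rare_write(), __rare_write(), rare_write_begin(), and rare_write_end() come from this patch):

/* Lands in .rodata once an arch selects HAVE_ARCH_RARE_WRITE. */
static unsigned long event_limit __wr_rare = 100;
static unsigned long byte_limit __wr_rare = 4096;

static void set_event_limit(unsigned long new_limit)
{
	/* Single update: the macro opens and closes the write window. */
	rare_write(event_limit, new_limit);
}

static void set_limits(unsigned long events, unsigned long bytes)
{
	/* Multiple updates under a single open/close pair. */
	rare_write_begin();
	__rare_write(event_limit, events);
	__rare_write(byte_limit, bytes);
	rare_write_end();
}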
This adds the WRITE_RARE_WRITE test to validate variables marked with __wr_rare. Signed-off-by: Kees Cook <keescook@chromium.org> --- drivers/misc/lkdtm.h | 1 + drivers/misc/lkdtm_core.c | 1 + drivers/misc/lkdtm_perms.c | 19 ++++++++++++++++++- 3 files changed, 20 insertions(+), 1 deletion(-) diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h index 67d27be60405..d1fd5aefa235 100644 --- a/drivers/misc/lkdtm.h +++ b/drivers/misc/lkdtm.h @@ -39,6 +39,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void); void __init lkdtm_perms_init(void); void lkdtm_WRITE_RO(void); void lkdtm_WRITE_RO_AFTER_INIT(void); +void lkdtm_WRITE_RARE_WRITE(void); void lkdtm_WRITE_KERN(void); void lkdtm_EXEC_DATA(void); void lkdtm_EXEC_STACK(void); diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c index b9a4cd4a9b68..ac8a55947189 100644 --- a/drivers/misc/lkdtm_core.c +++ b/drivers/misc/lkdtm_core.c @@ -219,6 +219,7 @@ struct crashtype crashtypes[] = { CRASHTYPE(ACCESS_USERSPACE), CRASHTYPE(WRITE_RO), CRASHTYPE(WRITE_RO_AFTER_INIT), + CRASHTYPE(WRITE_RARE_WRITE), CRASHTYPE(WRITE_KERN), CRASHTYPE(REFCOUNT_SATURATE_INC), CRASHTYPE(REFCOUNT_SATURATE_ADD), diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c index c7635a79341f..8fbadfa4cc34 100644 --- a/drivers/misc/lkdtm_perms.c +++ b/drivers/misc/lkdtm_perms.c @@ -20,12 +20,15 @@ /* This is non-const, so it will end up in the .data section. */ static u8 data_area[EXEC_SIZE]; -/* This is cost, so it will end up in the .rodata section. */ +/* This is const, so it will end up in the .rodata section. */ static const unsigned long rodata = 0xAA55AA55; /* This is marked __ro_after_init, so it should ultimately be .rodata. */ static unsigned long ro_after_init __ro_after_init = 0x55AA5500; +/* This is marked __wr_rare, so it should ultimately be .rodata. */ +static unsigned long wr_rare __wr_rare = 0xAA66AA66; + /* * This just returns to the caller. It is designed to be copied into * non-executable memory regions. @@ -103,6 +106,20 @@ void lkdtm_WRITE_RO_AFTER_INIT(void) *ptr ^= 0xabcd1234; } +void lkdtm_WRITE_RARE_WRITE(void) +{ + /* Explicitly cast away "const" for the test. */ + unsigned long *ptr = (unsigned long *)&wr_rare; + + pr_info("attempting good rare write at %p\n", ptr); + rare_write(*ptr, 0x11335577); + if (wr_rare != 0x11335577) + pr_warn("Yikes: wr_rare did not actually change!\n"); + + pr_info("attempting bad rare write at %p\n", ptr); + *ptr ^= 0xbcd12345; +} + void lkdtm_WRITE_KERN(void) { size_t size; -- 2.7.4
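For reference, the new test can be exercised through lkdtm's existing debugfs interface; a sketch of a userspace trigger (assuming debugfs is mounted at /sys/kernel/debug):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *cmd = "WRITE_RARE_WRITE";
	int fd = open("/sys/kernel/debug/provoke-crash/DIRECT", O_WRONLY);

	if (fd < 0)
		return 1;
	/* The handler first performs a good rare_write(), then a direct
	 * write that should fault, so on a working arch this write()
	 * does not return normally. */
	write(fd, cmd, strlen(cmd));
	close(fd);
	return 0;
}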
This is a simple example of a register/unregister case for __wr_rare markings, which only needs a simple rare_write() call to make updates. Signed-off-by: Kees Cook <keescook@chromium.org> --- net/core/sock_diag.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c index 6b10573cc9fa..67253026106f 100644 --- a/net/core/sock_diag.c +++ b/net/core/sock_diag.c @@ -14,7 +14,7 @@ #include <linux/inet_diag.h> #include <linux/sock_diag.h> -static const struct sock_diag_handler *sock_diag_handlers[AF_MAX]; +static const struct sock_diag_handler *sock_diag_handlers[AF_MAX] __wr_rare; static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh); static DEFINE_MUTEX(sock_diag_table_mutex); static struct workqueue_struct *broadcast_wq; @@ -194,7 +194,7 @@ int sock_diag_register(const struct sock_diag_handler *hndl) if (sock_diag_handlers[hndl->family]) err = -EBUSY; else - sock_diag_handlers[hndl->family] = hndl; + rare_write(sock_diag_handlers[hndl->family], hndl); mutex_unlock(&sock_diag_table_mutex); return err; @@ -210,7 +210,7 @@ void sock_diag_unregister(const struct sock_diag_handler *hnld) mutex_lock(&sock_diag_table_mutex); BUG_ON(sock_diag_handlers[family] != hnld); - sock_diag_handlers[family] = NULL; + rare_write(sock_diag_handlers[family], NULL); mutex_unlock(&sock_diag_table_mutex); } EXPORT_SYMBOL_GPL(sock_diag_unregister); -- 2.7.4
Based on PaX's x86 pax_{open,close}_kernel() implementation, this allows HAVE_ARCH_RARE_WRITE to work on x86. There is still work needed to sort out some header file issues: preempt.h is sometimes missing, but it can't be included in pgtable.h unconditionally... some other solution will be needed, perhaps an entirely separate header file for the rare_write()-related defines... This patch is also missing paravirt support. Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 31 +++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cc98d5a294ee..2d1d707aa036 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -106,6 +106,7 @@ config X86 select HAVE_ARCH_KMEMCHECK select HAVE_ARCH_MMAP_RND_BITS if MMU select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT + select HAVE_ARCH_RARE_WRITE select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 1cfb36b8c024..2e6bf661bb84 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -91,6 +91,37 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page); #endif /* CONFIG_PARAVIRT */ +/* TODO: Bad hack to deal with preempt macros being missing sometimes. */ +#ifndef preempt_disable +#include <linux/preempt.h> +#endif + +static __always_inline unsigned long __arch_rare_write_begin(void) +{ + unsigned long cr0; + + preempt_disable(); + barrier(); + cr0 = read_cr0() ^ X86_CR0_WP; + BUG_ON(cr0 & X86_CR0_WP); + write_cr0(cr0); + barrier(); + return cr0 ^ X86_CR0_WP; +} + +static __always_inline unsigned long __arch_rare_write_end(void) +{ + unsigned long cr0; + + barrier(); + cr0 = read_cr0() ^ X86_CR0_WP; + BUG_ON(!(cr0 & X86_CR0_WP)); + write_cr0(cr0); + barrier(); + preempt_enable_no_resched(); + return cr0 ^ X86_CR0_WP; +} + /* * The following only work if pte_present() is true. * Undefined behaviour if not.. -- 2.7.4
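To make the mechanism concrete: clearing CR0.WP makes supervisor-mode stores ignore the read-only bit in the page tables, and CR0 is per-CPU state, which is why only the current CPU gains write access. A rough sketch of the sequence a rare_write() boils down to with this patch (illustrative only; the real helpers above also XOR and BUG()-check the WP bit to validate expected state):

static void example_rare_update(unsigned long *ro_ptr, unsigned long val)
{
	unsigned long cr0;

	preempt_disable();		/* stay on this CPU while WP is off */
	cr0 = read_cr0();
	write_cr0(cr0 & ~X86_CR0_WP);	/* supervisor stores now ignore R/O PTEs */
	*ro_ptr = val;			/* this store lands in .rodata */
	write_cr0(cr0 | X86_CR0_WP);	/* restore write protection */
	preempt_enable_no_resched();
}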
This adds the memory domain (on non-LPAE) to the PMD and PTE dumps. This isn't handled via the regular PMD bit tables because I couldn't find a clean way to fall back to retaining some of the PMD bits when reporting PTEs, so it is special-cased currently. New output example: ---[ Modules ]--- 0x7f000000-0x7f001000 4K KERNEL ro x SHD MEM/CACHED/WBWA 0x7f001000-0x7f002000 4K KERNEL ro NX SHD MEM/CACHED/WBWA 0x7f002000-0x7f004000 8K KERNEL RW NX SHD MEM/CACHED/WBWA ---[ Kernel Mapping ]--- 0x80000000-0x80100000 1M KERNEL RW NX SHD 0x80100000-0x80800000 7M KERNEL ro x SHD 0x80800000-0x80b00000 3M KERNEL ro NX SHD 0x80b00000-0xa0000000 501M KERNEL RW NX SHD ... ---[ Vectors ]--- 0xffff0000-0xffff1000 4K VECTORS USR ro x SHD MEM/CACHED/WBWA 0xffff1000-0xffff2000 4K VECTORS ro x SHD MEM/CACHED/WBWA Signed-off-by: Kees Cook <keescook@chromium.org> --- This patch is already queued in the ARM tree, but I'm including it here too since a following patch updates the list of domain names from this patch... --- arch/arm/mm/dump.c | 54 ++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 44 insertions(+), 10 deletions(-) diff --git a/arch/arm/mm/dump.c b/arch/arm/mm/dump.c index 21192d6eda40..35ff45470dbf 100644 --- a/arch/arm/mm/dump.c +++ b/arch/arm/mm/dump.c @@ -17,6 +17,7 @@ #include <linux/mm.h> #include <linux/seq_file.h> +#include <asm/domain.h> #include <asm/fixmap.h> #include <asm/memory.h> #include <asm/pgtable.h> @@ -43,6 +44,7 @@ struct pg_state { unsigned long start_address; unsigned level; u64 current_prot; + const char *current_domain; }; struct prot_bits { @@ -216,7 +218,8 @@ static void dump_prot(struct pg_state *st, const struct prot_bits *bits, size_t } } -static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u64 val) +static void note_page(struct pg_state *st, unsigned long addr, + unsigned int level, u64 val, const char *domain) { static const char units[] = "KMGTPE"; u64 prot = val & pg_level[level].mask; @@ -224,8 +227,10 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u if (!st->level) { st->level = level; st->current_prot = prot; + st->current_domain = domain; seq_printf(st->seq, "---[ %s ]---\n", st->marker->name); } else if (prot != st->current_prot || level != st->level || + domain != st->current_domain || addr >= st->marker[1].start_address) { const char *unit = units; unsigned long delta; @@ -240,6 +245,8 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u unit++; } seq_printf(st->seq, "%9lu%c", delta, *unit); + if (st->current_domain) + seq_printf(st->seq, " %s", st->current_domain); if (pg_level[st->level].bits) dump_prot(st, pg_level[st->level].bits, pg_level[st->level].num); seq_printf(st->seq, "\n"); @@ -251,11 +258,13 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level, u } st->start_address = addr; st->current_prot = prot; + st->current_domain = domain; st->level = level; } } -static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start) +static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start, + const char *domain) { pte_t *pte = pte_offset_kernel(pmd, 0); unsigned long addr; @@ -263,25 +272,50 @@ static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start) for (i = 0; i < PTRS_PER_PTE; i++, pte++) { addr = start + i * PAGE_SIZE; - note_page(st, addr, 4, pte_val(*pte)); + note_page(st, addr, 4, pte_val(*pte), domain); } } +static const char *get_domain_name(pmd_t *pmd) +{ +#ifndef CONFIG_ARM_LPAE + switch 
(pmd_val(*pmd) & PMD_DOMAIN_MASK) { + case PMD_DOMAIN(DOMAIN_KERNEL): + return "KERNEL "; + case PMD_DOMAIN(DOMAIN_USER): + return "USER "; + case PMD_DOMAIN(DOMAIN_IO): + return "IO "; + case PMD_DOMAIN(DOMAIN_VECTORS): + return "VECTORS"; + default: + return "unknown"; + } +#endif + return NULL; +} + static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start) { pmd_t *pmd = pmd_offset(pud, 0); unsigned long addr; unsigned i; + const char *domain; for (i = 0; i < PTRS_PER_PMD; i++, pmd++) { addr = start + i * PMD_SIZE; + domain = get_domain_name(pmd); if (pmd_none(*pmd) || pmd_large(*pmd) || !pmd_present(*pmd)) - note_page(st, addr, 3, pmd_val(*pmd)); + note_page(st, addr, 3, pmd_val(*pmd), domain); else - walk_pte(st, pmd, addr); + walk_pte(st, pmd, addr, domain); - if (SECTION_SIZE < PMD_SIZE && pmd_large(pmd[1])) - note_page(st, addr + SECTION_SIZE, 3, pmd_val(pmd[1])); + if (SECTION_SIZE < PMD_SIZE && pmd_large(pmd[1])) { + addr += SECTION_SIZE; + pmd++; + domain = get_domain_name(pmd); + note_page(st, addr, 3, pmd_val(*pmd), domain); + } } } @@ -296,7 +330,7 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start) if (!pud_none(*pud)) { walk_pmd(st, pud, addr); } else { - note_page(st, addr, 2, pud_val(*pud)); + note_page(st, addr, 2, pud_val(*pud), NULL); } } } @@ -317,11 +351,11 @@ static void walk_pgd(struct seq_file *m) if (!pgd_none(*pgd)) { walk_pud(&st, pgd, addr); } else { - note_page(&st, addr, 1, pgd_val(*pgd)); + note_page(&st, addr, 1, pgd_val(*pgd), NULL); } } - note_page(&st, 0, 0, 0); + note_page(&st, 0, 0, 0, NULL); } static int ptdump_show(struct seq_file *m, void *v) -- 2.7.4
Everything but the USER domain in DACR_INIT is the same whether CONFIG_CPU_SW_DOMAIN_PAN is enabled or not. This extracts the USER domain difference into a separate __DACR_INIT_USER macro so the rest of DACR_INIT can be shared, making it easier to make future changes (like adding the WR_RARE domain). Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/arm/include/asm/domain.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h index 99d9f630d6b6..8b33bd7f6bf9 100644 --- a/arch/arm/include/asm/domain.h +++ b/arch/arm/include/asm/domain.h @@ -59,18 +59,18 @@ #define domain_val(dom,type) ((type) << (2 * (dom))) #ifdef CONFIG_CPU_SW_DOMAIN_PAN -#define DACR_INIT \ - (domain_val(DOMAIN_USER, DOMAIN_NOACCESS) | \ - domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \ - domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \ - domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT)) +#define __DACR_INIT_USER \ + domain_val(DOMAIN_USER, DOMAIN_NOACCESS) #else +#define __DACR_INIT_USER \ + domain_val(DOMAIN_USER, DOMAIN_CLIENT) +#endif + #define DACR_INIT \ - (domain_val(DOMAIN_USER, DOMAIN_CLIENT) | \ + (__DACR_INIT_USER | \ domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \ domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \ domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT)) -#endif #define __DACR_DEFAULT \ domain_val(DOMAIN_KERNEL, DOMAIN_CLIENT) | \ -- 2.7.4
This creates DOMAIN_WR_RARE for the kernel's .rodata section, separate from DOMAIN_KERNEL to avoid predictive fetching in device memory during a DOMAIN_MANAGER transition. TODO: handle kernel module vmalloc memory, which needs to be marked as DOMAIN_WR_RARE too, for module .rodata sections. Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/arm/include/asm/domain.h | 3 +++ arch/arm/mm/dump.c | 2 ++ arch/arm/mm/init.c | 7 ++++--- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h index 8b33bd7f6bf9..b5ca80ac823c 100644 --- a/arch/arm/include/asm/domain.h +++ b/arch/arm/include/asm/domain.h @@ -43,6 +43,7 @@ #define DOMAIN_IO 0 #endif #define DOMAIN_VECTORS 3 +#define DOMAIN_WR_RARE 4 /* * Domain types @@ -69,11 +70,13 @@ #define DACR_INIT \ (__DACR_INIT_USER | \ domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER) | \ + domain_val(DOMAIN_WR_RARE, DOMAIN_CLIENT) | \ domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \ domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT)) #define __DACR_DEFAULT \ domain_val(DOMAIN_KERNEL, DOMAIN_CLIENT) | \ + domain_val(DOMAIN_WR_RARE, DOMAIN_CLIENT) | \ domain_val(DOMAIN_IO, DOMAIN_CLIENT) | \ domain_val(DOMAIN_VECTORS, DOMAIN_CLIENT) diff --git a/arch/arm/mm/dump.c b/arch/arm/mm/dump.c index 35ff45470dbf..b1aa9a17e0c3 100644 --- a/arch/arm/mm/dump.c +++ b/arch/arm/mm/dump.c @@ -288,6 +288,8 @@ static const char *get_domain_name(pmd_t *pmd) return "IO "; case PMD_DOMAIN(DOMAIN_VECTORS): return "VECTORS"; + case PMD_DOMAIN(DOMAIN_WR_RARE): + return "WR_RARE"; default: return "unknown"; } diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index 1d8558ff9827..d54a74b5718b 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -642,9 +642,10 @@ static struct section_perm ro_perms[] = { .mask = ~L_PMD_SECT_RDONLY, .prot = L_PMD_SECT_RDONLY, #else - .mask = ~(PMD_SECT_APX | PMD_SECT_AP_WRITE), - .prot = PMD_SECT_APX | PMD_SECT_AP_WRITE, - .clear = PMD_SECT_AP_WRITE, + .mask = ~(PMD_SECT_APX | PMD_SECT_AP_WRITE | PMD_DOMAIN_MASK), + .prot = PMD_SECT_APX | PMD_SECT_AP_WRITE | \ + PMD_DOMAIN(DOMAIN_WR_RARE), + .clear = PMD_SECT_AP_WRITE | PMD_DOMAIN(DOMAIN_KERNEL), #endif }, }; -- 2.7.4
Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this allows HAVE_ARCH_RARE_WRITE to work on ARM. Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/arm/Kconfig | 1 + arch/arm/include/asm/domain.h | 3 ++- arch/arm/include/asm/pgtable.h | 27 +++++++++++++++++++++++++++ 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 0d4e71b42c77..57b8aeaf501c 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -45,6 +45,7 @@ config ARM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU + select HAVE_ARCH_RARE_WRITE if MMU && !ARM_LPAE && !CPU_USE_DOMAINS select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT) select HAVE_ARCH_TRACEHOOK select HAVE_ARM_SMCCC if CPU_V7 diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h index b5ca80ac823c..b3fb5c0a2efd 100644 --- a/arch/arm/include/asm/domain.h +++ b/arch/arm/include/asm/domain.h @@ -54,6 +54,7 @@ #define DOMAIN_MANAGER 3 #else #define DOMAIN_MANAGER 1 +#define DOMAIN_FORCE_MANAGER 3 #endif #define domain_mask(dom) ((3) << (2 * (dom))) @@ -118,7 +119,7 @@ static inline void set_domain(unsigned val) } #endif -#ifdef CONFIG_CPU_USE_DOMAINS +#if defined(CONFIG_CPU_USE_DOMAINS) || defined(CONFIG_HAVE_ARCH_RARE_WRITE) #define modify_domain(dom,type) \ do { \ unsigned int domain = get_domain(); \ diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index 1c462381c225..104923ea9eb5 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -57,6 +57,33 @@ extern void __pgd_error(const char *file, int line, pgd_t); #define pmd_ERROR(pmd) __pmd_error(__FILE__, __LINE__, pmd) #define pgd_ERROR(pgd) __pgd_error(__FILE__, __LINE__, pgd) +#ifdef CONFIG_HAVE_ARCH_RARE_WRITE +#include <asm/domain.h> +#include <linux/preempt.h> + +static inline int test_domain(int domain, int domaintype) +{ + return (get_domain() & domain_val(domain, 3)) == + domain_val(domain, domaintype); +} + +static inline unsigned long __arch_rare_write_begin(void) +{ + preempt_disable(); + BUG_ON(test_domain(DOMAIN_WR_RARE, DOMAIN_FORCE_MANAGER)); + modify_domain(DOMAIN_WR_RARE, DOMAIN_FORCE_MANAGER); + return 0; +} + +static inline unsigned long __arch_rare_write_end(void) +{ + BUG_ON(test_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT)); + modify_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT); + preempt_enable_no_resched(); + return 0; +} +#endif + /* * This is the lowest virtual address we can permit any user space * mapping to be mapped at. This is particularly important for -- 2.7.4
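For reference, a sketch of the DACR arithmetic the ARM helpers rely on (constants taken from asm/domain.h in this series): each domain owns a 2-bit field in the Domain Access Control Register, so with DOMAIN_WR_RARE == 4 the affected field is DACR bits [9:8]:

#define domain_val(dom, type)	((type) << (2 * (dom)))

/* Closed: permission checks enforced, .rodata stays read-only. */
unsigned int wr_rare_closed = domain_val(4, 1);	/* DOMAIN_CLIENT -> 0x100 */

/* Open: permission checks bypassed, .rodata temporarily writable. */
unsigned int wr_rare_open = domain_val(4, 3);	/* DOMAIN_FORCE_MANAGER -> 0x300 */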
Some structures that are intended to be made write-rarely are designed to be linked by lists. As a result, there need to be rare_write()-supported linked list primitives. As found in PaX, this adds list management helpers for doing updates to rarely-changed lists. Signed-off-by: Kees Cook <keescook@chromium.org> --- include/linux/list.h | 17 +++++++++++++++++ lib/Makefile | 2 +- lib/list_debug.c | 37 +++++++++++++++++++++++++++++++++++++ 3 files changed, 55 insertions(+), 1 deletion(-) diff --git a/include/linux/list.h b/include/linux/list.h index ae537fa46216..50fdd5b737aa 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -126,6 +126,23 @@ static inline void list_del(struct list_head *entry) entry->prev = LIST_POISON2; } +extern void __rare_list_add(struct list_head *new, + struct list_head *prev, + struct list_head *next); + +static inline void +rare_list_add(__wr_rare_type struct list_head *new, struct list_head *head) +{ + __rare_list_add((struct list_head *)new, head, head->next); +} +static inline void +rare_list_add_tail(__wr_rare_type struct list_head *new, struct list_head *head) +{ + __rare_list_add((struct list_head *)new, head->prev, head); +} + +extern void rare_list_del(__wr_rare_type struct list_head *entry); + /** * list_replace - replace old entry by new one * @old : the element to be replaced diff --git a/lib/Makefile b/lib/Makefile index 320ac46a8725..cd64fd8f7a21 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -83,7 +83,7 @@ obj-$(CONFIG_BTREE) += btree.o obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o -obj-$(CONFIG_DEBUG_LIST) += list_debug.o +obj-y += list_debug.o obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o ifneq ($(CONFIG_HAVE_DEC_LOCK),y) diff --git a/lib/list_debug.c b/lib/list_debug.c index a34db8d27667..1add73f9479a 100644 --- a/lib/list_debug.c +++ b/lib/list_debug.c @@ -10,7 +10,9 @@ #include <linux/bug.h> #include <linux/kernel.h> #include <linux/rculist.h> +#include <linux/mm.h> +#ifdef CONFIG_DEBUG_LIST /* * Check that the data structures for the list manipulations are reasonably * valid. Failures here indicate memory corruption (and possibly an exploit @@ -60,3 +62,38 @@ bool __list_del_entry_valid(struct list_head *entry) } EXPORT_SYMBOL(__list_del_entry_valid); + +#endif /* CONFIG_DEBUG_LIST */ + +void __rare_list_add(struct list_head *new, struct list_head *prev, + struct list_head *next) +{ + if (!__list_add_valid(new, prev, next)) + return; + + rare_write_begin(); + __rare_write(next->prev, new); + __rare_write(new->next, next); + __rare_write(new->prev, prev); + __rare_write(prev->next, new); + rare_write_end(); +} +EXPORT_SYMBOL(__rare_list_add); + +void rare_list_del(__wr_rare_type struct list_head *entry_const) +{ + struct list_head *entry = (struct list_head *)entry_const; + struct list_head *prev = entry->prev; + struct list_head *next = entry->next; + + if (!__list_del_entry_valid(entry)) + return; + + rare_write_begin(); + __rare_write(next->prev, prev); + __rare_write(prev->next, next); + __rare_write(entry->next, LIST_POISON1); + __rare_write(entry->prev, LIST_POISON2); + rare_write_end(); +} +EXPORT_SYMBOL(rare_list_del); -- 2.7.4
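A sketch of how these helpers are meant to be used (the struct and variable names here are hypothetical; the real conversion appears in patch 11):

struct my_handler {
	struct list_head node;
	int id;
};

/* The entry lives in .rodata; the list head itself is ordinary data. */
static struct my_handler handler __wr_rare = { .id = 1 };
static LIST_HEAD(handler_list);

static void handler_register(void)
{
	/* Opens and closes the rare-write window internally. */
	rare_list_add_tail(&handler.node, &handler_list);
}

static void handler_unregister(void)
{
	rare_list_del(&handler.node);
}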
This is a port of the PaX/grsecurity constify plugin. However, it has the automatic function-pointer struct detection temporarily disabled. As a result, this will only recognize the __do_const annotation, which makes all instances of a marked structure read-only. The rare_write() infrastructure can be used to make changes to such variables. Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/Kconfig | 13 + include/linux/compiler-gcc.h | 5 + include/linux/compiler.h | 8 + scripts/Makefile.gcc-plugins | 2 + scripts/gcc-plugins/constify_plugin.c | 585 ++++++++++++++++++++++++++++++++++ 5 files changed, 613 insertions(+) create mode 100644 scripts/gcc-plugins/constify_plugin.c diff --git a/arch/Kconfig b/arch/Kconfig index 5ebf62500b99..79447d6a40bc 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -436,6 +436,19 @@ config GCC_PLUGIN_STRUCTLEAK_VERBOSE initialized. Since not all existing initializers are detected by the plugin, this can produce false positive warnings. +config GCC_PLUGIN_CONSTIFY + bool "Make function pointer structures and marked variables read-only" + depends on GCC_PLUGINS && HAVE_ARCH_RARE_WRITE && !UML + help + By saying Y here the compiler will automatically constify a class + of types that contain only function pointers, as well as any + manually annotated structures. This reduces the kernel's attack + surface and also produces a better memory layout. + + This plugin was ported from grsecurity/PaX. More information at: + * https://grsecurity.net/ + * https://pax.grsecurity.net/ + config HAVE_CC_STACKPROTECTOR bool help diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 0efef9cf014f..cee2bf7c32a4 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -191,6 +191,11 @@ #if GCC_VERSION >= 40500 +#ifdef CONSTIFY_PLUGIN +#define __no_const __attribute__((no_const)) +#define __do_const __attribute__((do_const)) +#endif + #ifndef __CHECKER__ #ifdef LATENT_ENTROPY_PLUGIN #define __latent_entropy __attribute__((latent_entropy)) diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 274bd03cfe9e..2def0f12a71c 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -476,6 +476,14 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s # define __latent_entropy #endif +#ifndef __do_const +# define __do_const +#endif + +#ifndef __no_const +# define __no_const +#endif + /* * Tell gcc if a function is cold. The compiler will assume any path * directly leading to the call is unlikely. 
diff --git a/scripts/Makefile.gcc-plugins b/scripts/Makefile.gcc-plugins index 82335533620e..8d264c0bb758 100644 --- a/scripts/Makefile.gcc-plugins +++ b/scripts/Makefile.gcc-plugins @@ -5,6 +5,8 @@ ifdef CONFIG_GCC_PLUGINS SANCOV_PLUGIN := -fplugin=$(objtree)/scripts/gcc-plugins/sancov_plugin.so gcc-plugin-$(CONFIG_GCC_PLUGIN_CYC_COMPLEXITY) += cyc_complexity_plugin.so + gcc-plugin-$(CONFIG_GCC_PLUGIN_CONSTIFY) += constify_plugin.so + gcc-plugin-cflags-$(CONFIG_GCC_PLUGIN_CONSTIFY) += -DCONSTIFY_PLUGIN gcc-plugin-$(CONFIG_GCC_PLUGIN_LATENT_ENTROPY) += latent_entropy_plugin.so gcc-plugin-cflags-$(CONFIG_GCC_PLUGIN_LATENT_ENTROPY) += -DLATENT_ENTROPY_PLUGIN diff --git a/scripts/gcc-plugins/constify_plugin.c b/scripts/gcc-plugins/constify_plugin.c new file mode 100644 index 000000000000..e7d6f3140e87 --- /dev/null +++ b/scripts/gcc-plugins/constify_plugin.c @@ -0,0 +1,585 @@ +/* + * Copyright 2011 by Emese Revfy <re.emese@gmail.com> + * Copyright 2011-2016 by PaX Team <pageexec@freemail.hu> + * Licensed under the GPL v2, or (at your option) v3 + * + * This gcc plugin constifies all structures which contain only function pointers or are explicitly marked for constification. + * + * Homepage: + * http://www.grsecurity.net/~ephox/const_plugin/ + * + * Usage: + * $ gcc -I`gcc -print-file-name=plugin`/include -fPIC -shared -O2 -o constify_plugin.so constify_plugin.c + * $ gcc -fplugin=constify_plugin.so test.c -O2 + */ + +#include "gcc-common.h" + +// unused C type flag in all versions 4.5-6 +#define TYPE_CONSTIFY_VISITED(TYPE) TYPE_LANG_FLAG_4(TYPE) + +__visible int plugin_is_GPL_compatible; + +static bool enabled = true; + +static struct plugin_info const_plugin_info = { + .version = "201607241840vanilla", + .help = "disable\tturn off constification\n", +}; + +static struct { + const char *name; + const char *asm_op; +} const_sections[] = { + {".init.rodata", "\t.section\t.init.rodata,\"a\""}, + {".ref.rodata", "\t.section\t.ref.rodata,\"a\""}, + {".devinit.rodata", "\t.section\t.devinit.rodata,\"a\""}, + {".devexit.rodata", "\t.section\t.devexit.rodata,\"a\""}, + {".cpuinit.rodata", "\t.section\t.cpuinit.rodata,\"a\""}, + {".cpuexit.rodata", "\t.section\t.cpuexit.rodata,\"a\""}, + {".meminit.rodata", "\t.section\t.meminit.rodata,\"a\""}, + {".memexit.rodata", "\t.section\t.memexit.rodata,\"a\""}, + {".data..read_only", "\t.section\t.data..read_only,\"a\""}, +}; + +typedef struct { + bool has_fptr_field; + bool has_writable_field; + bool has_do_const_field; + bool has_no_const_field; +} constify_info; + +static const_tree get_field_type(const_tree field) +{ + return strip_array_types(TREE_TYPE(field)); +} + +static bool is_fptr(const_tree field) +{ + /* XXX: disable automatic constification. 
*/ + return false; + + const_tree ptr = get_field_type(field); + + if (TREE_CODE(ptr) != POINTER_TYPE) + return false; + + return TREE_CODE(TREE_TYPE(ptr)) == FUNCTION_TYPE; +} + +/* + * determine whether the given structure type meets the requirements for automatic constification, + * including the constification attributes on nested structure types + */ +static void constifiable(const_tree node, constify_info *cinfo) +{ + const_tree field; + + gcc_assert(TREE_CODE(node) == RECORD_TYPE || TREE_CODE(node) == UNION_TYPE); + + // e.g., pointer to structure fields while still constructing the structure type + if (TYPE_FIELDS(node) == NULL_TREE) + return; + + for (field = TYPE_FIELDS(node); field; field = TREE_CHAIN(field)) { + const_tree type = get_field_type(field); + enum tree_code code = TREE_CODE(type); + + if (node == type) + continue; + + if (is_fptr(field)) + cinfo->has_fptr_field = true; + else if (code == RECORD_TYPE || code == UNION_TYPE) { + if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) + cinfo->has_do_const_field = true; + else if (lookup_attribute("no_const", TYPE_ATTRIBUTES(type))) + cinfo->has_no_const_field = true; + else + constifiable(type, cinfo); + } else if (!TREE_READONLY(field)) + cinfo->has_writable_field = true; + } +} + +static bool constified(const_tree node) +{ + constify_info cinfo = { + .has_fptr_field = false, + .has_writable_field = false, + .has_do_const_field = false, + .has_no_const_field = false + }; + + gcc_assert(TREE_CODE(node) == RECORD_TYPE || TREE_CODE(node) == UNION_TYPE); + + if (lookup_attribute("no_const", TYPE_ATTRIBUTES(node))) { +// gcc_assert(!TYPE_READONLY(node)); + return false; + } + + if (lookup_attribute("do_const", TYPE_ATTRIBUTES(node))) { + gcc_assert(TYPE_READONLY(node)); + return true; + } + + constifiable(node, &cinfo); + if ((!cinfo.has_fptr_field || cinfo.has_writable_field || cinfo.has_no_const_field) && !cinfo.has_do_const_field) + return false; + + return TYPE_READONLY(node); +} + +static void deconstify_tree(tree node); + +static void deconstify_type(tree type) +{ + tree field; + + gcc_assert(TREE_CODE(type) == RECORD_TYPE || TREE_CODE(type) == UNION_TYPE); + + for (field = TYPE_FIELDS(type); field; field = TREE_CHAIN(field)) { + const_tree fieldtype = get_field_type(field); + + // special case handling of simple ptr-to-same-array-type members + if (TREE_CODE(TREE_TYPE(field)) == POINTER_TYPE) { + tree ptrtype = TREE_TYPE(TREE_TYPE(field)); + + if (TREE_TYPE(TREE_TYPE(field)) == type) + continue; + if (TREE_CODE(ptrtype) != RECORD_TYPE && TREE_CODE(ptrtype) != UNION_TYPE) + continue; + if (!constified(ptrtype)) + continue; + if (TYPE_MAIN_VARIANT(ptrtype) == TYPE_MAIN_VARIANT(type)) + TREE_TYPE(field) = build_pointer_type(build_qualified_type(type, TYPE_QUALS(ptrtype) & ~TYPE_QUAL_CONST)); + continue; + } + if (TREE_CODE(fieldtype) != RECORD_TYPE && TREE_CODE(fieldtype) != UNION_TYPE) + continue; + if (!constified(fieldtype)) + continue; + + deconstify_tree(field); + TREE_READONLY(field) = 0; + } + TYPE_READONLY(type) = 0; + C_TYPE_FIELDS_READONLY(type) = 0; + if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) { + TYPE_ATTRIBUTES(type) = copy_list(TYPE_ATTRIBUTES(type)); + TYPE_ATTRIBUTES(type) = remove_attribute("do_const", TYPE_ATTRIBUTES(type)); + } +} + +static void deconstify_tree(tree node) +{ + tree old_type, new_type, field; + + old_type = TREE_TYPE(node); + while (TREE_CODE(old_type) == ARRAY_TYPE && TREE_CODE(TREE_TYPE(old_type)) != ARRAY_TYPE) { + node = TREE_TYPE(node) = copy_node(old_type); + 
old_type = TREE_TYPE(old_type); + } + + gcc_assert(TREE_CODE(old_type) == RECORD_TYPE || TREE_CODE(old_type) == UNION_TYPE); + gcc_assert(TYPE_READONLY(old_type) && (TYPE_QUALS(old_type) & TYPE_QUAL_CONST)); + + new_type = build_qualified_type(old_type, TYPE_QUALS(old_type) & ~TYPE_QUAL_CONST); + TYPE_FIELDS(new_type) = copy_list(TYPE_FIELDS(new_type)); + for (field = TYPE_FIELDS(new_type); field; field = TREE_CHAIN(field)) + DECL_FIELD_CONTEXT(field) = new_type; + + deconstify_type(new_type); + + TREE_TYPE(node) = new_type; +} + +static tree handle_no_const_attribute(tree *node, tree name, tree args, int flags, bool *no_add_attrs) +{ + tree type; + constify_info cinfo = { + .has_fptr_field = false, + .has_writable_field = false, + .has_do_const_field = false, + .has_no_const_field = false + }; + + *no_add_attrs = true; + if (TREE_CODE(*node) == FUNCTION_DECL) { + error("%qE attribute does not apply to functions (%qF)", name, *node); + return NULL_TREE; + } + + if (TREE_CODE(*node) == PARM_DECL) { + error("%qE attribute does not apply to function parameters (%qD)", name, *node); + return NULL_TREE; + } + + if (TREE_CODE(*node) == VAR_DECL) { + error("%qE attribute does not apply to variables (%qD)", name, *node); + return NULL_TREE; + } + + if (TYPE_P(*node)) { + type = *node; + } else { + if (TREE_CODE(*node) != TYPE_DECL) { + error("%qE attribute does not apply to %qD (%qT)", name, *node, TREE_TYPE(*node)); + return NULL_TREE; + } + type = TREE_TYPE(*node); + } + + if (TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE) { + error("%qE attribute used on %qT applies to struct and union types only", name, type); + return NULL_TREE; + } + + if (lookup_attribute(IDENTIFIER_POINTER(name), TYPE_ATTRIBUTES(type))) { + error("%qE attribute is already applied to the type %qT", name, type); + return NULL_TREE; + } + + if (TYPE_P(*node)) { + if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) + error("%qE attribute used on type %qT is incompatible with 'do_const'", name, type); + else + *no_add_attrs = false; + return NULL_TREE; + } + + constifiable(type, &cinfo); + if ((cinfo.has_fptr_field && !cinfo.has_writable_field && !cinfo.has_no_const_field) || lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) { + if (enabled) { + if TYPE_P(*node) + deconstify_type(*node); + else + deconstify_tree(*node); + } + if (TYPE_P(*node)) + TYPE_CONSTIFY_VISITED(*node) = 1; + else + TYPE_CONSTIFY_VISITED(TREE_TYPE(*node)) = 1; + return NULL_TREE; + } + + if (enabled && TYPE_FIELDS(type)) + error("%qE attribute used on type %qT that is not constified", name, type); + return NULL_TREE; +} + +static void constify_type(tree type) +{ + gcc_assert(type == TYPE_MAIN_VARIANT(type)); + TYPE_READONLY(type) = 1; + C_TYPE_FIELDS_READONLY(type) = 1; + TYPE_CONSTIFY_VISITED(type) = 1; +// TYPE_ATTRIBUTES(type) = copy_list(TYPE_ATTRIBUTES(type)); +// TYPE_ATTRIBUTES(type) = tree_cons(get_identifier("do_const"), NULL_TREE, TYPE_ATTRIBUTES(type)); +} + +static tree handle_do_const_attribute(tree *node, tree name, tree args, int flags, bool *no_add_attrs) +{ + *no_add_attrs = true; + if (!TYPE_P(*node)) { + error("%qE attribute applies to types only (%qD)", name, *node); + return NULL_TREE; + } + + if (TREE_CODE(*node) != RECORD_TYPE && TREE_CODE(*node) != UNION_TYPE) { + error("%qE attribute used on %qT applies to struct and union types only", name, *node); + return NULL_TREE; + } + + if (lookup_attribute(IDENTIFIER_POINTER(name), TYPE_ATTRIBUTES(*node))) { + error("%qE attribute used on %qT is already applied to 
the type", name, *node); + return NULL_TREE; + } + + if (lookup_attribute("no_const", TYPE_ATTRIBUTES(*node))) { + error("%qE attribute used on %qT is incompatible with 'no_const'", name, *node); + return NULL_TREE; + } + + *no_add_attrs = false; + return NULL_TREE; +} + +static struct attribute_spec no_const_attr = { + .name = "no_const", + .min_length = 0, + .max_length = 0, + .decl_required = false, + .type_required = false, + .function_type_required = false, + .handler = handle_no_const_attribute, +#if BUILDING_GCC_VERSION >= 4007 + .affects_type_identity = true +#endif +}; + +static struct attribute_spec do_const_attr = { + .name = "do_const", + .min_length = 0, + .max_length = 0, + .decl_required = false, + .type_required = false, + .function_type_required = false, + .handler = handle_do_const_attribute, +#if BUILDING_GCC_VERSION >= 4007 + .affects_type_identity = true +#endif +}; + +static void register_attributes(void *event_data, void *data) +{ + register_attribute(&no_const_attr); + register_attribute(&do_const_attr); +} + +static void finish_type(void *event_data, void *data) +{ + tree type = (tree)event_data; + constify_info cinfo = { + .has_fptr_field = false, + .has_writable_field = false, + .has_do_const_field = false, + .has_no_const_field = false + }; + + if (type == NULL_TREE || type == error_mark_node) + return; + +#if BUILDING_GCC_VERSION >= 5000 + if (TREE_CODE(type) == ENUMERAL_TYPE) + return; +#endif + + if (TYPE_FIELDS(type) == NULL_TREE || TYPE_CONSTIFY_VISITED(type)) + return; + + constifiable(type, &cinfo); + + if (lookup_attribute("no_const", TYPE_ATTRIBUTES(type))) { + if ((cinfo.has_fptr_field && !cinfo.has_writable_field && !cinfo.has_no_const_field) || cinfo.has_do_const_field) { + deconstify_type(type); + TYPE_CONSTIFY_VISITED(type) = 1; + } else + error("'no_const' attribute used on type %qT that is not constified", type); + return; + } + + if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) { + if (!cinfo.has_writable_field && !cinfo.has_no_const_field) { + error("'do_const' attribute used on type %qT that is%sconstified", type, cinfo.has_fptr_field ? 
" " : " not "); + return; + } + constify_type(type); + return; + } + + if (cinfo.has_fptr_field && !cinfo.has_writable_field && !cinfo.has_no_const_field) { + if (lookup_attribute("do_const", TYPE_ATTRIBUTES(type))) { + error("'do_const' attribute used on type %qT that is constified", type); + return; + } + constify_type(type); + return; + } + + deconstify_type(type); + TYPE_CONSTIFY_VISITED(type) = 1; +} + +static bool is_constified_var(varpool_node_ptr node) +{ + tree var = NODE_DECL(node); + tree type = TREE_TYPE(var); + + if (node->alias) + return false; + + if (DECL_EXTERNAL(var)) + return false; + + // XXX handle more complex nesting of arrays/structs + if (TREE_CODE(type) == ARRAY_TYPE) + type = TREE_TYPE(type); + + if (TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE) + return false; + + if (!TYPE_READONLY(type) || !C_TYPE_FIELDS_READONLY(type)) + return false; + + if (!TYPE_CONSTIFY_VISITED(type)) + return false; + + return true; +} + +static void check_section_mismatch(varpool_node_ptr node) +{ + tree var, section; + size_t i; + const char *name; + + var = NODE_DECL(node); + name = get_decl_section_name(var); + section = lookup_attribute("section", DECL_ATTRIBUTES(var)); + if (!section) { + if (name) { + fprintf(stderr, "DECL_SECTION [%s] ", name); + dump_varpool_node(stderr, node); + gcc_unreachable(); + } + return; + } else + gcc_assert(name); + +//fprintf(stderr, "SECTIONAME: [%s] ", get_decl_section_name(var)); +//debug_tree(var); + + gcc_assert(!TREE_CHAIN(section)); + gcc_assert(TREE_VALUE(section)); + + section = TREE_VALUE(TREE_VALUE(section)); + gcc_assert(!strcmp(TREE_STRING_POINTER(section), name)); +//debug_tree(section); + + for (i = 0; i < ARRAY_SIZE(const_sections); i++) + if (!strcmp(const_sections[i].name, name)) + return; + + error_at(DECL_SOURCE_LOCATION(var), "constified variable %qD placed into writable section %E", var, section); +} + +// this works around a gcc bug/feature where uninitialized globals +// are moved into the .bss section regardless of any constification +// see gcc/varasm.c:bss_initializer_p() +static void fix_initializer(varpool_node_ptr node) +{ + tree var = NODE_DECL(node); + tree type = TREE_TYPE(var); + + if (DECL_INITIAL(var)) + return; + + DECL_INITIAL(var) = build_constructor(type, NULL); +// inform(DECL_SOURCE_LOCATION(var), "constified variable %qE moved into .rodata", var); +} + +static void check_global_variables(void *event_data, void *data) +{ + varpool_node_ptr node; + + FOR_EACH_VARIABLE(node) { + if (!is_constified_var(node)) + continue; + + check_section_mismatch(node); + fix_initializer(node); + } +} + +static unsigned int check_local_variables_execute(void) +{ + unsigned int ret = 0; + tree var; + + unsigned int i; + + FOR_EACH_LOCAL_DECL(cfun, i, var) { + tree type = TREE_TYPE(var); + + gcc_assert(DECL_P(var)); + if (is_global_var(var)) + continue; + + if (TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE) + continue; + + if (!TYPE_READONLY(type) || !C_TYPE_FIELDS_READONLY(type)) + continue; + + if (!TYPE_CONSTIFY_VISITED(type)) + continue; + + error_at(DECL_SOURCE_LOCATION(var), "constified variable %qE cannot be local", var); + ret = 1; + } + return ret; +} + +#define PASS_NAME check_local_variables +#define NO_GATE +#include "gcc-generate-gimple-pass.h" + +static unsigned int (*old_section_type_flags)(tree decl, const char *name, int reloc); + +static unsigned int constify_section_type_flags(tree decl, const char *name, int reloc) +{ + size_t i; + + for (i = 0; i < 
ARRAY_SIZE(const_sections); i++) + if (!strcmp(const_sections[i].name, name)) + return 0; + + return old_section_type_flags(decl, name, reloc); +} + +static void constify_start_unit(void *gcc_data, void *user_data) +{ +// size_t i; + +// for (i = 0; i < ARRAY_SIZE(const_sections); i++) +// const_sections[i].section = get_unnamed_section(0, output_section_asm_op, const_sections[i].asm_op); +// const_sections[i].section = get_section(const_sections[i].name, 0, NULL); + + old_section_type_flags = targetm.section_type_flags; + targetm.section_type_flags = constify_section_type_flags; +} + +__visible int plugin_init(struct plugin_name_args *plugin_info, struct plugin_gcc_version *version) +{ + const char * const plugin_name = plugin_info->base_name; + const int argc = plugin_info->argc; + const struct plugin_argument * const argv = plugin_info->argv; + int i; + + struct register_pass_info check_local_variables_pass_info; + + check_local_variables_pass_info.pass = make_check_local_variables_pass(); + check_local_variables_pass_info.reference_pass_name = "ssa"; + check_local_variables_pass_info.ref_pass_instance_number = 1; + check_local_variables_pass_info.pos_op = PASS_POS_INSERT_BEFORE; + + if (!plugin_default_version_check(version, &gcc_version)) { + error(G_("incompatible gcc/plugin versions")); + return 1; + } + + for (i = 0; i < argc; ++i) { + if (!(strcmp(argv[i].key, "disable"))) { + enabled = false; + continue; + } + error(G_("unknown option '-fplugin-arg-%s-%s'"), plugin_name, argv[i].key); + } + + if (strncmp(lang_hooks.name, "GNU C", 5) && !strncmp(lang_hooks.name, "GNU C+", 6)) { + inform(UNKNOWN_LOCATION, G_("%s supports C only, not %s"), plugin_name, lang_hooks.name); + enabled = false; + } + + register_callback(plugin_name, PLUGIN_INFO, NULL, &const_plugin_info); + if (enabled) { + register_callback(plugin_name, PLUGIN_ALL_IPA_PASSES_START, check_global_variables, NULL); + register_callback(plugin_name, PLUGIN_FINISH_TYPE, finish_type, NULL); + register_callback(plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL, &check_local_variables_pass_info); + register_callback(plugin_name, PLUGIN_START_UNIT, constify_start_unit, NULL); + } + register_callback(plugin_name, PLUGIN_ATTRIBUTES, register_attributes, NULL); + + return 0; +} -- 2.7.4
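To illustrate what the plugin consumes in its current, annotation-only mode, here is a hypothetical marking (the type and field names are invented for the example):

/* Every instance of this type becomes read-only... */
struct example_ops {
	int (*probe)(void);
	void (*remove)(void);
} __do_const;

/* ...so this lands in .rodata despite not being declared const, and
 * any later field update has to go through the rare_write() machinery.
 */
static struct example_ops default_ops = {
	.probe	= NULL,
	.remove	= NULL,
};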
As found in PaX, mark struct cftype with __do_const and add helpers to deal with rare writes. This is a more complex example of a write-rarely structure, which needs to use list helpers and blocks of begin/end pairs to perform the needed updates. With this change and the constify plugin enabled, the before/after section byte sizes show: before: rodata: 0x2cc2f0 data: 0x130d00 after: rodata: 0x2cf2f0 (+12288) data: 0x12e5c0 (-10048) Signed-off-by: Kees Cook <keescook@chromium.org> --- include/linux/cgroup-defs.h | 2 +- kernel/cgroup/cgroup.c | 35 +++++++++++++++++++++++------------ 2 files changed, 24 insertions(+), 13 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 6a3f850cabab..67563a80d01f 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -434,7 +434,7 @@ struct cftype { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lock_class_key lockdep_key; #endif -}; +} __do_const; /* * Control Group subsystem type. */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 48851327a15e..94188df45f96 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3058,11 +3058,11 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, int ret; #ifdef CONFIG_DEBUG_LOCK_ALLOC - key = &cft->lockdep_key; + key = (struct lock_class_key *)&cft->lockdep_key; #endif kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name), - cgroup_file_mode(cft), 0, cft->kf_ops, cft, - NULL, key); + cgroup_file_mode(cft), 0, cft->kf_ops, + (void *)cft, NULL, key); if (IS_ERR(kn)) return PTR_ERR(kn); @@ -3165,11 +3165,16 @@ static void cgroup_exit_cftypes(struct cftype *cfts) /* free copy for custom atomic_write_len, see init_cftypes() */ if (cft->max_write_len && cft->max_write_len != PAGE_SIZE) kfree(cft->kf_ops); - cft->kf_ops = NULL; - cft->ss = NULL; + + rare_write_begin(); + __rare_write(cft->kf_ops, NULL); + __rare_write(cft->ss, NULL); /* revert flags set by cgroup core while adding @cfts */ - cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL); + __rare_write(cft->flags, + cft->flags & ~(__CFTYPE_ONLY_ON_DFL | + __CFTYPE_NOT_ON_DFL)); + rare_write_end(); } } @@ -3200,8 +3205,10 @@ static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) kf_ops->atomic_write_len = cft->max_write_len; } - cft->kf_ops = kf_ops; - cft->ss = ss; + rare_write_begin(); + __rare_write(cft->kf_ops, kf_ops); + __rare_write(cft->ss, ss); + rare_write_end(); } return 0; @@ -3214,7 +3221,7 @@ static int cgroup_rm_cftypes_locked(struct cftype *cfts) if (!cfts || !cfts[0].ss) return -ENOENT; - list_del(&cfts->node); + rare_list_del(&cfts->node); cgroup_apply_cftypes(cfts, false); cgroup_exit_cftypes(cfts); return 0; @@ -3271,7 +3278,7 @@ static int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) mutex_lock(&cgroup_mutex); - list_add_tail(&cfts->node, &ss->cfts); + rare_list_add_tail(&cfts->node, &ss->cfts); ret = cgroup_apply_cftypes(cfts, true); if (ret) cgroup_rm_cftypes_locked(cfts); @@ -3292,8 +3299,10 @@ int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) { struct cftype *cft; + rare_write_begin(); for (cft = cfts; cft && cft->name[0] != '\0'; cft++) - cft->flags |= __CFTYPE_ONLY_ON_DFL; + __rare_write(cft->flags, cft->flags | __CFTYPE_ONLY_ON_DFL); + rare_write_end(); return cgroup_add_cftypes(ss, cfts); } @@ -3309,8 +3318,10 @@ int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) { struct cftype *cft; + rare_write_begin(); for (cft = 
cfts; cft && cft->name[0] != '\0'; cft++) - cft->flags |= __CFTYPE_NOT_ON_DFL; + __rare_write(cft->flags, cft->flags | __CFTYPE_NOT_ON_DFL); + rare_write_end(); return cgroup_add_cftypes(ss, cfts); } -- 2.7.4
On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
> +/*
> + * Build "write rarely" infrastructure for flipping memory r/w
> + * on a per-CPU basis.
> + */
> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
> +# define __wr_rare
> +# define __wr_rare_type
> +# define __rare_write(__var, __val) (__var = (__val))
> +# define rare_write_begin() do { } while (0)
> +# define rare_write_end() do { } while (0)
> +#else
> +# define __wr_rare __ro_after_init
> +# define __wr_rare_type const
> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
> +# define __rare_write_n(dst, src, len) ({ \
> + BUILD_BUG(!builtin_const(len)); \
> + __arch_rare_write_memcpy((dst), (src), (len)); \
> + })
> +# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var))
> +# else
> +# define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val))
> +# endif
> +# define rare_write_begin() __arch_rare_write_begin()
> +# define rare_write_end() __arch_rare_write_end()
> +#endif
> +#define rare_write(__var, __val) ({ \
> + rare_write_begin(); \
> + __rare_write(__var, __val); \
> + rare_write_end(); \
> + __var; \
> +})
> +
Of course, only after sending this do I realize that the MEMCPY case
will need to be further adjusted, since it currently can't take
literals. I guess something like this needs to be done:
#define __rare_write(var, val) ({ \
typeof(var) __src = (val); \
__rare_write_n(&(var), &(__src), sizeof(var)); \
})
-Kees
--
Kees Cook
Pixel Security
On Wed, Mar 29, 2017 at 11:15:52AM -0700, Kees Cook wrote: > This is take 2 of an RFC series to demonstrate a possible infrastructure > for the "write rarely" memory storage type in the kernel (patch 1). The > intent is to further reduce the internal attack surface of the kernel > by making more variables read-only while "at rest". This is heavily > based on the "__read_only" portion of the KERNEXEC infrastructure from > PaX/grsecurity, though I tried to adjust it to be more in line with > the upstream discussions around the APIs. How much data are we talking about here? > As part of the series I've included both x86 support (patch 4), exactly > as done in PaX, and ARM support (patches 5-8), similar to what is done in > grsecurity but without support for earlier ARM CPUs. Both are lightly > tested by me, though they need a bit more work, especially ARM as it is > missing the correct domain marking for kernel modules. The implementation as it stands on ARM is going to gobble up multiples of 1MB of RAM as you need it to be section aligned due to using domains, so if we're talking about a small amount of data, this is very inefficient. That also only works for non-LPAE as LPAE is unable to use that method. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
On Wed, Mar 29, 2017 at 12:00 PM, Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > On Wed, Mar 29, 2017 at 11:15:52AM -0700, Kees Cook wrote: >> This is take 2 of an RFC series to demonstrate a possible infrastructure >> for the "write rarely" memory storage type in the kernel (patch 1). The >> intent is to further reduce the internal attack surface of the kernel >> by making more variables read-only while "at rest". This is heavily >> based on the "__read_only" portion of the KERNEXEC infrastructure from >> PaX/grsecurity, though I tried to adjust it to be more in line with >> the upstream discussions around the APIs. > > How much data are we talking about here? The goal is to put as much sensitive stuff from .data as possible into .rodata, which is mostly structures with function pointers, etc. I haven't measured the "final" results, since there's still a lot of work to do to get all the other annotations into upstream. >> As part of the series I've included both x86 support (patch 4), exactly >> as done in PaX, and ARM support (patches 5-8), similar to what is done in >> grsecurity but without support for earlier ARM CPUs. Both are lightly >> tested by me, though they need a bit more work, especially ARM as it is >> missing the correct domain marking for kernel modules. > > The implementation as it stands on ARM is going to gobble up > multiples of 1MB of RAM as you need it to be section aligned due to > using domains, so if we're talking about a small amount of data, > this is very inefficient. That also only works for non-LPAE as LPAE > is unable to use that method. AIUI, we're just flipping the domain from DOMAIN_KERNEL to DOMAIN_WR_RARE on the .rodata section (which is already 1MB aligned), and (in the future if I or someone else can figure out how) on the kernel module vmap area (which also shouldn't be changing alignment at all). -Kees -- Kees Cook Pixel Security
On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
> allows HAVE_ARCH_RARE_WRITE to work on x86.
>
> +
> +static __always_inline unsigned long __arch_rare_write_begin(void)
> +{
> +	unsigned long cr0;
> +
> +	preempt_disable();

This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work, as
would local_irq_disable(). There's no way that just disabling
preemption is enough.

(Also, how does this interact with perf nmis?)

--Andy
On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>
>
>> +
>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>> +{
>> + unsigned long cr0;
>> +
>> + preempt_disable();
>
> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
> as would local_irq_disable(). There's no way that just disabling
> preemption is enough.
>
> (Also, how does this interact with perf nmis?)
Do you mean preempt_disable() isn't strong enough here? I'm open to
suggestions. The goal would be to make sure nothing between _begin and
_end would get executed without interruption...
-Kees
--
Kees Cook
Pixel Security
> On 30 Mar 2017, at 3:23 AM, Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>> +/*
>> + * Build "write rarely" infrastructure for flipping memory r/w
>> + * on a per-CPU basis.
>> + */
>> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
>> +# define __wr_rare
>> +# define __wr_rare_type
>> +# define __rare_write(__var, __val) (__var = (__val))
>> +# define rare_write_begin() do { } while (0)
>> +# define rare_write_end() do { } while (0)
>> +#else
>> +# define __wr_rare __ro_after_init
>> +# define __wr_rare_type const
>> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
>> +# define __rare_write_n(dst, src, len) ({ \
>> +	BUILD_BUG(!builtin_const(len)); \
>> +	__arch_rare_write_memcpy((dst), (src), (len)); \
>> +	})
>> +# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var))
>> +# else
>> +# define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val))
>> +# endif
>> +# define rare_write_begin() __arch_rare_write_begin()
>> +# define rare_write_end() __arch_rare_write_end()
>> +#endif
>> +#define rare_write(__var, __val) ({ \
>> +	rare_write_begin(); \
>> +	__rare_write(__var, __val); \
>> +	rare_write_end(); \
>> +	__var; \
>> +})
>> +
>
> Of course, only after sending this do I realize that the MEMCPY case
> will need to be further adjusted, since it currently can't take
> literals. I guess something like this needs to be done:
>
> #define __rare_write(var, val) ({ \
>	typeof(var) __src = (val); \
>	__rare_write_n(&(var), &(__src), sizeof(var)); \
> })
>

Right, and it also has a problem with BUILD_BUG, which causes a
compilation error when CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY is true:
BUILD_BUG is defined in <linux/bug.h>, but <linux/bug.h> includes
<linux/compiler.h>.

Please see the following.

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 3334fa9..3fa50e1 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -350,11 +350,11 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 # define __wr_rare __ro_after_init
 # define __wr_rare_type const
 # ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
-# define __rare_write_n(dst, src, len) ({ \
-	BUILD_BUG(!builtin_const(len)); \
-	__arch_rare_write_memcpy((dst), (src), (len)); \
+# define __rare_write_n(var, val, len) ({ \
+	typeof(val) __val = val; \
+	__arch_rare_write_memcpy(&(var), &(__val), (len)); \
 })
-# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var))
+# define __rare_write(var, val) __rare_write_n((var), (val), sizeof(var))
 # else
 # define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val))
 # endif

> -Kees
>
> --
> Kees Cook
> Pixel Security
On Wed, 2017-03-29 at 11:15 -0700, Kees Cook wrote:
> diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
> index c7635a79341f..8fbadfa4cc34 100644
> --- a/drivers/misc/lkdtm_perms.c
> +++ b/drivers/misc/lkdtm_perms.c
> [...]
> +/* This is marked __wr_rare, so it should ultimately be .rodata. */
> +static unsigned long wr_rare __wr_rare = 0xAA66AA66;
> [...]
> +void lkdtm_WRITE_RARE_WRITE(void)
> +{
> + /* Explicitly cast away "const" for the test. */
wr_rare isn't actually declared const above though? I don't think
__wr_rare includes a const, apologies if I missed it.
OOI, if wr_rare _were_ const, then can the compiler optimise a pair
of reads spanning the rare_write? i.e. adding const to the declaration
above to get:
static const unsigned long wr_rare __wr_rare = 0xAA66AA66;
x = wr_rare;
rare_write(wr_rare, 0xf000baaa);
y = wr_rare;
Is it possible that x == y == 0xaa66aa66 because gcc realises the x and
y came from the same const location? Have I missed a clobber somewhere
(I can't actually find a definition of __arch_rare_write_memcpy in this
series so maybe it's there), or is such code expected to always cast
away the const first?
I suppose such constructs are rare in practice in the sorts of places
where rare_write is appropriate, but with aggressive inlining it could
occur as an unexpected trap for the unwary perhaps.
Ian.
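[A minimal sketch of the hazard Ian describes, assuming the declaration
really did carry const, which, as noted, __wr_rare currently does not.
Pseudo-code in places: the posted macros may well reject a const
lvalue; the point is the compiler's view of the location:]

static const unsigned long wr_rare __wr_rare = 0xAA66AA66;

unsigned long demo(void)
{
	unsigned long x, y;

	x = wr_rare;			/* gcc may fold this to 0xAA66AA66 */
	rare_write(wr_rare, 0xf000baaa);
	y = wr_rare;			/* ...and reuse the same folded constant */

	/* forcing a reload through a volatile access defeats the folding: */
	y = READ_ONCE(*(unsigned long *)&wr_rare);
	return x ^ y;
}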
On Thu, Mar 30, 2017 at 2:34 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
> On Wed, 2017-03-29 at 11:15 -0700, Kees Cook wrote:
>> diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
>> index c7635a79341f..8fbadfa4cc34 100644
>> --- a/drivers/misc/lkdtm_perms.c
>> +++ b/drivers/misc/lkdtm_perms.c
>> [...]
>> +/* This is marked __wr_rare, so it should ultimately be .rodata. */
>> +static unsigned long wr_rare __wr_rare = 0xAA66AA66;
>> [...]
>> +void lkdtm_WRITE_RARE_WRITE(void)
>> +{
>> +	/* Explicitly cast away "const" for the test. */
>
> wr_rare isn't actually declared const above though? I don't think
> __wr_rare includes a const, apologies if I missed it.

Yeah, good point. I think this was a left-over from an earlier version
where I'd forgotten about that detail.

> OOI, if wr_rare _were_ const, then can the compiler optimise a pair
> of reads spanning the rare_write? i.e. adding const to the declaration
> above to get:
>
> static const unsigned long wr_rare __wr_rare = 0xAA66AA66;
> x = wr_rare;
> rare_write(wr_rare, 0xf000baaa);
> y = wr_rare;
>
> Is it possible that x == y == 0xaa66aa66 because gcc realises the x and
> y came from the same const location? Have I missed a clobber somewhere
> (I can't actually find a definition of __arch_rare_write_memcpy in this
> series so maybe it's there), or is such code expected to always cast
> away the const first?
>
> I suppose such constructs are rare in practice in the sorts of places
> where rare_write is appropriate, but with aggressive inlining it could
> occur as an unexpected trap for the unwary perhaps.

Right, __wr_rare is actually marked as .data..ro_after_init, which gcc
effectively ignores (thinking it's part of .data), but the linker
script later moves this section into the read-only portion with
.rodata. As a result, the compiler treats it as writable, but the
storage location is actually read-only. (And, AIUI, the constify
plugin makes things read-only in a similar way, though I think it's
more subtle but still avoids the const-optimization dangers.)

-Kees

--
Kees Cook
Pixel Security
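[For reference, the mechanism Kees describes is roughly the following,
paraphrasing include/linux/cache.h and include/asm-generic/vmlinux.lds.h
of this era; details may differ:]

/* the annotation is only a section attribute... */
#define __ro_after_init	__attribute__((__section__(".data..ro_after_init")))

/* ...and the linker script places that section inside the read-only
 * region, so the storage ends up write-protected once init is done: */
#define RO_AFTER_INIT_DATA					\
	VMLINUX_SYMBOL(__start_ro_after_init) = .;		\
	*(.data..ro_after_init)					\
	VMLINUX_SYMBOL(__end_ro_after_init) = .;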
On Thu, Mar 30, 2017 at 12:44 AM, Ho-Eun Ryu <hoeun.ryu@gmail.com> wrote:
>
>> On 30 Mar 2017, at 3:23 AM, Kees Cook <keescook@chromium.org> wrote:
>>
>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>> +/*
>>> + * Build "write rarely" infrastructure for flipping memory r/w
>>> + * on a per-CPU basis.
>>> + */
>>> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
>>> +# define __wr_rare
>>> +# define __wr_rare_type
>>> +# define __rare_write(__var, __val) (__var = (__val))
>>> +# define rare_write_begin() do { } while (0)
>>> +# define rare_write_end() do { } while (0)
>>> +#else
>>> +# define __wr_rare __ro_after_init
>>> +# define __wr_rare_type const
>>> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
>>> +# define __rare_write_n(dst, src, len) ({ \
>>> + BUILD_BUG(!builtin_const(len)); \
>>> + __arch_rare_write_memcpy((dst), (src), (len)); \
>>> + })
>>> +# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var))
>>> +# else
>>> +# define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val))
>>> +# endif
>>> +# define rare_write_begin() __arch_rare_write_begin()
>>> +# define rare_write_end() __arch_rare_write_end()
>>> +#endif
>>> +#define rare_write(__var, __val) ({ \
>>> + rare_write_begin(); \
>>> + __rare_write(__var, __val); \
>>> + rare_write_end(); \
>>> + __var; \
>>> +})
>>> +
>>
>> Of course, only after sending this do I realize that the MEMCPY case
>> will need to be further adjusted, since it currently can't take
>> literals. I guess something like this needs to be done:
>>
>> #define __rare_write(var, val) ({ \
>> typeof(var) __src = (val); \
>> __rare_write_n(&(var), &(__src), sizeof(var)); \
>> })
>>
>
> Right, and it also has a problem with BUILD_BUG, which causes a compilation error when CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY is true:
> BUILD_BUG is defined in <linux/bug.h>, but <linux/bug.h> includes <linux/compiler.h>.
>
> Please see the following.
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 3334fa9..3fa50e1 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -350,11 +350,11 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> # define __wr_rare __ro_after_init
> # define __wr_rare_type const
> # ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
> -# define __rare_write_n(dst, src, len) ({ \
> - BUILD_BUG(!builtin_const(len)); \
> - __arch_rare_write_memcpy((dst), (src), (len)); \
> +# define __rare_write_n(var, val, len) ({ \
> + typeof(val) __val = val; \
> + __arch_rare_write_memcpy(&(var), &(__val), (len)); \
> })
> -# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var))
> +# define __rare_write(var, val) __rare_write_n((var), (val), sizeof(var))
> # else
> # define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val))
> # endif
Thanks for the catch, I'll update this (I'll use compiletime_assert,
which is defined in compiler.h).
-Kees
--
Kees Cook
Pixel Security
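[For what it's worth, the BUILD_BUG(!builtin_const(len)) line as posted
could not have compiled anyway: BUILD_BUG() takes no argument, and the
real compiler primitive is __builtin_constant_p(). A sketch of the
fixed-up macro using compiletime_assert(); the message text is
illustrative:]

# define __rare_write_n(dst, src, len) ({				\
	compiletime_assert(__builtin_constant_p(len),			\
			   "rare_write() length must be a constant");	\
	__arch_rare_write_memcpy((dst), (src), (len));			\
	})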
On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>
>>
>>> +
>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>> +{
>>> + unsigned long cr0;
>>> +
>>> + preempt_disable();
>>
>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>> as would local_irq_disable(). There's no way that just disabling
>> preemption is enough.
>>
>> (Also, how does this interact with perf nmis?)
>
> Do you mean preempt_disable() isn't strong enough here? I'm open to
> suggestions. The goal would be to make sure nothing between _begin and
> _end would get executed without interruption...
>
Sorry for the very slow response.
preempt_disable() isn't strong enough to prevent interrupts, and an
interrupt here would run with WP off, causing unknown havoc. I tend
to think that the caller should be responsible for turning off
interrupts.
--Andy
On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>
>>>
>>>> +
>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>> +{
>>>> + unsigned long cr0;
>>>> +
>>>> + preempt_disable();
>>>
>>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>> as would local_irq_disable(). There's no way that just disabling
>>> preemption is enough.
>>>
>>> (Also, how does this interact with perf nmis?)
>>
>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>> suggestions. The goal would be to make sure nothing between _begin and
>> _end would get executed without interruption...
>>
>
> Sorry for the very slow response.
>
> preempt_disable() isn't strong enough to prevent interrupts, and an
> interrupt here would run with WP off, causing unknown havoc. I tend
> to think that the caller should be responsible for turning off
> interrupts.
So, something like:
Top-level functions:
static __always_inline void rare_write_begin(void)
{
preempt_disable();
local_irq_disable();
barrier();
__arch_rare_write_begin();
barrier();
}
static __always_inline void rare_write_end(void)
{
barrier();
__arch_rare_write_end();
barrier();
local_irq_enable();
preempt_enable_no_resched();
}
x86-specific helpers:
static __always_inline unsigned long __arch_rare_write_begin(void)
{
unsigned long cr0;
cr0 = read_cr0() ^ X86_CR0_WP;
BUG_ON(cr0 & X86_CR0_WP);
write_cr0(cr0);
return cr0 ^ X86_CR0_WP;
}
static __always_inline unsigned long __arch_rare_write_end(void)
{
unsigned long cr0;
cr0 = read_cr0() ^ X86_CR0_WP;
BUG_ON(!(cr0 & X86_CR0_WP));
write_cr0(cr0);
return cr0 ^ X86_CR0_WP;
}
I can give it a spin...
-Kees
--
Kees Cook
Pixel Security
On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>>
>>>>
>>>>> +
>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>>> +{
>>>>> + unsigned long cr0;
>>>>> +
>>>>> + preempt_disable();
>>>>
>>>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>>> as would local_irq_disable(). There's no way that just disabling
>>>> preemption is enough.
>>>>
>>>> (Also, how does this interact with perf nmis?)
>>>
>>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>>> suggestions. The goal would be to make sure nothing between _begin and
>>> _end would get executed without interruption...
>>>
>>
>> Sorry for the very slow response.
>>
>> preempt_disable() isn't strong enough to prevent interrupts, and an
>> interrupt here would run with WP off, causing unknown havoc. I tend
>> to think that the caller should be responsible for turning off
>> interrupts.
>
> So, something like:
>
> Top-level functions:
>
> static __always_inline void rare_write_begin(void)
> {
> preempt_disable();
> local_irq_disable();
> barrier();
> __arch_rare_write_begin();
> barrier();
> }
Looks good, except you don't need preempt_disable().
local_irq_disable() also disables preemption. You might need to use
local_irq_save(), though, depending on whether any callers already
have IRQs off.
--Andy
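[Pulling Andy's suggestion together, a sketch of the irq-saving
variant; passing flags through the begin/end pair is an assumption
about how the API would have to change:]

static __always_inline unsigned long rare_write_begin(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* also keeps us from being preempted */
	barrier();
	__arch_rare_write_begin();
	barrier();
	return flags;
}

static __always_inline void rare_write_end(unsigned long flags)
{
	barrier();
	__arch_rare_write_end();
	barrier();
	local_irq_restore(flags);
}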
> On 30 Mar 2017, at 3:15 AM, Kees Cook <keescook@chromium.org> wrote:
>
> [...]
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index f8110051188f..274bd03cfe9e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -336,6 +336,38 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  __u.__val; \
>  })
>
> +/*
> + * Build "write rarely" infrastructure for flipping memory r/w
> + * on a per-CPU basis.
> + */
> +#ifndef CONFIG_HAVE_ARCH_RARE_WRITE
> +# define __wr_rare
> +# define __wr_rare_type
> +# define __rare_write(__var, __val) (__var = (__val))
> +# define rare_write_begin() do { } while (0)
> +# define rare_write_end() do { } while (0)
> +#else
> +# define __wr_rare __ro_after_init
> +# define __wr_rare_type const
> +# ifdef CONFIG_HAVE_ARCH_RARE_WRITE_MEMCPY
> +# define __rare_write_n(dst, src, len) ({ \
> +	BUILD_BUG(!builtin_const(len)); \
> +	__arch_rare_write_memcpy((dst), (src), (len)); \
> +	})
> +# define __rare_write(var, val) __rare_write_n(&(var), &(val), sizeof(var))
> +# else
> +# define __rare_write(var, val) ((*(typeof((typeof(var))0) *)&(var)) = (val))
> +# endif
> +# define rare_write_begin() __arch_rare_write_begin()
> +# define rare_write_end() __arch_rare_write_end()
> +#endif
> +#define rare_write(__var, __val) ({ \
> +	rare_write_begin(); \
> +	__rare_write(__var, __val); \
> +	rare_write_end(); \
> +	__var; \
> +})
> +

How about we have a separate header file splitting the section
annotations and the actual APIs?

include/linux/compiler.h:
	__wr_rare
	__wr_rare_type

include/linux/rare_write.h:
	__rare_write_n()
	__rare_write()
	rare_write_begin()
	rare_write_end()

OR moving all of them to include/linux/rare_write.h.

I'm writing the arm64 port of the rare_write feature and I've gotten
stuck on some header problems for the next version of the patch. I need
some other mmu-related APIs (mostly defined in
`arch/arm64/include/asm/mmu_context.h`) to implement those helpers, but
I cannot include that header in `include/linux/compiler.h` prior to the
definition of the rare_write macros (huge compilation errors). You know
that `linux/compiler.h` is mostly a base for other headers, not a user
of them.

I have to define the `__arch_rare_write_[begin/end/memcpy]()` functions
as `static inline` in a header file somewhere in
`arch/arm64/include/asm/` to avoid making them ROP targets, and the
helpers need those mmu-related APIs. So the helpers and the mmu-related
APIs cannot be defined prior to the definition of the rare_write macros
in `linux/compiler.h`.

And I think I'll need `include/linux/module.h` for module-related APIs
like `is_module_address()` to support module address conversion in
`__arch_rare_write_memcpy()`, which will make the situation worse.

So:
* make `include/linux/rare_write.h`
* the header defines the rare_write APIs
* the header includes `include/asm/rare_write.h` for the arch-specific helpers
* users of the rare_write feature should include `linux/rare_write.h`

OR suggest other solutions please.

> [...]
On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>>>
>>>>>> +
>>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>>>> +{
>>>>>> + unsigned long cr0;
>>>>>> +
>>>>>> + preempt_disable();
>>>>>
>>>>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>>>> as would local_irq_disable(). There's no way that just disabling
>>>>> preemption is enough.
>>>>>
>>>>> (Also, how does this interact with perf nmis?)
>>>>
>>>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>>>> suggestions. The goal would be to make sure nothing between _begin and
>>>> _end would get executed without interruption...
>>>>
>>>
>>> Sorry for the very slow response.
>>>
>>> preempt_disable() isn't strong enough to prevent interrupts, and an
>>> interrupt here would run with WP off, causing unknown havoc. I tend
>>> to think that the caller should be responsible for turning off
>>> interrupts.
>>
>> So, something like:
>>
>> Top-level functions:
>>
>> static __always_inline void rare_write_begin(void)
>> {
>> preempt_disable();
>> local_irq_disable();
>> barrier();
>> __arch_rare_write_begin();
>> barrier();
>> }
>
> Looks good, except you don't need preempt_disable().
> local_irq_disable() also disables preemption. You might need to use
> local_irq_save(), though, depending on whether any callers already
> have IRQs off.
Well, doesn't look good to me. NMIs will still be able to interrupt
this code and will run with CR0.WP = 0.
Shouldn't you instead question yourself why PaX can do it "just" with
preempt_disable() instead?!
Cheers,
Mathias
On Wed, Mar 29, 2017 at 11:16:00AM -0700, Kees Cook wrote:
> +static inline unsigned long __arch_rare_write_end(void)
> +{
> +	BUG_ON(test_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT));
> +	modify_domain(DOMAIN_WR_RARE, DOMAIN_CLIENT);
> +	preempt_enable_no_resched();

NAK

> +	return 0;
> +}
On Wed, Mar 29, 2017 at 11:15:56AM -0700, Kees Cook wrote:
> +static __always_inline unsigned long __arch_rare_write_end(void)
> +{
> +	unsigned long cr0;
> +
> +	barrier();
> +	cr0 = read_cr0() ^ X86_CR0_WP;
> +	BUG_ON(!(cr0 & X86_CR0_WP));
> +	write_cr0(cr0);
> +	barrier();
> +	preempt_enable_no_resched();

NAK

> +	return cr0 ^ X86_CR0_WP;
> +}
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> >> static __always_inline void rare_write_begin(void)
> >> {
> >> preempt_disable();
> >> local_irq_disable();
> >> barrier();
> >> __arch_rare_write_begin();
> >> barrier();
> >> }
> >
> > Looks good, except you don't need preempt_disable().
> > local_irq_disable() also disables preemption. You might need to use
> > local_irq_save(), though, depending on whether any callers already
> > have IRQs off.
>
> Well, doesn't look good to me. NMIs will still be able to interrupt
> this code and will run with CR0.WP = 0.
>
> Shouldn't you instead question yourself why PaX can do it "just" with
> preempt_disable() instead?!
That's silly. Just because PaX does it, doesn't mean it's correct. To be
honest, playing games with the CR0.WP bit is outright stupid to begin with.
Whether protected by preempt_disable or local_irq_disable, to make that
work it needs CR0 handling in the exception entry/exit at the lowest
level. And that's just a nightmare maintenance wise as it's prone to be
broken over time. Aside of that it's pointless overhead for the normal case.
The proper solution is:
write_rare(ptr, val)
{
mp = map_shadow_rw(ptr);
*mp = val;
unmap_shadow_rw(mp);
}
map_shadow_rw() is essentially the same thing as we do in the highmem case
where the kernel creates a shadow mapping of the user space pages via
kmap_atomic().
It's valid (at least on x86) to have a shadow map with the same page
attributes but write enabled. That does not require any fixups of CR0 and
just works.
Thanks,
tglx
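[For a concrete picture of the shadow-mapping idea: x86 already does
something very similar for code patching, where text_poke() writes to
read-only kernel text through a temporary fixmap alias. A rough sketch
along those lines; it assumes the write stays within one page, reusing
FIX_TEXT_POKE0 is purely illustrative, and serialized use of the slot
is assumed (the real text_poke() holds text_mutex):]

static void rare_write_shadow(void *dst, const void *src, size_t len)
{
	struct page *page = virt_to_page(dst);
	unsigned long offset = (unsigned long)dst & ~PAGE_MASK;
	unsigned long flags;

	local_irq_save(flags);
	/* create a temporary writable alias of the target page */
	set_fixmap(FIX_TEXT_POKE0, page_to_phys(page));
	memcpy((char *)fix_to_virt(FIX_TEXT_POKE0) + offset, src, len);
	clear_fixmap(FIX_TEXT_POKE0);
	local_irq_restore(flags);
}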
On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> >> static __always_inline void rare_write_begin(void)
>> >> {
>> >>	preempt_disable();
>> >>	local_irq_disable();
>> >>	barrier();
>> >>	__arch_rare_write_begin();
>> >>	barrier();
>> >> }
>> >
>> > Looks good, except you don't need preempt_disable().
>> > local_irq_disable() also disables preemption. You might need to use
>> > local_irq_save(), though, depending on whether any callers already
>> > have IRQs off.
>>
>> Well, doesn't look good to me. NMIs will still be able to interrupt
>> this code and will run with CR0.WP = 0.
>>
>> Shouldn't you instead question yourself why PaX can do it "just" with
>> preempt_disable() instead?!
>
> That's silly. Just because PaX does it, doesn't mean it's correct. To be
> honest, playing games with the CR0.WP bit is outright stupid to begin with.

Why that? It allows fast and CPU local modifications of r/o memory.
OTOH, an approach that needs to fiddle with page table entries requires
global synchronization to keep the individual TLB states in sync.
Hmm.. Not that fast, I'd say.

> Whether protected by preempt_disable or local_irq_disable, to make that
> work it needs CR0 handling in the exception entry/exit at the lowest
> level. And that's just a nightmare maintenance wise as it's prone to be
> broken over time.

It seems to be working fine for more than a decade now in PaX. So it
can't be such a big maintenance nightmare ;)

> Aside of that it's pointless overhead for the normal case.
>
> The proper solution is:
>
> write_rare(ptr, val)
> {
>	mp = map_shadow_rw(ptr);
>	*mp = val;
>	unmap_shadow_rw(mp);
> }
>
> map_shadow_rw() is essentially the same thing as we do in the highmem case
> where the kernel creates a shadow mapping of the user space pages via
> kmap_atomic().

The "proper solution" seems to be much slower compared to just toggling
CR0.WP (which is costly in itself, already) because of the TLB
invalidation / synchronisation involved.

> It's valid (at least on x86) to have a shadow map with the same page
> attributes but write enabled. That does not require any fixups of CR0 and
> just works.

"Just works", sure -- but it's not as tightly focused as the PaX
solution which is CPU local, while your proposed solution is globally
visible.

Cheers,
Mathias
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Whether protected by preempt_disable or local_irq_disable, to make that
> > work it needs CR0 handling in the exception entry/exit at the lowest
> > level. And that's just a nightmare maintenance wise as it's prone to be
> > broken over time.
>
> It seems to be working fine for more than a decade now in PaX. So it
> can't be such a big maintenance nightmare ;)

I really do not care whether PaX wants to chase and verify that over and
over. I certainly don't want to take the chance to leak CR0.WP ever and
I very much care about extra stuff to check in the entry/exit path.

> The "proper solution" seems to be much slower compared to just
> toggling CR0.WP (which is costly in itself, already) because of the
> TLB invalidation / synchronisation involved.

Why the heck should we care about rare writes being performant?

> > It's valid (at least on x86) to have a shadow map with the same page
> > attributes but write enabled. That does not require any fixups of CR0 and
> > just works.
>
> "Just works", sure -- but it's not as tightly focused as the PaX
> solution which is CPU local, while your proposed solution is globally
> visible.

Making the world and some more writeable hardly qualifies as tightly
focussed. Making the mapping concept CPU local is not rocket science
either. The question is whether it's worth the trouble.

Thanks,
tglx
On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Whether protected by preempt_disable or local_irq_disable, to make that
>> > work it needs CR0 handling in the exception entry/exit at the lowest
>> > level. And that's just a nightmare maintenance wise as it's prone to be
>> > broken over time.
>>
>> It seems to be working fine for more than a decade now in PaX. So it
>> can't be such a big maintenance nightmare ;)
>
> I really do not care whether PaX wants to chase and verify that over and
> over. I certainly don't want to take the chance to leak CR0.WP ever and I
> very much care about extra stuff to check in the entry/exit path.

Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
somewhere sensible should make those "leaks" visible fast -- and their
exploitation impossible, i.e. fail hard.

>> The "proper solution" seems to be much slower compared to just
>> toggling CR0.WP (which is costly in itself, already) because of the
>> TLB invalidation / synchronisation involved.
>
> Why the heck should we care about rare writes being performant?

As soon as they stop being rare and people start extending the r/o
protection to critical data structures accessed often. Then performance
matters.

>> > It's valid (at least on x86) to have a shadow map with the same page
>> > attributes but write enabled. That does not require any fixups of CR0 and
>> > just works.
>>
>> "Just works", sure -- but it's not as tightly focused as the PaX
>> solution which is CPU local, while your proposed solution is globally
>> visible.
>
> Making the world and some more writeable hardly qualifies as tightly
> focussed. Making the mapping concept CPU local is not rocket science
> either. The question is whether it's worth the trouble.

No, the question is if the value of the concept is well understood and
if people can see what could be done with such a strong primitive.
Apparently not...

Cheers,
Mathias
On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote:
> Why that? It allows fast and CPU local modifications of r/o memory.
> OTOH, an approach that needs to fiddle with page table entries
> requires global synchronization to keep the individual TLB states in
> sync. Hmm.. Not that fast, I'd say.
The fixmaps used for kmap_atomic are per-cpu, no global sync required.
On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Fri, 7 Apr 2017, Mathias Krause wrote:
>>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> > Whether protected by preempt_disable or local_irq_disable, to make that
>>> > work it needs CR0 handling in the exception entry/exit at the lowest
>>> > level. And that's just a nightmare maintenance wise as it's prone to be
>>> > broken over time.
>>>
>>> It seems to be working fine for more than a decade now in PaX. So it
>>> can't be such a big maintenance nightmare ;)
>>
>> I really do not care whether PaX wants to chase and verify that over and
>> over. I certainly don't want to take the chance to leak CR0.WP ever and I
>> very much care about extra stuff to check in the entry/exit path.
>
> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> somewhere sensible should make those "leaks" visible fast -- and their
> exploitation impossible, i.e. fail hard.

The leaks surely exist and now we'll just add an exploitable BUG.

>>> > It's valid (at least on x86) to have a shadow map with the same page
>>> > attributes but write enabled. That does not require any fixups of CR0 and
>>> > just works.
>>>
>>> "Just works", sure -- but it's not as tightly focused as the PaX
>>> solution which is CPU local, while your proposed solution is globally
>>> visible.
>>
>> Making the world and some more writeable hardly qualifies as tightly
>> focussed. Making the mapping concept CPU local is not rocket science
>> either. The question is whether it's worth the trouble.
>
> No, the question is if the value of the concept is well understood and
> if people can see what could be done with such a strong primitive.
> Apparently not...

I think we're approaching this all wrong, actually. The fact that x86
has this CR0.WP thing is arguably a historical accident, and the fact
that PaX uses it doesn't mean that PaX is doing it the best way for
upstream Linux.

Why don't we start at the other end and do a generic non-arch-specific
implementation: set up an mm_struct that contains an RW alias of the
relevant parts of rodata and use use_mm to access it. (That is,
get_fs() to back up the old fs, set_fs(USER_DS), use_mm(&rare_write_mm),
do the write using copy_to_user, undo everything.)

Then someone who cares about performance can benchmark the CR0.WP
approach against it and try to argue that it's a good idea. This
benchmark should wait until I'm done with my PCID work, because PCID is
going to make use_mm() a whole heck of a lot faster.

--Andy
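[A sketch of what Andy describes, assuming <linux/uaccess.h> and
<linux/mmu_context.h>; rare_write_mm and the RW alias address are
assumptions -- nothing in this series defines them:]

/* hypothetical mm carrying an RW alias of the rodata we want to touch */
extern struct mm_struct rare_write_mm;

static void rare_write_via_alias(void __user *rw_alias,
				 const void *src, size_t len)
{
	mm_segment_t old_fs = get_fs();

	set_fs(USER_DS);
	use_mm(&rare_write_mm);
	/* the alias is a "user" address in rare_write_mm, so copy_to_user works */
	if (copy_to_user(rw_alias, src, len))
		BUG();
	unuse_mm(&rare_write_mm);
	set_fs(old_fs);
}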
On Fri, Apr 07, 2017 at 09:14:29AM -0700, Andy Lutomirski wrote:
> I think we're approaching this all wrong, actually. The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.
>
> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it. (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)
FWIW, I completely agree with this approach.
That's largely the approach arm64 would have to take regardless (as per
Hoeun's patches), and having a consistent common implementation would be
desirable.
There are a couple of other complications to handle (e.g. a perf
interrupt coming in and trying to read from the mapping), but I think we
can handle that generically.
Thanks,
Mark.
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 7 Apr 2017, Mathias Krause wrote:
> >> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Whether protected by preempt_disable or local_irq_disable, to make that
> >> > work it needs CR0 handling in the exception entry/exit at the lowest
> >> > level. And that's just a nightmare maintenance wise as it's prone to be
> >> > broken over time.
> >>
> >> It seems to be working fine for more than a decade now in PaX. So it
> >> can't be such a big maintenance nightmare ;)
> >
> > I really do not care whether PaX wants to chase and verify that over and
> > over. I certainly don't want to take the chance to leak CR0.WP ever and I
> > very much care about extra stuff to check in the entry/exit path.
>
> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> somewhere sensible should make those "leaks" visible fast -- and their
> exploitation impossible, i.e. fail hard.

Sure, you trade leaking WP with a potentially exploitable BUG().

> >> The "proper solution" seems to be much slower compared to just
> >> toggling CR0.WP (which is costly in itself, already) because of the
> >> TLB invalidation / synchronisation involved.
> >
> > Why the heck should we care about rare writes being performant?
>
> As soon as they stop being rare and people start extending the r/o
> protection to critical data structures accessed often. Then
> performance matters.

Emphasis on "Then". I'm not seeing it, because no matter what you do
it's going to be slow. Aside of that increasing the usage will also
increase the chance to leak stuff. In that case I rather leak a single
page mapping temporarily than taking the chance to leak WP.

> >> > It's valid (at least on x86) to have a shadow map with the same page
> >> > attributes but write enabled. That does not require any fixups of CR0 and
> >> > just works.
> >>
> >> "Just works", sure -- but it's not as tightly focused as the PaX
> >> solution which is CPU local, while your proposed solution is globally
> >> visible.
> >
> > Making the world and some more writeable hardly qualifies as tightly
> > focussed. Making the mapping concept CPU local is not rocket science
> > either. The question is whether it's worth the trouble.
>
> No, the question is if the value of the concept is well understood and
> if people can see what could be done with such a strong primitive.
> Apparently not...

Oh, well. We can stop that discussion right here, if all you can
provide is a killer phrase. I'm well aware what can be done with a
strong primitive and I certainly understand the concept, but I'm not
naive enough to believe that lifting one of the strong protections the
kernel has by globally disabling WP is anything which should be even
considered. That bit is a horrible misconception and should be fused
to 1.

Aside of that, if you had taken the time to figure out how the
kmap_atomic stuff works, then you would have noticed that it does not
require cross CPU pagetable syncs and that the mapping place can be
randomized to a certain degree. So this has neither global impact, nor
does it become immediately globally visible.

Thanks,
tglx
On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
> > Well, doesn't look good to me. NMIs will still be able to interrupt
> > this code and will run with CR0.WP = 0.
> >
> > Shouldn't you instead question yourself why PaX can do it "just" with
> > preempt_disable() instead?!
>
> That's silly. Just because PaX does it, doesn't mean it's correct.

is that FUD or do you have actionable information to share?

> To be honest, playing games with the CR0.WP bit is outright stupid to
> begin with.

why is that? cr0.wp exists since the i486 and its behaviour fits my
purposes quite well, it's the best security/performance tradeoff i know
of.

> Whether protected by preempt_disable or local_irq_disable, to make that
> work it needs CR0 handling in the exception entry/exit at the lowest
> level.

correct.

> And that's just a nightmare maintenance wise as it's prone to be
> broken over time.

i've got 14 years of experience of maintaining it and i never saw it
break.

> Aside of that it's pointless overhead for the normal case.

unless it's optional code as the whole feature already is.

> The proper solution is:
>
> write_rare(ptr, val)
> {
>	mp = map_shadow_rw(ptr);
>	*mp = val;
>	unmap_shadow_rw(mp);
> }

this is not *the* proper solution, but only a naive one that suffers
from the exact same need that the cr0.wp approach does and has worse
performance impact. not exactly a win...

[continuing from your next mail in order to save round-trip time]

> I really do not care whether PaX wants to chase and verify that over and
> over.

verifying it is no different than verifying, say, swapgs use.

> I certainly don't want to take the chance to leak CR0.WP ever

why and where would cr0.wp leak?

> and I very much care about extra stuff to check in the entry/exit path.

your 'proper' solution needs (a lot more) extra stuff too.

> Why the heck should we care about rare writes being performant?

because you've been misled by the NIH crowd here that the PaX feature
they tried to (badly) extract from has anything to do with frequency of
writes. it does not. what it does do is provide an environment for
variables that are conceptually writable but for security reasons
should be read-only most of the time by most of the code (ditto for the
grossly misunderstood and thus misnamed ro-after-shit). now imagine
locking down the page table hierarchy with it...

> Making the world and some more writeable hardly qualifies as tightly
> focused.

you forgot to add 'for a window of a few insns' and that the map/unmap
approach does the same under an attacker controlled ptr.

> Making the mapping concept CPU local is not rocket science
> either. The question is whether it's worth the trouble.

it is for people who care about the integrity of the kernel, and all
this read-onlyness stuff implies that some do.
On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Fri, 7 Apr 2017, Mathias Krause wrote:
> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> > somewhere sensible should make those "leaks" visible fast -- and their
> > exploitation impossible, i.e. fail hard.
>
> The leaks surely exist and now we'll just add an exploitable BUG.

can you please share those leaks that 'surely exist' and CC oss-security
while at it?

> I think we're approaching this all wrong, actually. The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.
>
> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it. (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)
>
> Then someone who cares about performance can benchmark the CR0.WP
> approach against it and try to argue that it's a good idea. This
> benchmark should wait until I'm done with my PCID work, because PCID
> is going to make use_mm() a whole heck of a lot faster.

in my measurements switching PCID hovers around 230 cycles for snb-ivb
and 200-220 for hsw-skl, whereas cr0 writes are around 230-240 cycles.
there's of course a whole lot more impact for switching address spaces,
so it'll never be fast enough to beat cr0.wp.
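[For anyone wanting to reproduce numbers like these, a crude sketch of
the measurement using existing x86 helpers (read_cr0/write_cr0/rdtsc);
a careful benchmark would also need serialization such as lfence or
rdtscp, which this ignores:]

static inline u64 cycles_for_cr0_toggle(void)
{
	unsigned long cr0 = read_cr0();
	u64 t0, t1;

	t0 = rdtsc();
	write_cr0(cr0 ^ X86_CR0_WP);	/* WP off */
	write_cr0(cr0);			/* WP back on */
	t1 = rdtsc();
	return (t1 - t0) / 2;		/* rough per-write cost */
}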
On Fri, Apr 7, 2017 at 1:09 AM, Ho-Eun Ryu <hoeun.ryu@gmail.com> wrote:
>
>> On 30 Mar 2017, at 3:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>
>> [...]
>
> How about we have a separate header file splitting the section
> annotations and the actual APIs?
>
> include/linux/compiler.h:
>	__wr_rare
>	__wr_rare_type
>
> include/linux/rare_write.h:
>	__rare_write_n()
>	__rare_write()
>	rare_write_begin()
>	rare_write_end()

Yeah, that's actually exactly what I did for the current tree:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/write-rarely

-Kees

--
Kees Cook
Pixel Security
On Fri, 7 Apr 2017, Andy Lutomirski wrote:
> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> I really do not care whether PaX wants to chase and verify that over and
> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I
> >> very much care about extra stuff to check in the entry/exit path.
> >
> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> > somewhere sensible should make those "leaks" visible fast -- and their
> > exploitation impossible, i.e. fail hard.
>
> The leaks surely exist and now we'll just add an exploitable BUG.

:)

> I think we're approaching this all wrong, actually. The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.

As I said before. It should be fused to 1.

> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it. (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)

That works too, though I'm not sure that it's more efficient than
temporarily creating and undoing a single CPU-local alias mapping
similar to the kmap_atomic mechanism in the highmem case.

Aside of that, the alias mapping requires a full PGDIR entry unless you
want to get into the mess of keeping yet another identity mapping up to
date.

Thanks,
tglx
On Fri, Apr 7, 2017 at 1:44 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Andy Lutomirski wrote:
>> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
>> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> I really do not care whether PaX wants to chase and verify that over and
>> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I
>> >> very much care about extra stuff to check in the entry/exit path.
>> >
>> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> > somewhere sensible should make those "leaks" visible fast -- and their
>> > exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> :)
>
>> I think we're approaching this all wrong, actually. The fact that x86
>> has this CR0.WP thing is arguably a historical accident, and the fact
>> that PaX uses it doesn't mean that PaX is doing it the best way for
>> upstream Linux.
>
> As I said before. It should be fused to 1.
>
>> Why don't we start at the other end and do a generic non-arch-specific
>> implementation: set up an mm_struct that contains an RW alias of the
>> relevant parts of rodata and use use_mm to access it. (That is,
>> get_fs() to back up the old fs, set_fs(USER_DS),
>> use_mm(&rare_write_mm), do the write using copy_to_user, undo
>> everything.)
>
> That works too, though I'm not sure that it's more efficient than
> temporarily creating and undoing a single cpu local alias mapping similar
> to the kmap_atomic mechanism in the highmem case.
>
> Aside of that the alias mapping requires a full PGDIR entry unless you want
> to get into the mess of keeping yet another identity mapping up to date.
I probably chose the wrong name for this feature (write rarely).
That's _usually_ true, but "sensitive_write()" was getting rather
long. The things that we need to protect with this are certainly stuff
that doesn't get much writing, but some things are just plain
sensitive (like page tables) and we should still try to be as fast as
possible with them.
I'll try to include a third example in the next posting of the series
that makes this more obvious.
I'm all for a general case for the infrastructure (as Andy and Mark
have mentioned), but I don't want to get into the situation where
people start refusing to use it because it's "too slow" (for example,
see refcount_t vs net-dev right now).
-Kees
--
Kees Cook
Pixel Security
> I probably chose the wrong name for this feature (write rarely).
> That's _usually_ true, but "sensitive_write()" was getting rather
> long. The things that we need to protect with this are certainly stuff
> that doesn't get much writing, but some things are just plain
> sensitive (like page tables) and we should still try to be as fast as
> possible with them.

Not too late to rename it. Scoped write? I think it makes sense to use
a different API than PaX for portability too, but not a different x86
implementation. It's quite important to limit the writes to the calling
thread, and it needs to perform well to be introduced widely.

> I'm all for a general case for the infrastructure (as Andy and Mark
> have mentioned), but I don't want to get into the situation where
> people start refusing to use it because it's "too slow" (for example,
> see refcount_t vs net-dev right now).

Meanwhile, the PaX implementation has improved to avoid the issues that
were brought up, while only introducing a single always-predicted (due
to code placement) branch on the overflow flag. That seems to have gone
unnoticed upstream, where there's now a much slower implementation
that's not more secure, and is blocked from introduction in areas where
it's most needed based on the performance. Not to mention that it's
opt-in... which is never going to work.
>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> somewhere sensible should make those "leaks" visible fast -- and their
>> exploitation impossible, i.e. fail hard.
>
> The leaks surely exist and now we'll just add an exploitable BUG.
That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW
with a bunch of *known* DoS bugs dealt with in grsecurity and those
were known issues that were unfixed for no apparent reason other than
keeping egos intact. It looks like there are still some left...
In that case, there also wasn't a security/performance advantage.
On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
>> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> > somewhere sensible should make those "leaks" visible fast -- and their
>> > exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> can you please share those leaks that 'surely exist' and CC oss-security
> while at it?

I meant in the patchset here, not in grsecurity. grsecurity (on very,
very brief inspection) seems to read cr0 and fix it up in
pax_enter_kernel.

>>
>> Then someone who cares about performance can benchmark the CR0.WP
>> approach against it and try to argue that it's a good idea. This
>> benchmark should wait until I'm done with my PCID work, because PCID
>> is going to make use_mm() a whole heck of a lot faster.
>
> in my measurements switching PCID hovers around 230 cycles for snb-ivb
> and 200-220 for hsw-skl, whereas cr0 writes are around 230-240 cycles.
> there's of course a whole lot more impact for switching address spaces,
> so it'll never be fast enough to beat cr0.wp.

If I'm reading this right, you're saying that a non-flushing CR3 write
is about the same cost as a CR0.WP write. If so, then why should CR0 be
preferred over the (arch-neutral) CR3 approach? And why would switching
address spaces obviously be much slower? There'll be a very small
number of TLB fills needed for the actual protected access.

--Andy
On Fri, Apr 7, 2017 at 9:21 PM, Daniel Micay <danielmicay@gmail.com> wrote:
>>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>>> somewhere sensible should make those "leaks" visible fast -- and their
>>> exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW
> with a bunch of *known* DoS bugs that grsecurity had already dealt
> with; those were known issues that went unfixed for no apparent reason
> other than keeping egos intact. It looks like there are still some left...
>
> In that case, there also wasn't a security/performance advantage.
This is wildly off topic, but I think it's worth answering anyway
because there's an important point here:
grsecurity and PaX are great projects. They have a lot of good ideas,
and they're put together quite nicely. The upstream kernel should
*not* do things differently from the way they are in grsecurity/PaX
just because it wants to be different. Conversely, the upstream
kernel should not do things the same way as PaX just to be more like
PaX.
Keep in mind that the upstream kernel and grsecurity/PaX operate under
different constraints. The upstream kernel tries to keep itself clean
and to make tree-wide updates rather than keeping compatibility stuff
around. PaX and grsecurity presumably want to retain some degree of
simplicity when porting to newer upstream versions.
In the context of virtually mapped stacks / KSTACKOVERFLOW, this
naturally leads to different solutions. The upstream kernel had a
bunch of buggy drivers that played badly with virtually mapped stacks.
grsecurity sensibly went for the approach where the buggy drivers kept
working. The upstream kernel went for the approach of fixing the
drivers rather than keeping a compatibility workaround. Different
constraints, different solutions.
The point being that, if someone sends a patch to the x86 entry code
that's justified by "be like PaX" or by "be different than PaX",
that's not okay. It needs a real justification that stands on its
own.
In the case of rare writes or pax_open_kernel [1] or whatever we want
to call it, CR3 would work without arch-specific code, and CR0 would
not. That's an argument for CR3 that would need to be countered by
something. (Sure, avoiding leaks either way might need arch changes.
OTOH, a *randomized* CR3-based approach might not have as much of a
leak issue to begin with.)
[1] Contrary to popular belief, I don't sit around reading grsecurity
code or config options, so I really don't know what this thing is
called.
> grsecurity and PaX are great projects. They have a lot of good ideas,
> and they're put together quite nicely. The upstream kernel should
> *not* do things differently from the way they are in grsecurity/PaX
> just because it wants to be different. Conversely, the upstream
> kernel should not do things the same way as PaX just to be more like
> PaX.
>
> Keep in mind that the upstream kernel and grsecurity/PaX operate under
> different constraints. The upstream kernel tries to keep itself clean
> and to make tree-wide updates rather than keeping compatibility stuff
> around. PaX and grsecurity presumably want to retain some degree of
> simplicity when porting to newer upstream versions.
>
> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
> naturally leads to different solutions. The upstream kernel had a
> bunch of buggy drivers that played badly with virtually mapped stacks.
> grsecurity sensibly went for the approach where the buggy drivers kept
> working. The upstream kernel went for the approach of fixing the
> drivers rather than keeping a compatibility workaround. Different
> constraints, different solutions.

Sure, but nothing is currently landed right now, and the PaX
implementation is a known quantity with a lot of testing. The submitted
code is aimed at rare writes to globals, but this feature is more than
that, and design decisions shouldn't be based on just the short term.
Kees is sensibly landing features by submitting them a little bit at a
time, but a negative side effect of that is too much focus on just the
initial proposed usage.

As a downstream that's going to be making heavy use of mainline
security features as a base, due to PaX and grsecurity becoming
commercial-only private patches (*because* of upstream and the
companies involved), I hope things go in the right direction, i.e. not
weaker/slower implementations than PaX.

For example, while USERCOPY isn't entirely landed upstream (missing
slab whitelisting and user offset/size), it's the base for the full
feature and is going to get more testing. On the other hand, refcount_t
and the slab/page sanitization work are 100% useless if you're willing
to incorporate the changes to do it without needless performance loss
and complexity. PAN emulation on 64-bit ARM was fresh ground, while on
ARMv7 a weaker implementation was used for no better reason than
clashing egos. The upstream virtually mapped stacks will probably be
perfectly good, but I just find it to be such a waste of time and
effort to reinvent it and retread the same ground in terms of finding
the bugs.

I actually care a lot more about 64-bit ARM support than I do x86, but
using a portable API for pax_open_kernel (for the simple uses at least)
is separate from choosing the underlying implementation. There might
not be a great way to do it on the architectures I care about, but that
doesn't need to hinder x86. It's really not that much code... A
weaker/slower implementation for x86 also encourages the same
elsewhere.

> In the case of rare writes or pax_open_kernel [1] or whatever we want
> to call it, CR3 would work without arch-specific code, and CR0 would
> not. That's an argument for CR3 that would need to be countered by
> something. (Sure, avoiding leaks either way might need arch changes.
> OTOH, a *randomized* CR3-based approach might not have as much of a
> leak issue to begin with.)

By randomized do you mean just ASLR?
Even if the window of opportunity to exploit it is small, it's really
not the same, and this has a lot more use than just rare writes to
small global variables. I wouldn't consider stack ASLR to be a
replacement for making them readable/writable only by the current
thread either (required for race-free return CFI on x86, at least if
actual returns are used).

> [1] Contrary to popular belief, I don't sit around reading grsecurity
> code or config options, so I really don't know what this thing is
> called.

Lots of the features aren't actually named. Maybe this could be
considered part of KERNEXEC, but I don't think it is.
On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> The
> submitted code is aimed at rare writes to globals, but this feature is
> more than that, and design decisions shouldn't be based on just the
> short term.

Then, if you disagree with a proposed design, *explain why* in a
standalone manner. Say what future uses a different design would have.

> I actually care a lot more about 64-bit ARM support than I do x86, but
> using a portable API for pax_open_kernel (for the simple uses at
> least) is separate from choosing the underlying implementation. There
> might not be a great way to do it on the architectures I care about,
> but that doesn't need to hinder x86. It's really not that much code...
> A weaker/slower implementation for x86 also encourages the same
> elsewhere.

No one has explained how CR0.WP is weaker or slower than my proposal.
Here's what I'm proposing:

At boot, choose a random address A. Create an mm_struct that has a
single VMA starting at A that represents the kernel's rarely-written
section. Compute O = (A - VA of rarely-written section). To do a
rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

This should work on any arch that has an MMU that allows this type of
aliasing and that doesn't have PA-based protections on the
rarely-written section.

It'll be considerably slower than CR0.WP on a current x86 kernel, but,
with PCID landed, it shouldn't be much slower. It has the added benefit
that writes to non-rare-write data using the rare-write primitive will
fail.

--Andy
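In code form, the proposal is roughly the following sketch. use_mm()
and unuse_mm() are the real kernel APIs; the two globals and the helper
are hypothetical names used only for illustration:

	#include <linux/mmu_context.h>	/* use_mm(), unuse_mm() */

	/* Hypothetical state set up once at boot (illustrative only). */
	static struct mm_struct *rare_mm;	/* maps the writable alias at A */
	static unsigned long rare_offset;	/* O = A - VA of the r/o section */

	static void rare_write_ulong(unsigned long *va, unsigned long val)
	{
		unsigned long *alias = (void *)((unsigned long)va + rare_offset);

		use_mm(rare_mm);	/* NB: documented for kernel threads only,
					 * a point raised later in this thread */
		WRITE_ONCE(*alias, val);
		unuse_mm(rare_mm);
	}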
* Andy Lutomirski <luto@kernel.org> wrote:

> On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> > The
> > submitted code is aimed at rare writes to globals, but this feature is
> > more than that, and design decisions shouldn't be based on just the
> > short term.
>
> Then, if you disagree with a proposed design, *explain why* in a
> standalone manner. Say what future uses a different design would
> have.
>
> > I actually care a lot more about 64-bit ARM support than I do x86, but
> > using a portable API for pax_open_kernel (for the simple uses at
> > least) is separate from choosing the underlying implementation. There
> > might not be a great way to do it on the architectures I care about,
> > but that doesn't need to hinder x86. It's really not that much code...
> > A weaker/slower implementation for x86 also encourages the same
> > elsewhere.
>
> No one has explained how CR0.WP is weaker or slower than my proposal.
> Here's what I'm proposing:
>
> At boot, choose a random address A. Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section. Compute O = (A - VA of rarely-written section). To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

BTW., note that this is basically a pagetable based protection key
variant.

> It'll be considerably slower than CR0.WP on a current x86 kernel, but, with PCID
> landed, it shouldn't be much slower. It has the added benefit that writes to
> non-rare-write data using the rare-write primitive will fail.

... which is a security advantage of the use_mm() based design you
suggest.

Thanks,

	Ingo
On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:

> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
> >> Then someone who cares about performance can benchmark the CR0.WP
> >> approach against it and try to argue that it's a good idea. This
> >> benchmark should wait until I'm done with my PCID work, because PCID
> >> is going to make use_mm() a whole heck of a lot faster.
> >
> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
> > of course a whole lot more impact for switching address spaces so it'll never
> > be fast enough to beat cr0.wp.
>
> If I'm reading this right, you're saying that a non-flushing CR3 write
> is about the same cost as a CR0.WP write. If so, then why should CR0
> be preferred over the (arch-neutral) CR3 approach?

cr3 (page table switching) isn't arch neutral at all ;). you probably meant
the higher level primitives except they're not enough to implement the scheme
as discussed before since the enter/exit paths are very much arch dependent.

on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
writes and nothing else, use_mm suffers not only from the cr3 writes but
also locking/atomic ops and cr4 writes on its path and the inevitable TLB
entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
a fast path in the microcode and reduce its overhead by an order of magnitude.

> And why would switching address spaces obviously be much slower?
> There'll be a very small number of TLB fills needed for the actual
> protected access.

you'll be duplicating TLB entries in the alternative PCID for both code
and data, where they will accumulate (=take room away from the normal PCID
and expose unwanted memory for access) unless you also flush them when
switching back (which then will cost even more cycles). also i'm not sure
that processors implement all the 12 PCID bits so depending on how many PCIDs
you plan to use, you could be causing even more unnecessary TLB replacements.
On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:

> grsecurity and PaX are great projects. They have a lot of good ideas,
> and they're put together quite nicely. The upstream kernel should
> *not* do things differently from the way they are in grsecurity/PaX
> just because it wants to be different. Conversely, the upstream
> kernel should not do things the same way as PaX just to be more like
> PaX.

so far, so good.

> Keep in mind that the upstream kernel and grsecurity/PaX operate under
> different constraints. The upstream kernel tries to keep itself clean

so do we.

> and to make tree-wide updates rather than keeping compatibility stuff
> around.

so do we (e.g., fptr fixes for RAP, non-refcount atomic users, etc).

> PaX and grsecurity presumably want to retain some degree of
> simplicity when porting to newer upstream versions.

s/simplicity/minimality/ as the code itself can be complex but that'll
be of the minimal complexity we can come up with.

> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
> naturally leads to different solutions. The upstream kernel had a
> bunch of buggy drivers that played badly with virtually mapped stacks.
> grsecurity sensibly went for the approach where the buggy drivers kept
> working. The upstream kernel went for the approach of fixing the
> drivers rather than keeping a compatibility workaround. Different
> constraints, different solutions.

except that's not what happened at all. spender's first version did just
a vmalloc for the kstack like the totally NIH'd version upstream does
now. while we always anticipated buggy dma users and thus had code that
would detect them so that we could fix them, we quickly figured that the
upstream kernel wasn't quite up to snuff as we had assumed and, faced
with the amount of buggy code, we went for the current vmap approach
which kept users' systems working instead of breaking them.

you're trying to imply that upstream fixed the drivers but as the facts
show, that's not true. you simply unleashed your code on the world and
hoped(?) that enough suckers would try it out during the -rc window. as
we all know several releases and almost a year later, that was a losing
bet as you still keep fixing those drivers (and something tells me that
we haven't seen the end of it). this is simply irresponsible engineering
for no technical reason.

> In the case of rare writes or pax_open_kernel [1] or whatever we want
> to call it, CR3 would work without arch-specific code, and CR0 would
> not. That's an argument for CR3 that would need to be countered by
> something. (Sure, avoiding leaks either way might need arch changes.
> OTOH, a *randomized* CR3-based approach might not have as much of a
> leak issue to begin with.)

i have yet to see anyone explain what they mean by 'leak' here but if it
is what i think it is then the arch specific entry/exit changes are not
optional but mandatory. see below for randomization.

[merging in your other mail as it's the same topic]

> No one has explained how CR0.WP is weaker or slower than my proposal.

you misunderstood, Daniel was talking about your use_mm approach.

> Here's what I'm proposing:
>
> At boot, choose a random address A.

what is the threat that a random address defends against?

> Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section. Compute O = (A - VA of rarely-written section). To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

the problem is that the amount of __read_only data extends beyond
vmlinux, i.e., this approach won't scale. another problem is that it
can't be used inside use_mm and switch_mm themselves (no read-only task
structs or percpu pgd for you ;) and probably several other contexts.
last but not least, use_mm says this about itself:

	(Note: this routine is intended to be called only
	 from a kernel thread context)

so using it will need some engineering (or the comment be fixed).

> This should work on any arch that has an MMU that allows this type of
> aliasing and that doesn't have PA-based protections on the rarely-written
> section.

you didn't address the arch-specific changes needed in the enter/exit
paths.

> It'll be considerably slower than CR0.WP on a current x86 kernel, but,
> with PCID landed, it shouldn't be much slower.

based on my experience with UDEREF on amd64, unfortunately PCID isn't
all it's cracked up to be (IIRC, it maybe halved the UDEREF overhead
instead of basically eliminating it as i had anticipated, and that was
on snb; ivb and later do even worse).

> It has the added benefit that writes to non-rare-write data using the
> rare-write primitive will fail.

what is the threat model you're assuming for this feature? based on
what i have for PaX (arbitrary read/write access exploited for
data-only attacks), the above makes no sense to me...
On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
>> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>> >> Then someone who cares about performance can benchmark the CR0.WP
>> >> approach against it and try to argue that it's a good idea. This
>> >> benchmark should wait until I'm done with my PCID work, because PCID
>> >> is going to make use_mm() a whole heck of a lot faster.
>> >
>> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
>> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
>> > of course a whole lot more impact for switching address spaces so it'll never
>> > be fast enough to beat cr0.wp.
>>
>> If I'm reading this right, you're saying that a non-flushing CR3 write
>> is about the same cost as a CR0.WP write. If so, then why should CR0
>> be preferred over the (arch-neutral) CR3 approach?
>
> cr3 (page table switching) isn't arch neutral at all ;). you probably meant
> the higher level primitives except they're not enough to implement the scheme
> as discussed before since the enter/exit paths are very much arch dependent.

Yes.

> on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> writes and nothing else, use_mm suffers not only from the cr3 writes but
> also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> a fast path in the microcode and reduce its overhead by an order of magnitude.

If the CR4 writes happen for this use case, that's a bug.

>> And why would switching address spaces obviously be much slower?
>> There'll be a very small number of TLB fills needed for the actual
>> protected access.
>
> you'll be duplicating TLB entries in the alternative PCID for both code
> and data, where they will accumulate (=take room away from the normal PCID
> and expose unwanted memory for access) unless you also flush them when
> switching back (which then will cost even more cycles). also i'm not sure
> that processors implement all the 12 PCID bits so depending on how many PCIDs
> you plan to use, you could be causing even more unnecessary TLB replacements.

Unless the CPU is rather dumber than I expect, the only duplicated
entries should be for the writable aliases of pages that are written.
The rest of the pages are global and should be shared for all PCIDs.

--Andy
On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>
>> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
>> naturally leads to different solutions. The upstream kernel had a
>> bunch of buggy drivers that played badly with virtually mapped stacks.
>> grsecurity sensibly went for the approach where the buggy drivers kept
>> working. The upstream kernel went for the approach of fixing the
>> drivers rather than keeping a compatibility workaround. Different
>> constraints, different solutions.
>
> except that's not what happened at all. spender's first version did just
> a vmalloc for the kstack like the totally NIH'd version upstream does
> now. while we always anticipated buggy dma users and thus had code that
> would detect them so that we could fix them, we quickly figured that the
> upstream kernel wasn't quite up to snuff as we had assumed and, faced
> with the amount of buggy code, we went for the current vmap approach
> which kept users' systems working instead of breaking them.
>
> you're trying to imply that upstream fixed the drivers but as the facts
> show, that's not true. you simply unleashed your code on the world and
> hoped(?) that enough suckers would try it out during the -rc window. as
> we all know several releases and almost a year later, that was a losing
> bet as you still keep fixing those drivers (and something tells me that
> we haven't seen the end of it). this is simply irresponsible engineering
> for no technical reason.

I consider breaking buggy drivers (in a way that they either generally
work okay or that they break with a nice OOPS, depending on config) to
be better than having a special case in what's supposed to be a fast
path to keep them working. I did consider forcing the relevant debug
options on for a while just to help shake these bugs out of the
woodwork faster.

>> In the case of rare writes or pax_open_kernel [1] or whatever we want
>> to call it, CR3 would work without arch-specific code, and CR0 would
>> not. That's an argument for CR3 that would need to be countered by
>> something. (Sure, avoiding leaks either way might need arch changes.
>> OTOH, a *randomized* CR3-based approach might not have as much of a
>> leak issue to begin with.)
>
> i have yet to see anyone explain what they mean by 'leak' here but if it
> is what i think it is then the arch specific entry/exit changes are not
> optional but mandatory. see below for randomization.

By "leak" I mean that a bug or exploit causes unintended code to run
with CR0.WP cleared, with a special CR3, or with a special PTE loaded.
PaX hooks the entry code to avoid leaks.

>> At boot, choose a random address A.
>
> what is the threat that a random address defends against?

Makes it harder to exploit a case where the CR3 setting leaks.

>> Create an mm_struct that has a
>> single VMA starting at A that represents the kernel's rarely-written
>> section. Compute O = (A - VA of rarely-written section). To do a
>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> the problem is that the amount of __read_only data extends beyond vmlinux,
> i.e., this approach won't scale. another problem is that it can't be used
> inside use_mm and switch_mm themselves (no read-only task structs or percpu
> pgd for you ;) and probably several other contexts.

Can you clarify these uses that extend beyond vmlinux? I haven't
looked at the grsecurity patch extensively. Are you talking about the
BPF JIT stuff? If so, I think that should possibly be handled a bit
differently, since I think the normal
write-to-rare-write-vmlinux-sections primitive should preferably *not*
be usable to write to executable pages. Using a real mm_struct for
this could help.

> last but not least, use_mm says this about itself:
>
>	(Note: this routine is intended to be called only
>	 from a kernel thread context)
>
> so using it will need some engineering (or the comment be fixed).

Indeed.

>> It has the added benefit that writes to non-rare-write data using the
>> rare-write primitive will fail.
>
> what is the threat model you're assuming for this feature? based on what i
> have for PaX (arbitrary read/write access exploited for data-only attacks),
> the above makes no sense to me...

If I use the primitive to try to write a value to the wrong section
(write to kernel text, for example), IMO it would be nice to OOPS
instead of succeeding.

Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
function will not always be carefully audited by a friendly security
expert such as yourself. It would be nice to harden the primitive to a
reasonable extent against minor misuses such as putting it in a context
where the compiler will emit mov-a-reg-with-WP-set-to-CR0; ret.
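For contrast, the single-shot form in this RFC keeps the begin/end
window entirely inside the macro expansion, so no open window ever
spans a function call. A usage sketch, with a hypothetical counter
variable:

	/* Hypothetical variable using the RFC's annotation. */
	static unsigned long handler_count __wr_rare;

	static void bump_handler_count(void)
	{
		/* expands to begin -> const-ignoring assignment -> end */
		rare_write(handler_count, handler_count + 1);
	}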
On Fri, 7 Apr 2017, PaX Team wrote:
> On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
>
> > On Fri, 7 Apr 2017, Mathias Krause wrote:
> > > Well, doesn't look good to me. NMIs will still be able to interrupt
> > > this code and will run with CR0.WP = 0.
> > >
> > > Shouldn't you instead question yourself why PaX can do it "just" with
> > > preempt_disable() instead?!
> >
> > That's silly. Just because PaX does it, doesn't mean it's correct.
>
> is that FUD or do you have actionable information to share?

That has absolutely nothing to do with FUD. I'm merely not accepting
argumentation which says: PaX can do it "just".... That has exactly
zero technical merit, and it's not asking too much to provide precise
technical arguments for why one implementation is better than some
other.

> > To be honest, playing games with the CR0.WP bit is outright stupid to begin with.
>
> why is that? cr0.wp exists since the i486 and its behaviour fits my
> purposes quite well, it's the best security/performance i know of.

"Works for me" has never been a good engineering principle.

> > Whether protected by preempt_disable or local_irq_disable, to make that
> > work it needs CR0 handling in the exception entry/exit at the lowest
> > level.
>
> correct.
>
> > And that's just a nightmare maintenance wise as it's prone to be
> > broken over time.
>
> i've got 14 years of experience of maintaining it and i never saw it break.

It's a difference whether you maintain a special purpose patch set out
of tree for a subset of architectures - I certainly know what I'm
talking about - or keep stuff sane in the upstream kernel.

> > I certainly don't want to take the chance to leak CR0.WP ever
>
> why and where would cr0.wp leak?

It's bound to happen due to some subtle mistake, and up to the point
where you catch it (in the scheduler or entry/exit path) the world is
writeable. And that will be some almost never executed error path which
can be triggered by a carefully crafted attack. A very restricted
writeable region is definitely preferred over a fully world-writeable
one then, right?

> > Why the heck should we care about rare writes being performant?
>
> because you've been misled by the NIH crowd here that the PaX feature they
> tried to (badly) extract from has anything to do with frequency of writes.

It would be appreciated if you could keep your feud out of this. It's
enough to tell me that 'rare write' is a misleading term and why.

> > Making the world and some more writeable hardly qualifies as tightly
> > focused.
>
> you forgot to add 'for a window of a few insns' and that the map/unmap

If it'd be guaranteed to be a few instructions, then I wouldn't be that
worried. The availability of make_world_writeable() as an unrestricted
usable function makes me nervous as hell. We've had long standing
issues where kmap_atomic() got leaked through a hard to spot, almost
never executed error handling path. And the same is bound to happen
with this, just with a way worse outcome.

> approach does the same under an attacker controlled ptr.

Which attacker controlled pointer?

Thanks,

	tglx
On Sat, Apr 08, 2017 at 08:20:03AM -0700, Andy Lutomirski wrote:
> On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> > The
> > submitted code is aimed at rare writes to globals, but this feature is
> > more than that, and design decisions shouldn't be based on just the
> > short term.
>
> Then, if you disagree with a proposed design, *explain why* in a
> standalone manner. Say what future uses a different design would
> have.
>
> > I actually care a lot more about 64-bit ARM support than I do x86, but
> > using a portable API for pax_open_kernel (for the simple uses at
> > least) is separate from choosing the underlying implementation. There
> > might not be a great way to do it on the architectures I care about,
> > but that doesn't need to hinder x86. It's really not that much code...
> > A weaker/slower implementation for x86 also encourages the same
> > elsewhere.
>
> No one has explained how CR0.WP is weaker or slower than my proposal.
> Here's what I'm proposing:
>
> At boot, choose a random address A. Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section. Compute O = (A - VA of rarely-written section). To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> This should work on any arch that has an MMU that allows this type of
> aliasing and that doesn't have PA-based protections on the
> rarely-written section.

Modulo randomization, that sounds exactly like what I had envisaged
[1], so that makes sense to me.

Thanks,
Mark.

[1] http://www.openwall.com/lists/kernel-hardening/2016/11/18/3
On Fri, Apr 07, 2017 at 04:45:26PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote:
> > Why that? It allows fast and CPU local modifications of r/o memory.
> > OTOH, an approach that needs to fiddle with page table entries
> > requires global synchronization to keep the individual TLB states in
> > sync. Hmm.. Not that fast, I'd say.
>
> The fixmaps used for kmap_atomic are per-cpu, no global sync required.
That might be fine for x86, but for some architectures fixmap slots and
kmap_atomic mappings happen to be visible to other CPUs even if they're
not required to be.
Using an mm solves that for all, though.
Thanks,
Mark.
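To make the cost trade-off concrete, a page-table-fiddling write looks
roughly like the sketch below: map a temporary writable alias, write,
and tear the alias down again; the teardown is what can require
cross-CPU TLB synchronization. This assumes the target is a direct-map
address and the write fits within one page; the helper name and error
handling are simplified:

	#include <linux/mm.h>
	#include <linux/vmalloc.h>

	static void write_via_alias(void *ro_addr, const void *src, size_t len)
	{
		struct page *page = virt_to_page(ro_addr);
		unsigned long offset = (unsigned long)ro_addr & ~PAGE_MASK;
		void *alias = vmap(&page, 1, VM_MAP, PAGE_KERNEL);	/* writable alias */

		if (!alias)
			return;		/* real code would report the failure */
		memcpy(alias + offset, src, len);
		vunmap(alias);		/* teardown; the TLB flush is the sync cost */
	}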
On 9 Apr 2017 at 17:10, Andy Lutomirski wrote:

> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> > writes and nothing else, use_mm suffers not only from the cr3 writes but
> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> > a fast path in the microcode and reduce its overhead by an order of magnitude.
>
> If the CR4 writes happen for this use case, that's a bug.

that depends on how you plan to handle perf/rdpmc users and how many
alternative mm structs you plan to manage (one global, one per cpu, one
per mm struct, etc).

> > you'll be duplicating TLB entries in the alternative PCID for both code
> > and data, where they will accumulate (=take room away from the normal PCID
> > and expose unwanted memory for access) unless you also flush them when
> > switching back (which then will cost even more cycles). also i'm not sure
> > that processors implement all the 12 PCID bits so depending on how many PCIDs
> > you plan to use, you could be causing even more unnecessary TLB replacements.
>
> Unless the CPU is rather dumber than I expect, the only duplicated
> entries should be for the writable aliases of pages that are written.
> The rest of the pages are global and should be shared for all PCIDs.

well, 4.10.2.4 has language like this (4.10.3.2 implies similar):

	A logical processor may use a global TLB entry to translate a linear
	address, even if the TLB entry is associated with a PCID different
	from the current PCID.

that to me says that global page entries are associated with a PCID and
may (not) be used while in another PCID. in Intel-speak that's not 'dumb'
but "tricks up our sleeve that we don't really want to tell you about in
detail, except perhaps under an NDA".
On Mon, Apr 10, 2017 at 3:42 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 9 Apr 2017 at 17:10, Andy Lutomirski wrote:
>
>> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
>> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
>> > writes and nothing else, use_mm suffers not only from the cr3 writes but
>> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB
>> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
>> > a fast path in the microcode and reduce its overhead by an order of magnitude.
>>
>> If the CR4 writes happen for this use case, that's a bug.
>
> that depends on how you plan to handle perf/rdpmc users and how many
> alternative mm structs you plan to manage (one global, one per cpu,
> one per mm struct, etc).

I was thinking one global unless more are needed for some reason.

>> > you'll be duplicating TLB entries in the alternative PCID for both code
>> > and data, where they will accumulate (=take room away from the normal PCID
>> > and expose unwanted memory for access) unless you also flush them when
>> > switching back (which then will cost even more cycles). also i'm not sure
>> > that processors implement all the 12 PCID bits so depending on how many PCIDs
>> > you plan to use, you could be causing even more unnecessary TLB replacements.
>>
>> Unless the CPU is rather dumber than I expect, the only duplicated
>> entries should be for the writable aliases of pages that are written.
>> The rest of the pages are global and should be shared for all PCIDs.
>
> well, 4.10.2.4 has language like this (4.10.3.2 implies similar):
>
>	A logical processor may use a global TLB entry to translate a linear
>	address, even if the TLB entry is associated with a PCID different
>	from the current PCID.
>
> that to me says that global page entries are associated with a PCID and
> may (not) be used while in another PCID. in Intel-speak that's not 'dumb'
> but "tricks up our sleeve that we don't really want to tell you about in
> detail, except perhaps under an NDA".

I read this as: the CPU still semantically tags global TLB entries with
a PCID, but the CPU will (probably) use those TLB entries even if the
PCIDs don't match. IIRC none of the TLB instructions have any effect
that makes the PCID associated with a global entry visible, so the CPU
could presumably omit the PCID tags entirely for global entries. E.g. I
don't think there's any way to say "flush global entries with a given
PCID".

--Andy
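Background for the cycle numbers being traded here: the "non-flushing
CR3 write" relies on CR4.PCIDE=1 and bit 63 of the value written to
CR3; with that bit set, cached translations tagged with the new PCID
are preserved. An illustrative sketch (the helper name is invented):

	#define CR3_NOFLUSH	(1UL << 63)	/* keep the new PCID's TLB entries */

	/* Sketch: switch page-table root + PCID without a TLB flush.
	 * Requires CR4.PCIDE=1; pgd_pa is the root's physical address. */
	static inline void write_cr3_noflush(unsigned long pgd_pa, unsigned long pcid)
	{
		unsigned long cr3 = pgd_pa | pcid | CR3_NOFLUSH;

		asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
	}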
On 9 Apr 2017 at 17:31, Andy Lutomirski wrote:

> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
> >
> I consider breaking buggy drivers (in a way that they either generally
> work okay

do they work okay when the dma transfer goes to a buffer that crosses
physically non-contiguous page boundaries?

> or that they break with a nice OOPS, depending on config) to
> be better than having a special case in what's supposed to be a fast
> path to keep them working. I did consider forcing the relevant debug
> options on for a while just to help shake these bugs out of the
> woodwork faster.

that's a false dichotomy, discovering buggy drivers is orthogonal to
(not) breaking users' systems as grsec shows. and how did you expect to
'shake these bugs out' when your own suggestion at the time was for
distros to not enable this feature 'for a while'?

> > i have yet to see anyone explain what they mean by 'leak' here but if it
> > is what i think it is then the arch specific entry/exit changes are not
> > optional but mandatory. see below for randomization.
>
> By "leak" I mean that a bug or exploit causes unintended code to run
> with CR0.WP cleared, with a special CR3, or with a special PTE loaded.

how can a bug/exploit cause something like this?

> PaX hooks the entry code to avoid leaks.

PaX doesn't instrument enter/exit paths to prevent state leaks into
interrupt context (it's a useful sideeffect though), rather it's needed
for correctness if the kernel can be interrupted at all while it's open
(address space switching will need to handle this too but you have yet
to address it).

> >> At boot, choose a random address A.
> >
> > what is the threat that a random address defends against?
>
> Makes it harder to exploit a case where the CR3 setting leaks.

if an attacker has the ability to cause this leak (details of which are
subject to the question i asked above) then why wouldn't he simply also
make use of the primitives to modify his target via the writable vma
without ever having to know the randomized address? i also wonder what
exploit power you assume for this attack and whether that is already
enough to simply go after page tables, etc instead of figuring out the
alternative address space.

> > the problem is that the amount of __read_only data extends beyond vmlinux,
> > i.e., this approach won't scale. another problem is that it can't be used
> > inside use_mm and switch_mm themselves (no read-only task structs or percpu
> > pgd for you ;) and probably several other contexts.
>
> Can you clarify these uses that extend beyond vmlinux?

one obvious candidate is modules. how do you want to handle them? then
there's a whole bunch of dynamically allocated data that is a candidate
for __read_only treatment.

> > what is the threat model you're assuming for this feature? based on what i
> > have for PaX (arbitrary read/write access exploited for data-only attacks),
> > the above makes no sense to me...
>
> If I use the primitive to try to write a value to the wrong section
> (write to kernel text, for example), IMO it would be nice to OOPS
> instead of succeeding.

this doesn't tell me what power you're assuming the attacker has. is it
my generic arbitrary read-write ability or something more restricted and
thus less realistic? i.e., how does the attacker get to 'use the
primitive' and (presumably) also control the ptr/data?

as for your specific example, kernel text isn't the 'non-rare-write
data' that you spoke of before, but that aside, what prevents an
attacker from computing his target ptr so that after your accessor
rebases it, it'd point back to his intended target instead? will you
range-check (find_vma eventually?) each time? how will you make all
this code safe from races from another task? the more checks you make,
the more likely that something sensitive will spill to memory and be a
target itself in order to hijack the sensitive write.

> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
> function will not always be carefully audited by a friendly security
> expert such as yourself. It would be nice to harden the primitive to a
> reasonable extent against minor misuses such as putting it in a
> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
> ret.

i don't understand what's there to audit. if you want to treat a given
piece of data as __read_only then you have no choice but to allow writes
to it via the open/close mechanism, and the compiler can tell you just
where those writes are (and even do the instrumentation when you get
tired of doing it by hand).
On 10 Apr 2017 at 10:26, Thomas Gleixner wrote:

> On Fri, 7 Apr 2017, PaX Team wrote:
> > On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
> > > That's silly. Just because PaX does it, doesn't mean it's correct.
> >
> > is that FUD or do you have actionable information to share?
>
> That has absolutely nothing to do with FUD. I'm merely not accepting
> argumentation which says: PaX can do it "just"....

you implied that what PaX does may not be correct. if you can't back
that up with facts and technical arguments then it is FUD. your turn.

> That has exactly zero technical merit, and it's not asking too much to
> provide precise technical arguments for why one implementation is
> better than some other.

exactly. start with explaining what is not correct in PaX with "precise
technical arguments".

> > > To be honest, playing games with the CR0.WP bit is outright stupid to begin with.
> >
> > why is that? cr0.wp exists since the i486 and its behaviour fits my
> > purposes quite well, it's the best security/performance i know of.
>
> "Works for me" has never been a good engineering principle.

good thing i didn't say that. on the other hand you failed to provide
"precise technical arguments" for why "playing games with the CR0.WP
bit is outright stupid to begin with". do you have any to share and
discuss?

> > > And that's just a nightmare maintenance wise as it's prone to be
> > > broken over time.
> >
> > i've got 14 years of experience of maintaining it and i never saw it break.
>
> It's a difference whether you maintain a special purpose patch set out
> of tree for a subset of architectures - I certainly know what I'm
> talking about - or keep stuff sane in the upstream kernel.

there's no difference to me, i keep my stuff sane regardless. of course
what you do with your out-of-tree code is your business but don't
extrapolate it to mine. now besides argumentum ad verecundiam do you
have "precise technical arguments" as to why maintaining a cr0.wp based
approach would be "a nightmare maintenance wise as it's prone to be
broken over time"?

> > > I certainly don't want to take the chance to leak CR0.WP ever
> >
> > why and where would cr0.wp leak?
>
> It's bound to happen due to some subtle mistake,

i don't see what subtle mistake you're thinking of here. can you give me
an example?

> and up to the point where you catch it (in the scheduler or entry/exit path)
> the world is writeable.

where such a leak is caught depends on what subtle mistake you're
talking about, so let's get back to this point once you've answered
that question.

> And that will be some almost never executed error path which can
> be triggered by a carefully crafted attack.

open/close calls have nothing to do with error paths or even conditional
execution, they're always executed as a sequence so this situation
cannot occur.

> A very restricted writeable region is definitely preferred over a fully
> world-writeable one then, right?

it doesn't matter when the attacker has an arbitrary read/write
primitive which he can just use to modify that 'very restricted
writeable region' to whatever he needs to cover first. now if all the
data managing this region were also protected then it'd matter but
that's never going to happen in the upstream kernel.

> > > Making the world and some more writeable hardly qualifies as tightly
> > > focused.
> >
> > you forgot to add 'for a window of a few insns' and that the map/unmap
>
> If it'd be guaranteed to be a few instructions, then I wouldn't be that
> worried.

it is, by definition: assignments to otherwise __read_only data are
(have to be) bracketed with open/close instrumentation.

> The availability of make_world_writeable() as an unrestricted
> usable function makes me nervous as hell.

i can't imagine the nightmares you must have lived through for the two
decades during which the kernel was all wide open... joking aside, we
can address these fears of yours once you explain just what kind of
error situations you have in mind and why they don't also apply to,
say, text_poke().

> We've had long standing issues where kmap_atomic() got leaked through a
> hard to spot, almost never executed error handling path. And the same is
> bound to happen with this, just with a way worse outcome.

the lifetime and use of kmaps is very different, so you'll have to
explain in more detail why the problems they had apply here as well.

> > approach does the same under an attacker controlled ptr.
>
> Which attacker controlled pointer?

i meant the one passed to the map/unmap code; attacker control over it
means the whole world is effectively writable again (Andy's vma
approach would restrict this to just the interesting read-only data,
except the whole thing is irrelevant until all participating data (vma,
page tables, etc) are also protected).
On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
>> No one has explained how CR0.WP is weaker or slower than my proposal.
>
> you misunderstood, Daniel was talking about your use_mm approach.
>
>> Here's what I'm proposing:
>>
>> At boot, choose a random address A.
>
> what is the threat that a random address defends against?
>
>> Create an mm_struct that has a
>> single VMA starting at A that represents the kernel's rarely-written
>> section. Compute O = (A - VA of rarely-written section). To do a
>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> the problem is that the amount of __read_only data extends beyond vmlinux,
> i.e., this approach won't scale. another problem is that it can't be used
> inside use_mm and switch_mm themselves (no read-only task structs or percpu
> pgd for you ;) and probably several other contexts.
These are the limitations that concern me: what will we NOT be able to
make read-only as a result of the use_mm() design choice? My RFC
series included a simple case and a constify case, but I did not
include things like making page tables read-only, etc.
I can accept not using cr0, since we need to design something that
works on arm64 too, which doesn't have anything like this (AFAIK), but
I'd like to make sure we don't paint ourselves into a corner.
-Kees
--
Kees Cook
Pixel Security
On Mon, Apr 10, 2017 at 1:13 PM, Kees Cook <keescook@chromium.org> wrote:
> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
>>> No one has explained how CR0.WP is weaker or slower than my proposal.
>>
>> you misunderstood, Daniel was talking about your use_mm approach.
>>
>>> Here's what I'm proposing:
>>>
>>> At boot, choose a random address A.
>>
>> what is the threat that a random address defends against?
>>
>>> Create an mm_struct that has a
>>> single VMA starting at A that represents the kernel's rarely-written
>>> section. Compute O = (A - VA of rarely-written section). To do a
>>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>>
>> the problem is that the amount of __read_only data extends beyond vmlinux,
>> i.e., this approach won't scale. another problem is that it can't be used
>> inside use_mm and switch_mm themselves (no read-only task structs or percpu
>> pgd for you ;) and probably several other contexts.
>
> These are the limitations that concern me: what will we NOT be able to
> make read-only as a result of the use_mm() design choice? My RFC
> series included a simple case and a constify case, but I did not
> include things like making page tables read-only, etc.
If we make page tables read-only, we may need to have multiple levels
of rareness. Page table writes aren't all that rare, and I can
imagine distros configuring the kernel so that static structs full of
function pointers are read-only (IMO that should be the default or
even mandatory), but page tables may be a different story.
That being said, CR3-twiddling to write to page tables could actually
work. Hmm.
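A second "level of rareness" could plausibly reuse the same begin/end
discipline with a dedicated gate for page tables; the helper and gate
names below are invented purely for illustration:

	/* Hypothetical: page-table writes behind their own gate. */
	static inline void rare_set_pte(pte_t *ptep, pte_t pte)
	{
		rare_write_begin();		/* or a dedicated pgtable variant */
		__rare_write(*ptep, pte);	/* const-ignoring assignment */
		rare_write_end();
	}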
On Mon, Apr 10, 2017 at 12:47 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 9 Apr 2017 at 17:31, Andy Lutomirski wrote:
>
>> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>> >
>> I consider breaking buggy drivers (in a way that they either generally
>> work okay
>
> do they work okay when the dma transfer goes to a buffer that crosses
> physically non-contiguous page boundaries?

Nope. Like I said, I considered making the debugging mandatory. I may
still send patches to do that.

>> By "leak" I mean that a bug or exploit causes unintended code to run
>> with CR0.WP cleared, with a special CR3, or with a special PTE loaded.
>
> how can a bug/exploit cause something like this?

For example: a bug in entry logic, a bug in perf NMI handling, or even
a bug in *nested* perf NMI handling (egads!). Or maybe some super nasty
interaction with suspend/resume. These are all fairly unlikely (except
the nested perf case), but still.

As a concrete example, back before my big NMI improvement series, it
was possible for an NMI return to invoke espfix and/or take an IRET
fault. This *shouldn't* happen on return to a context with CR0.WP
clear, but it would be incredibly nasty if it did. The code is
separated out now, so it should be okay...

>> PaX hooks the entry code to avoid leaks.
>
> PaX doesn't instrument enter/exit paths to prevent state leaks into
> interrupt context (it's a useful sideeffect though), rather it's needed
> for correctness if the kernel can be interrupted at all while it's open
> (address space switching will need to handle this too but you have yet
> to address it).

I don't think we disagree here. A leak would be a case of
incorrectness.

>> >> At boot, choose a random address A.
>> >
>> > what is the threat that a random address defends against?
>>
>> Makes it harder to exploit a case where the CR3 setting leaks.
>
> if an attacker has the ability to cause this leak (details of which are
> subject to the question i asked above) then why wouldn't he simply also
> make use of the primitives to modify his target via the writable vma
> without ever having to know the randomized address? i also wonder what
> exploit power you assume for this attack and whether that is already
> enough to simply go after page tables, etc instead of figuring out the
> alternative address space.

I'm imagining the power to (a) cause some code path to execute while
the kernel is "open" and (b) the ability to use the buggy code path in
question to write to a fully- or partially-controlled address. With
CR0.WP clear, this can write shellcode directly. With CR3 pointing to a
page table that maps some parts of the kernel (but not text!) at a
randomized offset, you need to figure out the offset and find some
other target in the mapping that gets your exploit farther along. You
can't write shellcode directly.

>> > the problem is that the amount of __read_only data extends beyond vmlinux,
>> > i.e., this approach won't scale. another problem is that it can't be used
>> > inside use_mm and switch_mm themselves (no read-only task structs or percpu
>> > pgd for you ;) and probably several other contexts.
>>
>> Can you clarify these uses that extend beyond vmlinux?
>
> one obvious candidate is modules. how do you want to handle them? then
> there's a whole bunch of dynamically allocated data that is a candidate
> for __read_only treatment.

Exactly the same way. Map those regions at the same offset, maybe even
in the same VMA. There's no reason that an artificial VMA used for this
purpose can't be many gigabytes long and have vm_ops that only allow
access to certain things. But multiple VMAs would work, too.

>> > what is the threat model you're assuming for this feature? based on what i
>> > have for PaX (arbitrary read/write access exploited for data-only attacks),
>> > the above makes no sense to me...
>>
>> If I use the primitive to try to write a value to the wrong section
>> (write to kernel text, for example), IMO it would be nice to OOPS
>> instead of succeeding.
>
> this doesn't tell me what power you're assuming the attacker has. is it
> my generic arbitrary read-write ability or something more restricted and
> thus less realistic? i.e., how does the attacker get to 'use the
> primitive' and (presumably) also control the ptr/data?
>
> as for your specific example, kernel text isn't the 'non-rare-write
> data' that you spoke of before, but that aside, what prevents an
> attacker from computing his target ptr so that after your accessor
> rebases it, it'd point back to his intended target instead?

It's a restriction on what targets can be hit. With CR0.WP, you can hit
anything that has a VA. With CR3, you can hit only that which is
mapped.

> will you range-check (find_vma eventually?) each time? how will you
> make all this code safe from races from another task? the more checks
> you make, the more likely that something sensitive will spill to memory
> and be a target itself in order to hijack the sensitive write.

There's no code here making the checks at write time. It's just page
table / VMA setup.

>> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
>> function will not always be carefully audited by a friendly security
>> expert such as yourself. It would be nice to harden the primitive to a
>> reasonable extent against minor misuses such as putting it in a
>> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
>> ret.
>
> i don't understand what's there to audit. if you want to treat a given
> piece of data as __read_only then you have no choice but to allow writes
> to it via the open/close mechanism, and the compiler can tell you just
> where those writes are (and even do the instrumentation when you get
> tired of doing it by hand).

I mean auditing all uses of pax_open_kernel() or any other function
that opens the kernel. That function is, as used in PaX, terrifying.
PaX probably gets every user right, but I don't trust driver writers
with a function like pax_open_kernel() that's as powerful as PaX's.
Suppose you get driver code like this:

	void foo(int (*func)())
	{
		pax_open_kernel();
		*thingy = func();
		pax_close_kernel();
	}

That would be a very, very juicy target for a ROP-like attack. Just get
the kernel to call this function with func pointing to something that
does a memcpy or similar into executable space. Boom, shellcode
execution. If CR3 is used instead, exploiting this is considerably more
complicated.
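With the single-shot macro from this RFC, the same driver pattern would
compute the value before any window opens, so an attacker-supplied
function never runs while the kernel is writable. A sketch, reusing the
same hypothetical thingy pointer:

	void foo(int (*func)(void))
	{
		int val = func();	/* runs with protections intact */

		rare_write(*thingy, val);	/* open -> assign -> close */
	}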