-- Summary --

Preliminary version of the memory protection patchset, including a
sample use case: turning the IMA measurement list into write rare
memory.

The core idea is to introduce two new types of memory protection,
besides const and __ro_after_init, which will support:
- statically allocated "write rare" memory
- dynamically allocated "read only" and "write rare" memory

On top of that follows a set of patches which create a "write rare"
counterpart of the kernel infrastructure used in the example chosen
for hardening: the IMA measurement list.

-- Mechanism --

Statically allocated protected memory is identified by the
__wr_after_init tag, which causes the linker to place it in a special
section. Dynamically allocated memory is obtained through vmalloc(),
but compacting each allocation, where possible, into the most recently
obtained vmap_area.

The write rare mechanism is implemented by creating a temporary
alternate writable mapping, applying the change through this mapping
and then removing it. All of this is possible thanks to the system
MMU, which must be able to provide write protection.

-- Brief history --

I sent out various versions of memory protection over the last year or
so; however, this patchset is significantly expanded, including
several helper data structures and a use case, so I decided to reset
the numbering to v1. For reference, the latest "old" version is here
[1].

The current version is not yet ready for merging, but I think it is
sufficiently complete to support an end-to-end discussion. Eventually,
I plan to write a white paper, once the code is in better shape. In
the meanwhile, an overview can be had from these slides [2], which are
the support material for my presentation at the Linux Security Summit
2018 Europe.

-- Validation --

Most of the testing is done on a Fedora image, with QEMU x86_64;
however, the code has also been tested on a real x86_64 PC, yielding
similar positive results.
For ARM64, I use a custom Debian installation, still with QEMU, but I
have obtained similar failures when testing on a real device, using a
Kirin 970.

I have written some test cases for the most basic parts. The behaviour
of IMA and of the Fedora image in general does not seem to be
negatively affected when used in conjunction with this patchset.
However, this is far from exhaustive testing, and the RCU torture test
is completely missing.

-- Known Issues --

As said, this version is preliminary and certain parts need rework.
This is a short and incomplete list of known issues:

* arm64 support is broken for __wr_after_init: I must create a
  separate section with proper mappings, similar to the ones used for
  vmalloc()
* alignment of data structures has not been thoroughly checked; there
  are probably several redundant forced alignments
* there is no fallback for platforms missing MMU write protection
* some additional care might be needed when dealing with double
  mapping vs data cache coherency, on multicore systems
* lots of additional (stress) tests are needed
* memory reuse (object caches) is probably needed, to support
  converting more use cases, and with them other data structures
* credits for original code: I have reimplemented various data
  structures and I am not sure if I have given credit correctly to the
  original authors
* documentation for the re-implemented data structures is missing
* confirm that the hardened usercopy logic is correct

-- Q&As --

During reviews of the older patchset, several objections and questions
were formulated. They are collected here in Q&A format, with both some
old and new answers:

1 - "The protection can still be undone"

Yes, it is true. Using a hypervisor, as is done in certain Huawei and
Samsung phones, provides a better level of protection. However, even
without that, this still gives a significantly better level of
protection than not protecting the memory at all.
The main advantage of this patchset is that the attack now has to
focus on the page table, which is a significantly smaller area than
the whole kernel data. It is my intention, eventually, to also provide
support for interaction with a FOSS hypervisor (e.g. KVM), but this
patchset should also support those cases where it is not even possible
to have a hypervisor, so it seems simpler to start from there. The
hypervisor is not mandatory.

2 - "Do not provide a chain of trust, but protect some memory and
     refer to it with a writable pointer."

This might be ok for protecting against bugs, but in the case of an
attacker trying to compromise the system, the unprotected pointer
becomes the new target, so it doesn't change much. Samsung does use a
similar implementation for protecting LSM hooks; however, that
solution also adds a pointer from the protected memory back to the
writable memory, as a validation loop. The price to pay is that every
time the unprotected pointer is used, it first has to be validated: it
must point into a certain memory range and have a specific alignment.
It is an alternative to the full chain of trust, and each solution has
its specific advantages, depending on the data structures that one
wants to protect.

3 - "Do not use a secondary mapping, unprotect the current one"

The purpose of the secondary mapping is to create a hard-to-spot
window of writability at a random address, which cannot be easily
exploited. Unprotecting the primary mapping would allow an attack
where a core busy-loops, trying to figure out if a specific location
becomes writable, racing against the legitimate writer. For the same
reason, interrupts are disabled on the core that is performing the
write rare operation.

4 - "Do not create another allocator over vmalloc(), use it plain"

This is not good, for various reasons:

a) vmalloc() allocates at least one page for every request it
   receives, typically leaving most of the page unused.
   While this might not be a big deal on large systems, on IoT-class
   devices it is possible to find relatively powerful cores paired
   with relatively little memory. Taking as an example a system using
   SELinux, a relatively small set of rules can generate a few
   thousand allocations (SELinux is deny-by-default). Modeling each
   allocation as about 64 bytes, on a system with 4kB pages, and
   assuming a grand total of 100k allocations, one page per allocation
   means:

	100k * 4kB = ~390MB

   while packing the allocations into 64-byte slots within pages
   yields:

	100k * 64B = ~6MB

   The first case would not be very compatible with a system having
   only 512MB or 1GB of memory.

b) even worse, the amount of thrashing of the TLB would be terrible,
   with each allocation having its own translation.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>

-- References --

[1]: https://lkml.org/lkml/2018/4/23/508
[2]: https://events.linuxfoundation.org/wp-content/uploads/2017/12/Kernel-Hardening-Protecting-the-Protection-Mechanisms-Igor-Stoppa-Huawei.pdf

-- List of patches --

[PATCH 01/17] prmem: linker section for static write rare
[PATCH 02/17] prmem: write rare for static allocation
[PATCH 03/17] prmem: vmalloc support for dynamic allocation
[PATCH 04/17] prmem: dynamic allocation
[PATCH 05/17] prmem: shorthands for write rare on common types
[PATCH 06/17] prmem: test cases for memory protection
[PATCH 07/17] prmem: lkdtm tests for memory protection
[PATCH 08/17] prmem: struct page: track vmap_area
[PATCH 09/17] prmem: hardened usercopy
[PATCH 10/17] prmem: documentation
[PATCH 11/17] prmem: llist: use designated initializer
[PATCH 12/17] prmem: linked list: set alignment
[PATCH 13/17] prmem: linked list: disable layout randomization
[PATCH 14/17] prmem: llist, hlist, both plain and rcu
[PATCH 15/17] prmem: test cases for prlist and prhlist
[PATCH 16/17] prmem: pratomic-long
[PATCH 17/17] prmem: ima: turn the measurements list write rare

-- Diffstat --

 Documentation/core-api/index.rst          |   1 +
 Documentation/core-api/prmem.rst          | 172 +++++
 MAINTAINERS                               |  14 +
 drivers/misc/lkdtm/core.c                 |  13 +
 drivers/misc/lkdtm/lkdtm.h                |  13 +
 drivers/misc/lkdtm/perms.c                | 248 +++++++
 include/asm-generic/vmlinux.lds.h         |  20 +
 include/linux/cache.h                     |  17 +
 include/linux/list.h                      |   5 +-
 include/linux/mm_types.h                  |  25 +-
 include/linux/pratomic-long.h             |  73 ++
 include/linux/prlist.h                    | 934 ++++++++++++++++++++++++
 include/linux/prmem.h                     | 446 +++++++++++
 include/linux/prmemextra.h                | 133 ++++
 include/linux/types.h                     |  20 +-
 include/linux/vmalloc.h                   |  11 +-
 lib/Kconfig.debug                         |   9 +
 lib/Makefile                              |   1 +
 lib/test_prlist.c                         | 252 +++++++
 mm/Kconfig                                |   6 +
 mm/Kconfig.debug                          |   9 +
 mm/Makefile                               |   2 +
 mm/prmem.c                                | 273 +++++++
 mm/test_pmalloc.c                         | 629 ++++++++++++++++
 mm/test_write_rare.c                      | 236 ++++++
 mm/usercopy.c                             |   5 +
 mm/vmalloc.c                              |   7 +
 security/integrity/ima/ima.h              |  18 +-
 security/integrity/ima/ima_api.c          |  29 +-
 security/integrity/ima/ima_fs.c           |  12 +-
 security/integrity/ima/ima_main.c         |   6 +
 security/integrity/ima/ima_queue.c        |  28 +-
 security/integrity/ima/ima_template.c     |  14 +-
 security/integrity/ima/ima_template_lib.c |  14 +-
 34 files changed, 3635 insertions(+), 60 deletions(-)
Introduce a section and a label for statically allocated write rare
data. The label is named "__wr_after_init". As the name implies, after
the init phase is completed, this section will be modifiable only by
invoking write rare functions.

NOTE: this needs rework, because the current write rare mechanism
works only on x86_64 and not on arm64, due to the arm64 mappings.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>

CC: Arnd Bergmann <arnd@arndb.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: linux-arch@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 include/asm-generic/vmlinux.lds.h | 20 ++++++++++++++++++++
 include/linux/cache.h             | 17 +++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d7701d466b60..fd40a15e3b24 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -300,6 +300,25 @@
 	. = __start_init_task + THREAD_SIZE;	\
 	__end_init_task = .;
 
+/*
+ * Allow architectures to handle wr_after_init data on their
+ * own by defining an empty WR_AFTER_INIT_DATA.
+ * However, it's important that pages containing WR_RARE data do not
+ * hold anything else, to avoid both accidentally unprotecting something
+ * that is supposed to stay read-only all the time and also accidentally
+ * protecting something else that is supposed to be writeable all the time.
+ */
+#ifndef WR_AFTER_INIT_DATA
+#define WR_AFTER_INIT_DATA(align)	\
+	. = ALIGN(PAGE_SIZE);		\
+	__start_wr_after_init = .;	\
+	. = ALIGN(align);		\
+	*(.data..wr_after_init)		\
+	. = ALIGN(PAGE_SIZE);		\
+	__end_wr_after_init = .;	\
+	. = ALIGN(align);
+#endif
+
 /*
  * Allow architectures to handle ro_after_init data on their
  * own by defining an empty RO_AFTER_INIT_DATA.
@@ -320,6 +339,7 @@
 		__start_rodata = .;		\
 		*(.rodata) *(.rodata.*)		\
 		RO_AFTER_INIT_DATA	/* Read only after init */	\
+		WR_AFTER_INIT_DATA(align)	/* wr after init */	\
 		KEEP(*(__vermagic))	/* Kernel version magic */	\
 		. = ALIGN(8);			\
 		__start___tracepoints_ptrs = .;	\
diff --git a/include/linux/cache.h b/include/linux/cache.h
index 750621e41d1c..9a7e7134b887 100644
--- a/include/linux/cache.h
+++ b/include/linux/cache.h
@@ -31,6 +31,23 @@
 #define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
 #endif
 
+/*
+ * __wr_after_init is used to mark objects that cannot be modified
+ * directly after init (i.e. after mark_rodata_ro() has been called).
+ * These objects become effectively read-only, from the perspective of
+ * performing a direct write, like a variable assignment.
+ * However, they can be altered through a dedicated function.
+ * It is intended for those objects which are occasionally modified after
+ * init, but so seldom that the extra cost of the indirect modification
+ * is either negligible or worth paying, for the sake of the protection
+ * gained.
+ */
+#ifndef __wr_after_init
+#define __wr_after_init \
+	__attribute__((__section__(".data..wr_after_init")))
+#endif
+
+
 #ifndef ____cacheline_aligned
 #define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
 #endif
-- 
2.17.1
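As a hypothetical usage of the new marker (sketch only; the variable
name is made up, and after init the variable can only be changed
through the write rare functions introduced later in the series):

```c
/* Lands in .data..wr_after_init: writable during init, then
 * write-protected together with rodata once mark_rodata_ro()
 * has run. */
static unsigned long ima_wr_flags __wr_after_init;
```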
Implementation of write rare for statically allocated data, located in
a specific memory section through the use of the __wr_after_init
label.

The basic functions are wr_memcpy() and wr_memset(): the write rare
counterparts of memcpy() and memset() respectively.

To minimize the chances of attacks, this implementation does not
unprotect existing memory pages. Instead, it remaps them, one by one,
at random free locations, as writable. Each page is mapped as writable
strictly for the time needed to perform changes in said page. While a
page is remapped, interrupts are disabled on the core performing the
write rare operation, to avoid being frozen mid-air by an attack using
interrupts to stretch the duration of the alternate mapping. OTOH, to
avoid introducing unpredictable delays, interrupts are re-enabled in
between page remappings, when write operations are either completed or
not yet started, and there is no alternate, writable mapping to
exploit.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>

CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. 
Shutemov" <kirill.shutemov@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Pavel Tatashin <pasha.tatashin@oracle.com> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- MAINTAINERS | 7 ++ include/linux/prmem.h | 213 ++++++++++++++++++++++++++++++++++++++++++ mm/Makefile | 1 + mm/prmem.c | 10 ++ 4 files changed, 231 insertions(+) create mode 100644 include/linux/prmem.h create mode 100644 mm/prmem.c diff --git a/MAINTAINERS b/MAINTAINERS index b2f710eee67a..e566c5d09faf 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9454,6 +9454,13 @@ F: kernel/sched/membarrier.c F: include/uapi/linux/membarrier.h F: arch/powerpc/include/asm/membarrier.h +MEMORY HARDENING +M: Igor Stoppa <igor.stoppa@gmail.com> +L: kernel-hardening@lists.openwall.com +S: Maintained +F: include/linux/prmem.h +F: mm/prmem.c + MEMORY MANAGEMENT L: linux-mm@kvack.org W: http://www.linux-mm.org diff --git a/include/linux/prmem.h b/include/linux/prmem.h new file mode 100644 index 000000000000..3ba41d76a582 --- /dev/null +++ b/include/linux/prmem.h @@ -0,0 +1,213 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * prmem.h: Header for memory protection library + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + * + * Support for: + * - statically allocated write rare data + */ + +#ifndef _LINUX_PRMEM_H +#define _LINUX_PRMEM_H + +#include <linux/set_memory.h> +#include <linux/mm.h> +#include <linux/vmalloc.h> +#include <linux/string.h> +#include <linux/slab.h> +#include <linux/mutex.h> +#include <linux/compiler.h> +#include <linux/irqflags.h> +#include <linux/set_memory.h> + +/* ============================ Write Rare ============================ */ + +extern const char WR_ERR_RANGE_MSG[]; +extern const char WR_ERR_PAGE_MSG[]; + +/* + * The following two variables are statically allocated by the linker + * script at the boundaries of the memory region (rounded up to + * multiples of PAGE_SIZE) reserved for __wr_after_init. 
+ */ +extern long __start_wr_after_init; +extern long __end_wr_after_init; + +static __always_inline bool __is_wr_after_init(const void *ptr, size_t size) +{ + size_t start = (size_t)&__start_wr_after_init; + size_t end = (size_t)&__end_wr_after_init; + size_t low = (size_t)ptr; + size_t high = (size_t)ptr + size; + + return likely(start <= low && low < high && high <= end); +} + +/** + * wr_memset() - sets n bytes of the destination to the c value + * @dst: beginning of the memory to write to + * @c: byte to replicate + * @size: amount of bytes to copy + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_memset(const void *dst, const int c, size_t n_bytes) +{ + size_t size; + unsigned long flags; + uintptr_t d = (uintptr_t)dst; + + if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) + return false; + while (n_bytes) { + struct page *page; + uintptr_t base; + uintptr_t offset; + uintptr_t offset_complement; + + local_irq_save(flags); + page = virt_to_page(d); + offset = d & ~PAGE_MASK; + offset_complement = PAGE_SIZE - offset; + size = min(n_bytes, offset_complement); + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL); + if (WARN(!base, WR_ERR_PAGE_MSG)) { + local_irq_restore(flags); + return false; + } + memset((void *)(base + offset), c, size); + vunmap((void *)base); + d += size; + n_bytes -= size; + local_irq_restore(flags); + } + return true; +} + +/** + * wr_memcpy() - copies n bytes from source to destination + * @dst: beginning of the memory to write to + * @src: beginning of the memory to read from + * @n_bytes: amount of bytes to copy + * + * Returns true on success, false otherwise. 
+ */ +static __always_inline +bool wr_memcpy(const void *dst, const void *src, size_t n_bytes) +{ + size_t size; + unsigned long flags; + uintptr_t d = (uintptr_t)dst; + uintptr_t s = (uintptr_t)src; + + if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) + return false; + while (n_bytes) { + struct page *page; + uintptr_t base; + uintptr_t offset; + uintptr_t offset_complement; + + local_irq_save(flags); + page = virt_to_page(d); + offset = d & ~PAGE_MASK; + offset_complement = PAGE_SIZE - offset; + size = (size_t)min(n_bytes, offset_complement); + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL); + if (WARN(!base, WR_ERR_PAGE_MSG)) { + local_irq_restore(flags); + return false; + } + __write_once_size((void *)(base + offset), (void *)s, size); + vunmap((void *)base); + d += size; + s += size; + n_bytes -= size; + local_irq_restore(flags); + } + return true; +} + +/* + * rcu_assign_pointer is a macro, which takes advantage of being able to + * take the address of the destination parameter "p", so that it can be + * passed to WRITE_ONCE(), which is called in one of the branches of + * rcu_assign_pointer() and also, being a macro, can rely on the + * preprocessor for taking the address of its parameter. + * For the sake of staying compatible with the API, also + * wr_rcu_assign_pointer() is a macro that accepts a pointer as parameter, + * instead of the address of said pointer. + * However it is simply a wrapper to __wr_rcu_ptr(), which receives the + * address of the pointer. 
+ */ +static __always_inline +uintptr_t __wr_rcu_ptr(const void *dst_p_p, const void *src_p) +{ + unsigned long flags; + struct page *page; + void *base; + uintptr_t offset; + const size_t size = sizeof(void *); + + if (WARN(!__is_wr_after_init(dst_p_p, size), WR_ERR_RANGE_MSG)) + return (uintptr_t)NULL; + local_irq_save(flags); + page = virt_to_page(dst_p_p); + offset = (uintptr_t)dst_p_p & ~PAGE_MASK; + base = vmap(&page, 1, VM_MAP, PAGE_KERNEL); + if (WARN(!base, WR_ERR_PAGE_MSG)) { + local_irq_restore(flags); + return (uintptr_t)NULL; + } + rcu_assign_pointer((*(void **)(offset + (uintptr_t)base)), src_p); + vunmap(base); + local_irq_restore(flags); + return (uintptr_t)src_p; +} + +#define wr_rcu_assign_pointer(p, v) __wr_rcu_ptr(&p, v) + +#define __wr_simple(dst_ptr, src_ptr) \ + wr_memcpy(dst_ptr, src_ptr, sizeof(*(src_ptr))) + +#define __wr_safe(dst_ptr, src_ptr, \ + unique_dst_ptr, unique_src_ptr) \ +({ \ + typeof(dst_ptr) unique_dst_ptr = (dst_ptr); \ + typeof(src_ptr) unique_src_ptr = (src_ptr); \ + \ + wr_memcpy(unique_dst_ptr, unique_src_ptr, \ + sizeof(*(unique_src_ptr))); \ +}) + +#define __safe_ops(dst, src) \ + (__typecheck(dst, src) && __no_side_effects(dst, src)) + +/** + * wr - copies an object over another of same type and size + * @dst_ptr: address of the destination object + * @src_ptr: address of the source object + */ +#define wr(dst_ptr, src_ptr) \ + __builtin_choose_expr(__safe_ops(dst_ptr, src_ptr), \ + __wr_simple(dst_ptr, src_ptr), \ + __wr_safe(dst_ptr, src_ptr, \ + __UNIQUE_ID(__dst_ptr), \ + __UNIQUE_ID(__src_ptr))) + +/** + * wr_ptr() - alters a pointer in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. 
+ */ +static __always_inline +bool wr_ptr(const void *dst, const void *val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} +#endif diff --git a/mm/Makefile b/mm/Makefile index 26ef77a3883b..215c6a6d7304 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -64,6 +64,7 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o +obj-$(CONFIG_PRMEM) += prmem.o obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_PAGE_POISONING) += page_poison.o obj-$(CONFIG_SLAB) += slab.o diff --git a/mm/prmem.c b/mm/prmem.c new file mode 100644 index 000000000000..de9258f5f29a --- /dev/null +++ b/mm/prmem.c @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * prmem.c: Memory Protection Library + * + * (C) Copyright 2017-2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + */ + +const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range."; +const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page."; -- 2.17.1
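As a usage sketch for the API introduced by this patch (kernel code,
not runnable standalone; the variable and function names here are made
up for illustration):

```c
/* Sketch of a write-rare update using the helpers above. */
static int threshold __wr_after_init = 10;

static void set_threshold(int new)
{
	/* After mark_rodata_ro(), "threshold = new;" would fault,
	 * because the page is mapped read-only.  wr() typechecks the
	 * operands and forwards to wr_memcpy(), which remaps the page
	 * writable at a temporary random address just for this store,
	 * with interrupts disabled on the local core. */
	wr(&threshold, &new);
}
```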
Prepare vmalloc for:
- tagging areas used for dynamic allocation of protected memory
- supporting various tags, related to the properties that an area
  might have
- extrapolating the pool containing a given area
- chaining the areas in each pool
- extrapolating the area containing a given memory address

NOTE:
Since there is a list_head structure that is used only when disposing
of the allocation (the field purge_list), two pointers are available
for other purposes until the time comes to free the allocation.
To avoid increasing the size of the vmap_area structure, instead of
using a standard doubly linked list for tracking the chain of
vmap_areas, only one pointer is spent for this purpose, in a singly
linked list, while the other is used to provide a direct connection
to the parent pool.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>

CC: Michal Hocko <mhocko@kernel.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Chintan Pandya <cpandya@codeaurora.org>
CC: Joe Perches <joe@perches.com>
CC: "Luis R. Rodriguez" <mcgrof@kernel.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Kate Stewart <kstewart@linuxfoundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Philippe Ombredanne <pombredanne@nexb.com>
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/vmalloc.h | 12 +++++++++++-
 mm/vmalloc.c            |  2 +-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..4d14a3b8089e 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -21,6 +21,9 @@ struct notifier_block;		/* in notifier.h */
 #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
 #define VM_NO_GUARD		0x00000040	/* don't add guard page */
 #define VM_KASAN		0x00000080	/* has allocated kasan shadow memory */
+#define VM_PMALLOC		0x00000100	/* pmalloc area - see docs */
+#define VM_PMALLOC_WR		0x00000200	/* pmalloc write rare area */
+#define VM_PMALLOC_PROTECTED	0x00000400	/* pmalloc protected area */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
@@ -48,7 +51,13 @@ struct vmap_area {
 	unsigned long flags;
 	struct rb_node rb_node;		/* address sorted rbtree */
 	struct list_head list;		/* address sorted list */
-	struct llist_node purge_list;	/* "lazy purge" list */
+	union {
+		struct llist_node purge_list;	/* "lazy purge" list */
+		struct {
+			struct vmap_area *next;
+			struct pmalloc_pool *pool;
+		};
+	};
 	struct vm_struct *vm;
 	struct rcu_head rcu_head;
 };
@@ -134,6 +143,7 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 					const void *caller);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
+extern struct vmap_area *find_vmap_area(unsigned long addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
 			struct page **pages);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a728fc492557..15850005fea5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -742,7 +742,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	free_vmap_area_noflush(va);
 }
 
-static struct vmap_area *find_vmap_area(unsigned long addr)
+struct vmap_area *find_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
-- 
2.17.1
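To visualize how the union is meant to be used, here is a sketch of
walking the areas of a pool (a hypothetical helper, not part of the
patch, and assuming the pmalloc_pool layout introduced in the next
patch):

```c
/* Sketch: traversing the vmap_areas chained to a pool. */
static void pool_visit_areas(struct pmalloc_pool *pool)
{
	struct vmap_area *va;

	/* pool->area points to the most recently obtained vmap_area;
	 * each area links to the previous one through ->next and back
	 * to its owning pool through ->pool.  purge_list is only used
	 * once the area is being freed, so overlaying it is safe. */
	for (va = pool->area; va; va = va->next)
		WARN_ON(va->pool != pool);
}
```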
Extension of protected memory to dynamic allocations.

Allocations are performed from "pools". A pool is a list of virtual
memory areas, in various states of protection.

Supported cases
===============

Read Only Pool
--------------
Memory is allocated from the pool, in writable state. Then it gets
written and the content of the pool is write protected; it cannot be
altered anymore. It is only possible to destroy the pool.

Auto Read Only Pool
-------------------
Same as the plain read only pool, but every time a memory area is full
and phased out, it is automatically marked as read only.

Write Rare Pool
---------------
Memory is allocated from the pool, in writable state. Then it gets
written and the content of the pool is write protected; it can be
altered only by invoking special write rare functions.

Auto Write Rare Pool
--------------------
Same as the plain write rare pool, but every time a memory area is
full and phased out, it is automatically marked as write rare.

Start Write Rare Pool
---------------------
The memory handed out is already in write rare mode and the only way
to alter it is to use write rare functions.

When a pool is destroyed, all the memory that was obtained from it is
automatically freed. This is the only way to release protected memory.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>

CC: Michal Hocko <mhocko@kernel.org>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: "Kirill A. 
Shutemov" <kirill.shutemov@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Pavel Tatashin <pasha.tatashin@oracle.com> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- include/linux/prmem.h | 220 +++++++++++++++++++++++++++++++++-- mm/Kconfig | 6 + mm/prmem.c | 263 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 482 insertions(+), 7 deletions(-) diff --git a/include/linux/prmem.h b/include/linux/prmem.h index 3ba41d76a582..26fd48410d97 100644 --- a/include/linux/prmem.h +++ b/include/linux/prmem.h @@ -7,6 +7,8 @@ * * Support for: * - statically allocated write rare data + * - dynamically allocated read only data + * - dynamically allocated write rare data */ #ifndef _LINUX_PRMEM_H @@ -22,6 +24,11 @@ #include <linux/irqflags.h> #include <linux/set_memory.h> +#define VM_PMALLOC_MASK \ + (VM_PMALLOC | VM_PMALLOC_WR | VM_PMALLOC_PROTECTED) +#define VM_PMALLOC_WR_MASK (VM_PMALLOC | VM_PMALLOC_WR) +#define VM_PMALLOC_PROTECTED_MASK (VM_PMALLOC | VM_PMALLOC_PROTECTED) + /* ============================ Write Rare ============================ */ extern const char WR_ERR_RANGE_MSG[]; @@ -45,11 +52,23 @@ static __always_inline bool __is_wr_after_init(const void *ptr, size_t size) return likely(start <= low && low < high && high <= end); } +static __always_inline bool __is_wr_pool(const void *ptr, size_t size) +{ + struct vmap_area *area; + + if (!is_vmalloc_addr(ptr)) + return false; + area = find_vmap_area((unsigned long)ptr); + return area && area->vm && (area->vm->size >= size) && + ((area->vm->flags & (VM_PMALLOC | VM_PMALLOC_WR)) == + (VM_PMALLOC | VM_PMALLOC_WR)); +} + /** * wr_memset() - sets n bytes of the destination to the c value * @dst: beginning of the memory to write to * @c: byte to replicate - * @size: amount of bytes to copy + * @n_bytes: amount of bytes to copy * * Returns true on success, false otherwise. 
*/ @@ -59,8 +78,10 @@ bool wr_memset(const void *dst, const int c, size_t n_bytes) size_t size; unsigned long flags; uintptr_t d = (uintptr_t)dst; + bool is_virt = __is_wr_after_init(dst, n_bytes); - if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) + if (WARN(!(is_virt || likely(__is_wr_pool(dst, n_bytes))), + WR_ERR_RANGE_MSG)) return false; while (n_bytes) { struct page *page; @@ -69,7 +90,10 @@ bool wr_memset(const void *dst, const int c, size_t n_bytes) uintptr_t offset_complement; local_irq_save(flags); - page = virt_to_page(d); + if (is_virt) + page = virt_to_page(d); + else + page = vmalloc_to_page((void *)d); offset = d & ~PAGE_MASK; offset_complement = PAGE_SIZE - offset; size = min(n_bytes, offset_complement); @@ -102,8 +126,10 @@ bool wr_memcpy(const void *dst, const void *src, size_t n_bytes) unsigned long flags; uintptr_t d = (uintptr_t)dst; uintptr_t s = (uintptr_t)src; + bool is_virt = __is_wr_after_init(dst, n_bytes); - if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) + if (WARN(!(is_virt || likely(__is_wr_pool(dst, n_bytes))), + WR_ERR_RANGE_MSG)) return false; while (n_bytes) { struct page *page; @@ -112,7 +138,10 @@ bool wr_memcpy(const void *dst, const void *src, size_t n_bytes) uintptr_t offset_complement; local_irq_save(flags); - page = virt_to_page(d); + if (is_virt) + page = virt_to_page(d); + else + page = vmalloc_to_page((void *)d); offset = d & ~PAGE_MASK; offset_complement = PAGE_SIZE - offset; size = (size_t)min(n_bytes, offset_complement); @@ -151,11 +180,13 @@ uintptr_t __wr_rcu_ptr(const void *dst_p_p, const void *src_p) void *base; uintptr_t offset; const size_t size = sizeof(void *); + bool is_virt = __is_wr_after_init(dst_p_p, size); - if (WARN(!__is_wr_after_init(dst_p_p, size), WR_ERR_RANGE_MSG)) + if (WARN(!(is_virt || likely(__is_wr_pool(dst_p_p, size))), + WR_ERR_RANGE_MSG)) return (uintptr_t)NULL; local_irq_save(flags); - page = virt_to_page(dst_p_p); + page = is_virt ? 
virt_to_page(dst_p_p) : vmalloc_to_page(dst_p_p); offset = (uintptr_t)dst_p_p & ~PAGE_MASK; base = vmap(&page, 1, VM_MAP, PAGE_KERNEL); if (WARN(!base, WR_ERR_PAGE_MSG)) { @@ -210,4 +241,179 @@ bool wr_ptr(const void *dst, const void *val) { return wr_memcpy(dst, &val, sizeof(val)); } + +/* ============================ Allocator ============================ */ + +#define PMALLOC_REFILL_DEFAULT (0) +#define PMALLOC_DEFAULT_REFILL_SIZE PAGE_SIZE +#define PMALLOC_ALIGN_ORDER_DEFAULT ilog2(ARCH_KMALLOC_MINALIGN) + +#define PMALLOC_RO 0x00 +#define PMALLOC_WR 0x01 +#define PMALLOC_AUTO 0x02 +#define PMALLOC_START 0x04 +#define PMALLOC_MASK (PMALLOC_WR | PMALLOC_AUTO | PMALLOC_START) + +#define PMALLOC_MODE_RO PMALLOC_RO +#define PMALLOC_MODE_WR PMALLOC_WR +#define PMALLOC_MODE_AUTO_RO (PMALLOC_RO | PMALLOC_AUTO) +#define PMALLOC_MODE_AUTO_WR (PMALLOC_WR | PMALLOC_AUTO) +#define PMALLOC_MODE_START_WR (PMALLOC_WR | PMALLOC_START) + +struct pmalloc_pool { + struct mutex mutex; + struct list_head pool_node; + struct vmap_area *area; + size_t align; + size_t refill; + size_t offset; + uint8_t mode; +}; + +/* + * The write rare functionality is fully implemented as __always_inline, + * to prevent having an internal function call that is capable of modifying + * write protected memory. + * Fully inlining the function allows the compiler to optimize away its + * interface, making it harder for an attacker to hijack it. + * This still leaves the door open to attacks that might try to reuse part + * of the code, by jumping in the middle of the function, however it can + * be mitigated by having a compiler plugin that enforces Control Flow + * Integrity (CFI). + * Any addition/modification to the write rare path must follow the same + * approach. 
+ */ + +void pmalloc_init_custom_pool(struct pmalloc_pool *pool, size_t refill, + short align_order, uint8_t mode); + +struct pmalloc_pool *pmalloc_create_custom_pool(size_t refill, + short align_order, + uint8_t mode); + +/** + * pmalloc_create_pool() - create a protectable memory pool + * @mode: can the data be altered after protection + * + * Shorthand for pmalloc_create_custom_pool() with default argument: + * * refill is set to PMALLOC_REFILL_DEFAULT + * * align_order is set to PMALLOC_ALIGN_ORDER_DEFAULT + * + * Returns: + * * pointer to the new pool - success + * * NULL - error + */ +static inline struct pmalloc_pool *pmalloc_create_pool(uint8_t mode) +{ + return pmalloc_create_custom_pool(PMALLOC_REFILL_DEFAULT, + PMALLOC_ALIGN_ORDER_DEFAULT, + mode); +} + +void *pmalloc(struct pmalloc_pool *pool, size_t size); + +/** + * pzalloc() - zero-initialized version of pmalloc() + * @pool: handle to the pool to be used for memory allocation + * @size: amount of memory (in bytes) requested + * + * Executes pmalloc(), initializing the memory requested to 0, before + * returning its address. + * + * Return: + * * pointer to the memory requested - success + * * NULL - error + */ +static inline void *pzalloc(struct pmalloc_pool *pool, size_t size) +{ + void *ptr = pmalloc(pool, size); + + if (unlikely(!ptr)) + return ptr; + if ((pool->mode & PMALLOC_MODE_START_WR) == PMALLOC_MODE_START_WR) + wr_memset(ptr, 0, size); + else + memset(ptr, 0, size); + return ptr; +} + +/** + * pmalloc_array() - array version of pmalloc() + * @pool: handle to the pool to be used for memory allocation + * @n: number of elements in the array + * @size: amount of memory (in bytes) requested for each element + * + * Executes pmalloc(), on an array. 
+ * + * Return: + * * the pmalloc result - success + * * NULL - error + */ + +static inline +void *pmalloc_array(struct pmalloc_pool *pool, size_t n, size_t size) +{ + size_t total_size = n * size; + + if (unlikely(!(n && (total_size / n == size)))) + return NULL; + return pmalloc(pool, total_size); +} + +/** + * pcalloc() - array version of pzalloc() + * @pool: handle to the pool to be used for memory allocation + * @n: number of elements in the array + * @size: amount of memory (in bytes) requested for each element + * + * Executes pzalloc() on an array. + * + * Return: + * * the pzalloc result - success + * * NULL - error + */ +static inline +void *pcalloc(struct pmalloc_pool *pool, size_t n, size_t size) +{ + size_t total_size = n * size; + + if (unlikely(!(n && (total_size / n == size)))) + return NULL; + return pzalloc(pool, total_size); +} + +/** + * pstrdup() - duplicate a string, using pmalloc() + * @pool: handle to the pool to be used for memory allocation + * @s: string to duplicate + * + * Generates a copy of the given string, allocating sufficient memory + * from the given pmalloc pool.
+ * + * Return: + * * pointer to the replica - success + * * NULL - error + */ +static inline char *pstrdup(struct pmalloc_pool *pool, const char *s) +{ + size_t len; + char *buf; + + len = strlen(s) + 1; + buf = pmalloc(pool, len); + if (unlikely(!buf)) + return buf; + if ((pool->mode & PMALLOC_MODE_START_WR) == PMALLOC_MODE_START_WR) + wr_memcpy(buf, s, len); + else + strncpy(buf, s, len); + return buf; +} + + +void pmalloc_protect_pool(struct pmalloc_pool *pool); + +void pmalloc_make_pool_ro(struct pmalloc_pool *pool); + +void pmalloc_destroy_pool(struct pmalloc_pool *pool); #endif diff --git a/mm/Kconfig b/mm/Kconfig index de64ea658716..1885f5565cbc 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -764,4 +764,10 @@ config GUP_BENCHMARK config ARCH_HAS_PTE_SPECIAL bool +config PRMEM + bool + depends on MMU + depends on ARCH_HAS_SET_MEMORY + default y + endmenu diff --git a/mm/prmem.c b/mm/prmem.c index de9258f5f29a..7dd13ea43304 100644 --- a/mm/prmem.c +++ b/mm/prmem.c @@ -6,5 +6,268 @@ * Author: Igor Stoppa <igor.stoppa@huawei.com> */ +#include <linux/printk.h> +#include <linux/init.h> +#include <linux/mm.h> +#include <linux/vmalloc.h> +#include <linux/kernel.h> +#include <linux/log2.h> +#include <linux/slab.h> +#include <linux/set_memory.h> +#include <linux/bug.h> +#include <linux/mutex.h> +#include <linux/llist.h> +#include <asm/cacheflush.h> +#include <asm/page.h> + +#include <linux/prmem.h> + const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range."; const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page."; + +static LIST_HEAD(pools_list); +static DEFINE_MUTEX(pools_mutex); + +#define MAX_ALIGN_ORDER (ilog2(sizeof(void *))) + + +/* Various helper functions. Inlined, to reduce the attack surface. 
*/ + +static __always_inline void protect_area(struct vmap_area *area) +{ + set_memory_ro(area->va_start, area->vm->nr_pages); + area->vm->flags |= VM_PMALLOC_PROTECTED_MASK; +} + +static __always_inline bool empty(struct pmalloc_pool *pool) +{ + return unlikely(!pool->area); +} + +/* Allocation from a protected area is allowed only for a START_WR pool. */ +static __always_inline bool unwritable(struct pmalloc_pool *pool) +{ + return (pool->area->vm->flags & VM_PMALLOC_PROTECTED) && + !((pool->area->vm->flags & VM_PMALLOC_WR) && + (pool->mode & PMALLOC_START)); +} + +static __always_inline +bool exhausted(struct pmalloc_pool *pool, size_t size) +{ + size_t space_before; + size_t space_after; + + space_before = round_down(pool->offset, pool->align); + space_after = pool->offset - space_before; + return unlikely(space_after < size && space_before < size); +} + +static __always_inline +bool space_needed(struct pmalloc_pool *pool, size_t size) +{ + return empty(pool) || unwritable(pool) || exhausted(pool, size); +} + +/** + * pmalloc_init_custom_pool() - initialize a protectable memory pool + * @pool: the pointer to the struct pmalloc_pool to initialize + * @refill: the minimum size to allocate when in need of more memory. + * It will be rounded up to a multiple of PAGE_SIZE + * The value of 0 gives the default amount of PAGE_SIZE. + * @align_order: log2 of the alignment to use when allocating memory + * Negative values give ARCH_KMALLOC_MINALIGN + * @mode: whether the data is RO or rare-write, and whether it should be + * provided already in protected mode. + * The value is one of: + * PMALLOC_MODE_RO, PMALLOC_MODE_WR, PMALLOC_MODE_AUTO_RO + * PMALLOC_MODE_AUTO_WR, PMALLOC_MODE_START_WR + * + * Initializes an empty memory pool, for allocation of protectable + * memory. Memory will be allocated upon request (through pmalloc).
+ */ +void pmalloc_init_custom_pool(struct pmalloc_pool *pool, size_t refill, + short align_order, uint8_t mode) +{ + mutex_init(&pool->mutex); + pool->area = NULL; + if (align_order < 0) + pool->align = ARCH_KMALLOC_MINALIGN; + else + pool->align = 1UL << align_order; + pool->refill = refill ? PAGE_ALIGN(refill) : + PMALLOC_DEFAULT_REFILL_SIZE; + mode &= PMALLOC_MASK; + if (mode & PMALLOC_START) + mode |= PMALLOC_WR; + pool->mode = mode & PMALLOC_MASK; + pool->offset = 0; + mutex_lock(&pools_mutex); + list_add(&pool->pool_node, &pools_list); + mutex_unlock(&pools_mutex); +} +EXPORT_SYMBOL(pmalloc_init_custom_pool); + +/** + * pmalloc_create_custom_pool() - create a new protectable memory pool + * @refill: the minimum size to allocate when in need of more memory. + * It will be rounded up to a multiple of PAGE_SIZE + * The value of 0 gives the default amount of PAGE_SIZE. + * @align_order: log2 of the alignment to use when allocating memory + * Negative values give ARCH_KMALLOC_MINALIGN + * @mode: can the data be altered after protection + * + * Creates a new (empty) memory pool for allocation of protectable + * memory. Memory will be allocated upon request (through pmalloc). + * + * Return: + * * pointer to the new pool - success + * * NULL - error + */ +struct pmalloc_pool *pmalloc_create_custom_pool(size_t refill, + short align_order, + uint8_t mode) +{ + struct pmalloc_pool *pool; + + pool = kmalloc(sizeof(struct pmalloc_pool), GFP_KERNEL); + if (WARN(!pool, "Could not allocate pool meta data.")) + return NULL; + pmalloc_init_custom_pool(pool, refill, align_order, mode); + return pool; +} +EXPORT_SYMBOL(pmalloc_create_custom_pool); + +static int grow(struct pmalloc_pool *pool, size_t min_size) +{ + void *addr; + struct vmap_area *new_area; + unsigned long size; + uint32_t tag_mask; + + size = (min_size > pool->refill) ? 
min_size : pool->refill; + addr = vmalloc(size); + if (WARN(!addr, "Failed to allocate %lu bytes", PAGE_ALIGN(size))) + return -ENOMEM; + + new_area = find_vmap_area((uintptr_t)addr); + tag_mask = VM_PMALLOC; + if (pool->mode & PMALLOC_WR) + tag_mask |= VM_PMALLOC_WR; + new_area->vm->flags |= (tag_mask & VM_PMALLOC_MASK); + new_area->pool = pool; + if (pool->mode & PMALLOC_START) + protect_area(new_area); + if (pool->mode & PMALLOC_AUTO && !empty(pool)) + protect_area(pool->area); + /* The area size backed by pages, excluding the guard page. */ + pool->offset = new_area->vm->nr_pages * PAGE_SIZE; + new_area->next = pool->area; + pool->area = new_area; + return 0; +} + +/** + * pmalloc() - allocate protectable memory from a pool + * @pool: handle to the pool to be used for memory allocation + * @size: amount of memory (in bytes) requested + * + * Allocates memory from a pool. + * If needed, the pool will automatically allocate enough memory to + * either satisfy the request or meet the "refill" parameter received + * upon creation. + * New allocations can happen even when the memory currently in the pool + * is already write protected. + * Allocation happens with the pool mutex held, which guarantees + * exclusive write access to both the pool structure and the list of + * vmap_areas.
+ * + * Return: + * * pointer to the memory requested - success + * * NULL - error + */ +void *pmalloc(struct pmalloc_pool *pool, size_t size) +{ + void *retval = NULL; + + mutex_lock(&pool->mutex); + if (unlikely(space_needed(pool, size)) && + unlikely(grow(pool, size) != 0)) + goto error; + pool->offset = round_down(pool->offset - size, pool->align); + retval = (void *)(pool->area->va_start + pool->offset); +error: + mutex_unlock(&pool->mutex); + return retval; +} +EXPORT_SYMBOL(pmalloc); + +/** + * pmalloc_protect_pool() - write-protects the memory in the pool + * @pool: the pool associated to the memory to write-protect + * + * Write-protects all the memory areas currently assigned to the pool + * that are still unprotected. + * This does not prevent further allocation of additional memory, that + * can be initialized and protected. + * The catch is that protecting a pool will make unavailable whatever + * free memory it might still contain. + * Successive allocations will grab more free pages. + */ +void pmalloc_protect_pool(struct pmalloc_pool *pool) +{ + struct vmap_area *area; + + mutex_lock(&pool->mutex); + for (area = pool->area; area; area = area->next) + protect_area(area); + mutex_unlock(&pool->mutex); +} +EXPORT_SYMBOL(pmalloc_protect_pool); + + +/** + * pmalloc_make_pool_ro() - forces a pool to become read-only + * @pool: the pool associated to the memory to make ro + * + * Drops the possibility to perform controlled writes from both the pool + * metadata and all the vm_area structures associated to the pool. + * In case the pool was configured to automatically protect memory when + * allocating it, the configuration is dropped. 
+ */ +void pmalloc_make_pool_ro(struct pmalloc_pool *pool) +{ + struct vmap_area *area; + + mutex_lock(&pool->mutex); + pool->mode &= ~(PMALLOC_WR | PMALLOC_START); + for (area = pool->area; area; area = area->next) + protect_area(area); + mutex_unlock(&pool->mutex); +} +EXPORT_SYMBOL(pmalloc_make_pool_ro); + +/** + * pmalloc_destroy_pool() - destroys a pool and all the associated memory + * @pool: the pool to destroy + * + * All the memory associated to the pool will be freed, including the + * metadata used for the pool. + */ +void pmalloc_destroy_pool(struct pmalloc_pool *pool) +{ + struct vmap_area *area; + + mutex_lock(&pools_mutex); + list_del(&pool->pool_node); + mutex_unlock(&pools_mutex); + while (pool->area) { + area = pool->area; + pool->area = area->next; + set_memory_rw(area->va_start, area->vm->nr_pages); + area->vm->flags &= ~VM_PMALLOC_MASK; + vfree((void *)area->va_start); + } + kfree(pool); +} +EXPORT_SYMBOL(pmalloc_destroy_pool); -- 2.17.1
Wrappers around the basic write rare functionality, covering several common data types found in the kernel, which allow the new values to be specified as immediates, such as constants and defines. Note: The list is not complete and could be expanded. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Michal Hocko <mhocko@kernel.org> CC: Vlastimil Babka <vbabka@suse.cz> CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Pavel Tatashin <pasha.tatashin@oracle.com> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- MAINTAINERS | 1 + include/linux/prmemextra.h | 133 +++++++++++++++++++++++++++++++++++++ 2 files changed, 134 insertions(+) create mode 100644 include/linux/prmemextra.h diff --git a/MAINTAINERS b/MAINTAINERS index e566c5d09faf..df7221eca160 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9459,6 +9459,7 @@ M: Igor Stoppa <igor.stoppa@gmail.com> L: kernel-hardening@lists.openwall.com S: Maintained F: include/linux/prmem.h +F: include/linux/prmemextra.h F: mm/prmem.c MEMORY MANAGEMENT diff --git a/include/linux/prmemextra.h b/include/linux/prmemextra.h new file mode 100644 index 000000000000..36995717720e --- /dev/null +++ b/include/linux/prmemextra.h @@ -0,0 +1,133 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * prmemextra.h: Shorthands for write rare of basic data types + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + * + */ + +#ifndef _LINUX_PRMEMEXTRA_H +#define _LINUX_PRMEMEXTRA_H + +#include <linux/prmem.h> + +/** + * wr_char - alters a char in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise.
+ */ +static __always_inline +bool wr_char(const char *dst, const char val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_short - alters a short in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_short(const short *dst, const short val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_ushort - alters an unsigned short in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_ushort(const unsigned short *dst, const unsigned short val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_int - alters an int in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_int(const int *dst, const int val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_uint - alters an unsigned int in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_uint(const unsigned int *dst, const unsigned int val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_long - alters a long in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_long(const long *dst, const long val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_ulong - alters an unsigned long in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. 
+ */ +static __always_inline +bool wr_ulong(const unsigned long *dst, const unsigned long val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_longlong - alters a long long in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_longlong(const long long *dst, const long long val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +/** + * wr_ulonglong - alters an unsigned long long in write rare memory + * @dst: target for write + * @val: new value + * + * Returns true on success, false otherwise. + */ +static __always_inline +bool wr_ulonglong(const unsigned long long *dst, + const unsigned long long val) +{ + return wr_memcpy(dst, &val, sizeof(val)); +} + +#endif -- 2.17.1
The test cases verify the various interfaces offered by both prmem.h and prmemextra.h The tests avoid triggering crashes, by not performing actions that would be treated as illegal. That part is handled in the lkdtm patch. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Michal Hocko <mhocko@kernel.org> CC: Vlastimil Babka <vbabka@suse.cz> CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Pavel Tatashin <pasha.tatashin@oracle.com> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- MAINTAINERS | 2 + mm/Kconfig.debug | 9 + mm/Makefile | 1 + mm/test_pmalloc.c | 633 +++++++++++++++++++++++++++++++++++++++++++ mm/test_write_rare.c | 236 ++++++++++++++++ 5 files changed, 881 insertions(+) create mode 100644 mm/test_pmalloc.c create mode 100644 mm/test_write_rare.c diff --git a/MAINTAINERS b/MAINTAINERS index df7221eca160..ea979a5a9ec9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9461,6 +9461,8 @@ S: Maintained F: include/linux/prmem.h F: include/linux/prmemextra.h F: mm/prmem.c +F: mm/test_write_rare.c +F: mm/test_pmalloc.c MEMORY MANAGEMENT L: linux-mm@kvack.org diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug index 9a7b8b049d04..57de5b3c0bae 100644 --- a/mm/Kconfig.debug +++ b/mm/Kconfig.debug @@ -94,3 +94,12 @@ config DEBUG_RODATA_TEST depends on STRICT_KERNEL_RWX ---help--- This option enables a testcase for the setting rodata read-only. + +config DEBUG_PRMEM_TEST + tristate "Run self test for protected memory" + depends on STRICT_KERNEL_RWX + select PRMEM + default n + help + Tries to verify that the memory protection works correctly and that + the memory is effectively protected. 
diff --git a/mm/Makefile b/mm/Makefile index 215c6a6d7304..93b503d4659f 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -65,6 +65,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o obj-$(CONFIG_PRMEM) += prmem.o +obj-$(CONFIG_DEBUG_PRMEM_TEST) += test_write_rare.o test_pmalloc.o obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_PAGE_POISONING) += page_poison.o obj-$(CONFIG_SLAB) += slab.o diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c new file mode 100644 index 000000000000..f9ee8fb29eea --- /dev/null +++ b/mm/test_pmalloc.c @@ -0,0 +1,633 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * test_pmalloc.c + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/string.h> +#include <linux/bug.h> +#include <linux/prmemextra.h> + +#ifdef pr_fmt +#undef pr_fmt +#endif + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#define SIZE_1 (PAGE_SIZE * 3) +#define SIZE_2 1000 + +static const char MSG_NO_POOL[] = "Cannot allocate memory for the pool."; +static const char MSG_NO_PMEM[] = "Cannot allocate memory from the pool."; + +#define pr_success(test_name) \ + pr_info(test_name " test passed") + +/* --------------- tests the basic life-cycle of a pool --------------- */ + +static bool is_address_protected(void *p) +{ + struct page *page; + struct vmap_area *area; + + if (unlikely(!is_vmalloc_addr(p))) + return false; + page = vmalloc_to_page(p); + if (unlikely(!page)) + return false; + wmb(); /* Flush changes to the page table - is it needed? 
*/ + area = find_vmap_area((uintptr_t)p); + if (unlikely((!area) || (!area->vm) || + ((area->vm->flags & VM_PMALLOC_PROTECTED_MASK) != + VM_PMALLOC_PROTECTED_MASK))) + return false; + return true; +} + +static bool create_and_destroy_pool(void) +{ + static struct pmalloc_pool *pool; + + pool = pmalloc_create_pool(PMALLOC_MODE_RO); + if (WARN(!pool, MSG_NO_POOL)) + return false; + pmalloc_destroy_pool(pool); + pr_success("pool creation and destruction"); + return true; +} + +/* verifies that it's possible to allocate from the pool */ +static bool test_alloc(void) +{ + static struct pmalloc_pool *pool; + static void *p; + + pool = pmalloc_create_pool(PMALLOC_MODE_RO); + if (WARN(!pool, MSG_NO_POOL)) + return false; + p = pmalloc(pool, SIZE_1 - 1); + pmalloc_destroy_pool(pool); + if (WARN(!p, MSG_NO_PMEM)) + return false; + pr_success("allocation capability"); + return true; +} + +/* ----------------------- tests self protection ----------------------- */ + +static bool test_auto_ro(void) +{ + struct pmalloc_pool *pool; + int *first_chunk; + int *second_chunk; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_RO); + if (WARN(!pool, MSG_NO_POOL)) + return false; + first_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE); + if (WARN(!first_chunk, MSG_NO_PMEM)) + goto error; + second_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE); + if (WARN(!second_chunk, MSG_NO_PMEM)) + goto error; + if (WARN(!is_address_protected(first_chunk), + "Failed to automatically write protect exhausted vmarea")) + goto error; + pr_success("AUTO_RO"); + retval = true; +error: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_auto_wr(void) +{ + struct pmalloc_pool *pool; + int *first_chunk; + int *second_chunk; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + first_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE); + if (WARN(!first_chunk, 
MSG_NO_PMEM)) + goto error; + second_chunk = (int *)pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE); + if (WARN(!second_chunk, MSG_NO_PMEM)) + goto error; + if (WARN(!is_address_protected(first_chunk), + "Failed to automatically write protect exhausted vmarea")) + goto error; + pr_success("AUTO_WR"); + retval = true; +error: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_start_wr(void) +{ + struct pmalloc_pool *pool; + int *chunks[2]; + bool retval = false; + int i; + + pool = pmalloc_create_pool(PMALLOC_MODE_START_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + for (i = 0; i < 2; i++) { + chunks[i] = (int *)pmalloc(pool, 1); + if (WARN(!chunks[i], MSG_NO_PMEM)) + goto error; + if (WARN(!is_address_protected(chunks[i]), + "vmarea was not protected from the start")) + goto error; + } + if (WARN(vmalloc_to_page(chunks[0]) != vmalloc_to_page(chunks[1]), + "START_WR: mostly empty vmap area not reused")) + goto error; + pr_success("START_WR"); + retval = true; +error: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_self_protection(void) +{ + if (WARN(!(test_auto_ro() && + test_auto_wr() && + test_start_wr()), + "self protection tests failed")) + return false; + pr_success("self protection"); + return true; +} + +/* ----------------- tests basic write rare functions ----------------- */ + +#define INSERT_OFFSET (PAGE_SIZE * 3 / 2) +#define INSERT_SIZE (PAGE_SIZE * 2) +#define REGION_SIZE (PAGE_SIZE * 5) +static bool test_wr_memset(void) +{ + struct pmalloc_pool *pool; + char *region; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_START_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + region = pzalloc(pool, REGION_SIZE); + if (WARN(!region, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < REGION_SIZE; i++) + if (WARN(region[i], "Failed to memset wr memory")) + goto destroy_pool; + if (WARN(!wr_memset(region + INSERT_OFFSET, 1, INSERT_SIZE), + "wr_memset failed"))
+ goto destroy_pool; + for (i = 0; i < REGION_SIZE; i++) + if (i >= INSERT_OFFSET && + i < (INSERT_SIZE + INSERT_OFFSET)) { + if (WARN(!region[i], + "Failed to alter target area")) + goto destroy_pool; + } else { + if (WARN(region[i] != 0, + "Unexpected alteration outside region")) + goto destroy_pool; + } + retval = true; + pr_success("wr_memset"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_strdup(void) +{ + const char src[] = "Some text for testing pstrdup()"; + struct pmalloc_pool *pool; + char *dst; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + dst = pstrdup(pool, src); + if (WARN(!dst || strcmp(src, dst), "pmalloc wr strdup failed")) + goto destroy_pool; + retval = true; + pr_success("pmalloc wr strdup"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +/* Verify write rare across multiple pages, unaligned to PAGE_SIZE. */ +static bool test_wr_copy(void) +{ + struct pmalloc_pool *pool; + char *region; + char *mod; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + region = pzalloc(pool, REGION_SIZE); + if (WARN(!region, MSG_NO_PMEM)) + goto destroy_pool; + mod = vmalloc(INSERT_SIZE); + if (WARN(!mod, "Failed to allocate memory from vmalloc")) + goto destroy_pool; + memset(mod, 0xA5, INSERT_SIZE); + pmalloc_protect_pool(pool); + if (WARN(!wr_memcpy(region + INSERT_OFFSET, mod, INSERT_SIZE), + "wr_copy failed")) + goto free_mod; + + for (i = 0; i < REGION_SIZE; i++) + if (i >= INSERT_OFFSET && + i < (INSERT_SIZE + INSERT_OFFSET)) { + if (WARN(region[i] != (char)0xA5, + "Failed to alter target area")) + goto free_mod; + } else { + if (WARN(region[i] != 0, + "Unexpected alteration outside region")) + goto free_mod; + } + retval = true; + pr_success("wr_copy"); +free_mod: + vfree(mod); +destroy_pool: +
pmalloc_destroy_pool(pool); + return retval; +} + +/* ----------------- tests specialized write actions ------------------- */ + +#define TEST_ARRAY_SIZE 5 +#define TEST_ARRAY_TARGET (TEST_ARRAY_SIZE / 2) + +static bool test_wr_char(void) +{ + struct pmalloc_pool *pool; + char *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(char) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = (char)0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_char(array + TEST_ARRAY_TARGET, (char)0x5A), + "Failed to alter char variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? + (char)0x5A : (char)0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_char"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_short(void) +{ + struct pmalloc_pool *pool; + short *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(short) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = (short)0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_short(array + TEST_ARRAY_TARGET, (short)0x5A), + "Failed to alter short variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 
+ (short)0x5A : (short)0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_short"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_ushort(void) +{ + struct pmalloc_pool *pool; + unsigned short *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(unsigned short) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = (unsigned short)0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_ushort(array + TEST_ARRAY_TARGET, + (unsigned short)0x5A), + "Failed to alter unsigned short variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? + (unsigned short)0x5A : + (unsigned short)0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_ushort"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_int(void) +{ + struct pmalloc_pool *pool; + int *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(int) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = 0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_int(array + TEST_ARRAY_TARGET, 0x5A), + "Failed to alter int variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 
0x5A : 0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_int"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_uint(void) +{ + struct pmalloc_pool *pool; + unsigned int *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(unsigned int) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = 0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_uint(array + TEST_ARRAY_TARGET, 0x5A), + "Failed to alter unsigned int variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_uint"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_long(void) +{ + struct pmalloc_pool *pool; + long *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(long) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = 0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_long(array + TEST_ARRAY_TARGET, 0x5A), + "Failed to alter long variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 
0x5A : 0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_long"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_ulong(void) +{ + struct pmalloc_pool *pool; + unsigned long *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(unsigned long) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = 0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_ulong(array + TEST_ARRAY_TARGET, 0x5A), + "Failed to alter unsigned long variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_ulong"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_longlong(void) +{ + struct pmalloc_pool *pool; + long long *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(long long) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = 0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_longlong(array + TEST_ARRAY_TARGET, 0x5A), + "Failed to alter long long variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ?
0x5A : 0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_longlong"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_ulonglong(void) +{ + struct pmalloc_pool *pool; + unsigned long long *array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(unsigned long long) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = 0xA5; + pmalloc_protect_pool(pool); + if (WARN(!wr_ulonglong(array + TEST_ARRAY_TARGET, 0x5A), + "Failed to alter unsigned long long variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 0x5A : 0xA5), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_ulonglong"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_wr_ptr(void) +{ + struct pmalloc_pool *pool; + int **array; + unsigned int i; + bool retval = false; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return false; + array = pmalloc(pool, sizeof(int *) * TEST_ARRAY_SIZE); + if (WARN(!array, MSG_NO_PMEM)) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + array[i] = NULL; + pmalloc_protect_pool(pool); + if (WARN(!wr_ptr(array + TEST_ARRAY_TARGET, array), + "Failed to alter ptr variable")) + goto destroy_pool; + for (i = 0; i < TEST_ARRAY_SIZE; i++) + if (WARN(array[i] != (i == TEST_ARRAY_TARGET ? 
+ (void *)array : NULL), + "Unexpected value in test array.")) + goto destroy_pool; + retval = true; + pr_success("wr_ptr"); +destroy_pool: + pmalloc_destroy_pool(pool); + return retval; + +} + +static bool test_specialized_wrs(void) +{ + if (WARN(!(test_wr_char() && + test_wr_short() && + test_wr_ushort() && + test_wr_int() && + test_wr_uint() && + test_wr_long() && + test_wr_ulong() && + test_wr_longlong() && + test_wr_ulonglong() && + test_wr_ptr()), + "specialized write rare failed")) + return false; + pr_success("specialized write rare"); + return true; + +} + +/* + * test_pmalloc() -main entry point for running the test cases + */ +static int __init test_pmalloc_init_module(void) +{ + if (WARN(!(create_and_destroy_pool() && + test_alloc() && + test_self_protection() && + test_wr_memset() && + test_wr_strdup() && + test_wr_copy() && + test_specialized_wrs()), + "protected memory allocator test failed")) + return -EFAULT; + pr_success("protected memory allocator"); + return 0; +} + +module_init(test_pmalloc_init_module); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Igor Stoppa <igor.stoppa@huawei.com>"); +MODULE_DESCRIPTION("Test module for pmalloc."); diff --git a/mm/test_write_rare.c b/mm/test_write_rare.c new file mode 100644 index 000000000000..e19473bb319b --- /dev/null +++ b/mm/test_write_rare.c @@ -0,0 +1,236 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * test_write_rare.c + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + * + * Caveat: the tests which perform modifications are run *during* init, so + * the memory they use could be still altered through a direct write + * operation. But the purpose of these tests is to confirm that the + * modification through remapping works correctly. This doesn't depend on + * the read/write status of the original mapping. 
+ */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/bug.h> +#include <linux/prmemextra.h> + +#ifdef pr_fmt +#undef pr_fmt +#endif + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#define pr_success(test_name) \ + pr_info(test_name " test passed") + +static int scalar __wr_after_init = 0xA5A5; + +/* The section must occupy a non-zero number of whole pages */ +static bool test_alignment(void) +{ + size_t pstart = (size_t)&__start_wr_after_init; + size_t pend = (size_t)&__end_wr_after_init; + + if (WARN((pstart & ~PAGE_MASK) || (pend & ~PAGE_MASK) || + (pstart >= pend), "Boundaries test failed.")) + return false; + pr_success("Boundaries"); + return true; +} + +/* Alter a scalar value */ +static bool test_simple_write(void) +{ + int new_val = 0x5A5A; + + if (WARN(!__is_wr_after_init(&scalar, sizeof(scalar)), + "The __wr_after_init modifier did NOT work.")) + return false; + + if (WARN(!wr(&scalar, &new_val) || scalar != new_val, + "Scalar write rare test failed")) + return false; + + pr_success("Scalar write rare"); + return true; +} + +#define LARGE_SIZE (PAGE_SIZE * 5) +#define CHANGE_SIZE (PAGE_SIZE * 2) +#define CHANGE_OFFSET (PAGE_SIZE / 2) + +static char large[LARGE_SIZE] __wr_after_init; + + +/* Alter data across multiple pages */ +static bool test_cross_page_write(void) +{ + unsigned int i; + char *src; + bool check; + + src = vmalloc(PAGE_SIZE * 2); + if (WARN(!src, "could not allocate memory")) + return false; + + for (i = 0; i < LARGE_SIZE; i++) + large[i] = 0xA5; + + for (i = 0; i < CHANGE_SIZE; i++) + src[i] = 0x5A; + + check = wr_memcpy(large + CHANGE_OFFSET, src, CHANGE_SIZE); + vfree(src); + if (WARN(!check, "The wr_memcpy() failed")) + return false; + + for (i = CHANGE_OFFSET; i < CHANGE_OFFSET + CHANGE_SIZE; i++) + if (WARN(large[i] != 0x5A, + "Cross-page write rare test failed")) + return false; + + pr_success("Cross-page write rare"); + return true; +} + +static bool test_memsetting(void) +{ + 
unsigned int i; + + wr_memset(large, 0, LARGE_SIZE); + for (i = 0; i < LARGE_SIZE; i++) + if (WARN(large[i], "Failed to reset memory")) + return false; + wr_memset(large + CHANGE_OFFSET, 1, CHANGE_SIZE); + for (i = 0; i < CHANGE_OFFSET; i++) + if (WARN(large[i], "Failed to set memory")) + return false; + for (i = CHANGE_OFFSET; i < CHANGE_OFFSET + CHANGE_SIZE; i++) + if (WARN(!large[i], "Failed to set memory")) + return false; + for (i = CHANGE_OFFSET + CHANGE_SIZE; i < LARGE_SIZE; i++) + if (WARN(large[i], "Failed to set memory")) + return false; + pr_success("Memsetting"); + return true; +} + +#define INIT_VAL 1 +#define END_VAL 4 + +/* Various tests for the shorthands provided for standard types. */ +static char char_var __wr_after_init = INIT_VAL; +static bool test_char(void) +{ + return wr_char(&char_var, END_VAL) && char_var == END_VAL; +} + +static short short_var __wr_after_init = INIT_VAL; +static bool test_short(void) +{ + return wr_short(&short_var, END_VAL) && + short_var == END_VAL; +} + +static unsigned short ushort_var __wr_after_init = INIT_VAL; +static bool test_ushort(void) +{ + return wr_ushort(&ushort_var, END_VAL) && + ushort_var == END_VAL; +} + +static int int_var __wr_after_init = INIT_VAL; +static bool test_int(void) +{ + return wr_int(&int_var, END_VAL) && + int_var == END_VAL; +} + +static unsigned int uint_var __wr_after_init = INIT_VAL; +static bool test_uint(void) +{ + return wr_uint(&uint_var, END_VAL) && + uint_var == END_VAL; +} + +static long long_var __wr_after_init = INIT_VAL; +static bool test_long(void) +{ + return wr_long(&long_var, END_VAL) && + long_var == END_VAL; +} + +static unsigned long ulong_var __wr_after_init = INIT_VAL; +static bool test_ulong(void) +{ + return wr_ulong(&ulong_var, END_VAL) && + ulong_var == END_VAL; +} + +static long long longlong_var __wr_after_init = INIT_VAL; +static bool test_longlong(void) +{ + return wr_longlong(&longlong_var, END_VAL) && + longlong_var == END_VAL; +} + +static unsigned long 
long ulonglong_var __wr_after_init = INIT_VAL; +static bool test_ulonglong(void) +{ + return wr_ulonglong(&ulonglong_var, END_VAL) && + ulonglong_var == END_VAL; +} + +static int referred_value = INIT_VAL; +static int *reference __wr_after_init; +static bool test_ptr(void) +{ + return wr_ptr(&reference, &referred_value) && + reference == &referred_value; +} + +static int *rcu_ptr __wr_after_init __aligned(sizeof(void *)); +static bool test_rcu_ptr(void) +{ + uintptr_t addr = wr_rcu_assign_pointer(rcu_ptr, &referred_value); + + return (addr == (uintptr_t)&referred_value) && + referred_value == *(int *)addr; +} + +static bool test_specialized_write_rare(void) +{ + if (WARN(!(test_char() && test_short() && + test_ushort() && test_int() && + test_uint() && test_long() && test_ulong() && + test_longlong() && test_ulonglong() && + test_ptr() && test_rcu_ptr()), + "Specialized write rare test failed")) + return false; + pr_success("Specialized write rare"); + return true; +} + +static int __init test_static_wr_init_module(void) +{ + if (WARN(!(test_alignment() && + test_simple_write() && + test_cross_page_write() && + test_memsetting() && + test_specialized_write_rare()), + "static rare-write test failed")) + return -EFAULT; + pr_success("static write_rare"); + return 0; +} + +module_init(test_static_wr_init_module); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Igor Stoppa <igor.stoppa@huawei.com>"); +MODULE_DESCRIPTION("Test module for static write rare."); -- 2.17.1
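The remapping that these tests exercise can be illustrated with a userspace analogue: the same backing memory is mapped twice, once read-only (the protected view everyone uses) and once writable (the short-lived alternate mapping that wr_memcpy() and friends create through the MMU). This is only a conceptual sketch under that analogy; wr_demo() and the tmpfile backing are inventions for the example, not part of the patchset:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Userspace model of the write-rare mechanism: the "protected" view of
 * the data is mapped read-only; the update goes through a second,
 * writable mapping of the same backing memory, which mimics the
 * temporary alternate mapping used by wr_memcpy().
 */
int wr_demo(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	FILE *backing = tmpfile();	/* stands in for the physical page */
	volatile int *ro_view;
	int *wr_alias;
	int fd, val;

	if (!backing)
		return -1;
	fd = fileno(backing);
	if (ftruncate(fd, psize))
		return -1;

	/* The mapping the rest of the system sees: read-only. */
	ro_view = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
	/* The short-lived alternate mapping: writable. */
	wr_alias = mmap(NULL, psize, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	if (ro_view == MAP_FAILED || wr_alias == MAP_FAILED)
		return -1;

	*wr_alias = 42;			/* the "rare write" */
	munmap(wr_alias, psize);	/* tear down the alternate mapping */

	val = *ro_view;		/* the change is visible via the R/O view */
	munmap((void *)ro_view, psize);
	fclose(backing);
	return val;
}
```

Writing directly through the read-only view would fault instead, which is the behaviour the lkdtm tests verify on the kernel side.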
Various cases meant to verify that illegal operations on protected memory will either BUG() or WARN(). The test cases fall into 2 main categories: - trying to overwrite (directly) something that is write protected - trying to use write rare functions on something that is not write rare Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Kees Cook <keescook@chromium.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Arnd Bergmann <arnd@arndb.de> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- drivers/misc/lkdtm/core.c | 13 ++ drivers/misc/lkdtm/lkdtm.h | 13 ++ drivers/misc/lkdtm/perms.c | 248 +++++++++++++++++++++++++++++++++++++ 3 files changed, 274 insertions(+) diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c index 2154d1bfd18b..41a3ba16bc57 100644 --- a/drivers/misc/lkdtm/core.c +++ b/drivers/misc/lkdtm/core.c @@ -155,6 +155,19 @@ static const struct crashtype crashtypes[] = { CRASHTYPE(ACCESS_USERSPACE), CRASHTYPE(WRITE_RO), CRASHTYPE(WRITE_RO_AFTER_INIT), + CRASHTYPE(WRITE_WR_AFTER_INIT), + CRASHTYPE(WRITE_WR_AFTER_INIT_ON_RO_AFTER_INIT), + CRASHTYPE(WRITE_WR_AFTER_INIT_ON_CONST), +#ifdef CONFIG_PRMEM + CRASHTYPE(WRITE_RO_PMALLOC), + CRASHTYPE(WRITE_AUTO_RO_PMALLOC), + CRASHTYPE(WRITE_WR_PMALLOC), + CRASHTYPE(WRITE_AUTO_WR_PMALLOC), + CRASHTYPE(WRITE_START_WR_PMALLOC), + CRASHTYPE(WRITE_WR_PMALLOC_ON_RO_PMALLOC), + CRASHTYPE(WRITE_WR_PMALLOC_ON_CONST), + CRASHTYPE(WRITE_WR_PMALLOC_ON_RO_AFT_INIT), +#endif CRASHTYPE(WRITE_KERN), CRASHTYPE(REFCOUNT_INC_OVERFLOW), CRASHTYPE(REFCOUNT_ADD_OVERFLOW), diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h index 9e513dcfd809..08368c4545f7 100644 --- a/drivers/misc/lkdtm/lkdtm.h +++ b/drivers/misc/lkdtm/lkdtm.h @@ -38,6 +38,19 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void); void __init lkdtm_perms_init(void); void lkdtm_WRITE_RO(void); void lkdtm_WRITE_RO_AFTER_INIT(void); +void lkdtm_WRITE_WR_AFTER_INIT(void); +void lkdtm_WRITE_WR_AFTER_INIT_ON_RO_AFTER_INIT(void); 
+void lkdtm_WRITE_WR_AFTER_INIT_ON_CONST(void); +#ifdef CONFIG_PRMEM +void lkdtm_WRITE_RO_PMALLOC(void); +void lkdtm_WRITE_AUTO_RO_PMALLOC(void); +void lkdtm_WRITE_WR_PMALLOC(void); +void lkdtm_WRITE_AUTO_WR_PMALLOC(void); +void lkdtm_WRITE_START_WR_PMALLOC(void); +void lkdtm_WRITE_WR_PMALLOC_ON_RO_PMALLOC(void); +void lkdtm_WRITE_WR_PMALLOC_ON_CONST(void); +void lkdtm_WRITE_WR_PMALLOC_ON_RO_AFT_INIT(void); +#endif void lkdtm_WRITE_KERN(void); void lkdtm_EXEC_DATA(void); void lkdtm_EXEC_STACK(void); diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c index 53b85c9d16b8..3c14fd4d90ac 100644 --- a/drivers/misc/lkdtm/perms.c +++ b/drivers/misc/lkdtm/perms.c @@ -9,6 +9,7 @@ #include <linux/vmalloc.h> #include <linux/mman.h> #include <linux/uaccess.h> +#include <linux/prmemextra.h> #include <asm/cacheflush.h> /* Whether or not to fill the target memory area with do_nothing(). */ @@ -27,6 +28,10 @@ static const unsigned long rodata = 0xAA55AA55; /* This is marked __ro_after_init, so it should ultimately be .rodata. */ static unsigned long ro_after_init __ro_after_init = 0x55AA5500; +/* This is marked __wr_after_init, so it should be in .rodata. */ +static +unsigned long wr_after_init __wr_after_init = 0x55AA5500; + /* * This just returns to the caller. It is designed to be copied into * non-executable memory regions. @@ -104,6 +109,247 @@ void lkdtm_WRITE_RO_AFTER_INIT(void) *ptr ^= 0xabcd1234; } +void lkdtm_WRITE_WR_AFTER_INIT(void) +{ + unsigned long *ptr = &wr_after_init; + + /* + * Verify we were written to during init. Since an Oops + * is considered a "success", a failure is to just skip the + * real test. + */ + if ((*ptr & 0xAA) != 0xAA) { + pr_info("%p was NOT written during init!?\n", ptr); + return; + } + + pr_info("attempting bad wr_after_init write at %p\n", ptr); + *ptr ^= 0xabcd1234; +} + +#define INIT_VAL 0x5A +#define END_VAL 0xA5 + +/* Verify that write rare will not work against read-only memory. 
*/ +static int ro_after_init_data __ro_after_init = INIT_VAL; +void lkdtm_WRITE_WR_AFTER_INIT_ON_RO_AFTER_INIT(void) +{ + pr_info("attempting illegal write rare to __ro_after_init"); + if (wr_int(&ro_after_init_data, END_VAL) || + ro_after_init_data == END_VAL) + pr_info("Unexpected successful write to __ro_after_init"); +} + +/* + * "volatile" to force the compiler to not optimize away the reading back. + * Is there a better way to do it, than using volatile? + */ +static volatile const int const_data = INIT_VAL; +void lkdtm_WRITE_WR_AFTER_INIT_ON_CONST(void) +{ + pr_info("attempting illegal write rare to const data"); + if (wr_int((int *)&const_data, END_VAL) || const_data == END_VAL) + pr_info("Unexpected successful write to const memory"); +} + +#ifdef CONFIG_PRMEM + +#define MSG_NO_POOL "Cannot allocate memory for the pool." +#define MSG_NO_PMEM "Cannot allocate memory from the pool." + +void lkdtm_WRITE_RO_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *i; + + pool = pmalloc_create_pool(PMALLOC_MODE_RO); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + i = pmalloc(pool, sizeof(int)); + if (!i) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *i = INT_MAX; + pmalloc_protect_pool(pool); + pr_info("attempting bad pmalloc write at %p\n", i); + *i = 0; /* Note: this will crash and leak the pool memory. */ +} + +void lkdtm_WRITE_AUTO_RO_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *i; + + pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_RO); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + i = pmalloc(pool, sizeof(int)); + if (!i) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *i = INT_MAX; + pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE); + pr_info("attempting bad pmalloc write at %p\n", i); + *i = 0; /* Note: this will crash and leak the pool memory. 
*/ +} + +void lkdtm_WRITE_WR_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *i; + + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + i = pmalloc(pool, sizeof(int)); + if (!i) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *i = INT_MAX; + pmalloc_protect_pool(pool); + pr_info("attempting bad pmalloc write at %p\n", i); + *i = 0; /* Note: this will crash and leak the pool memory. */ +} + +void lkdtm_WRITE_AUTO_WR_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *i; + + pool = pmalloc_create_pool(PMALLOC_MODE_AUTO_WR); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + i = pmalloc(pool, sizeof(int)); + if (!i) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *i = INT_MAX; + pmalloc(pool, PMALLOC_DEFAULT_REFILL_SIZE); + pr_info("attempting bad pmalloc write at %p\n", i); + *i = 0; /* Note: this will crash and leak the pool memory. */ +} + +void lkdtm_WRITE_START_WR_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *i; + + pool = pmalloc_create_pool(PMALLOC_MODE_START_WR); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + i = pmalloc(pool, sizeof(int)); + if (!i) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *i = INT_MAX; + pr_info("attempting bad pmalloc write at %p\n", i); + *i = 0; /* Note: this will crash and leak the pool memory. 
*/ +} + +void lkdtm_WRITE_WR_PMALLOC_ON_RO_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *var_ptr; + + pool = pmalloc_create_pool(PMALLOC_MODE_RO); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + var_ptr = pmalloc(pool, sizeof(int)); + if (!var_ptr) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *var_ptr = INIT_VAL; + pmalloc_protect_pool(pool); + pr_info("attempting illegal write rare to R/O pool"); + if (wr_int(var_ptr, END_VAL)) + pr_info("Unexpected successful write to R/O pool"); + pmalloc_destroy_pool(pool); +} + +void lkdtm_WRITE_WR_PMALLOC_ON_CONST(void) +{ + struct pmalloc_pool *pool; + int *dummy; + bool write_result; + + /* + * The pool operations are only meant to simulate an attacker + * using a random pool as parameter for the attack against the + * const. + */ + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (!pool) { + pr_info(MSG_NO_POOL); + return; + } + dummy = pmalloc(pool, sizeof(*dummy)); + if (!dummy) { + pr_info(MSG_NO_PMEM); + pmalloc_destroy_pool(pool); + return; + } + *dummy = 1; + pmalloc_protect_pool(pool); + pr_info("attempting illegal write rare to const data"); + write_result = wr_int((int *)&const_data, END_VAL); + pmalloc_destroy_pool(pool); + if (write_result || const_data != INIT_VAL) + pr_info("Unexpected successful write to const memory"); +} + +void lkdtm_WRITE_WR_PMALLOC_ON_RO_AFT_INIT(void) +{ + struct pmalloc_pool *pool; + int *dummy; + bool write_result; + + /* + * The pool operations are only meant to simulate an attacker + * using a random pool as parameter for the attack against the + * const. 
+ */ + pool = pmalloc_create_pool(PMALLOC_MODE_WR); + if (WARN(!pool, MSG_NO_POOL)) + return; + dummy = pmalloc(pool, sizeof(*dummy)); + if (WARN(!dummy, MSG_NO_PMEM)) { + pmalloc_destroy_pool(pool); + return; + } + *dummy = 1; + pmalloc_protect_pool(pool); + pr_info("attempting illegal write rare to ro_after_init"); + write_result = wr_int(&ro_after_init_data, END_VAL); + pmalloc_destroy_pool(pool); + WARN(write_result || ro_after_init_data != INIT_VAL, + "Unexpected successful write to ro_after_init memory"); +} +#endif + void lkdtm_WRITE_KERN(void) { size_t size; @@ -200,4 +446,6 @@ void __init lkdtm_perms_init(void) /* Make sure we can write to __ro_after_init values during __init */ ro_after_init |= 0xAA; + /* Make sure we can write to __wr_after_init during __init */ + wr_after_init |= 0xAA; } -- 2.17.1
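For reference, once this patch is applied the new crash types can be exercised from userspace through lkdtm's debugfs interface, by writing the CRASHTYPE() name into the trigger file. The helper below is a hypothetical convenience wrapper; the debugfs path assumes CONFIG_LKDTM=y and a mounted debugfs, and several of these tests deliberately Oops the kernel, so they should only be run in a disposable VM:

```c
#include <stdio.h>

/* Hypothetical userspace trigger for lkdtm crash types: writes the
 * test name into the given trigger file. On a real system the file is
 * /sys/kernel/debug/provoke-crash/DIRECT; the path is a parameter so
 * the helper can also be exercised against an ordinary file.
 */
static int lkdtm_trigger(const char *trigger_path, const char *crash_type)
{
	FILE *f = fopen(trigger_path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%s\n", crash_type) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

/* Example (expected to Oops or WARN, by design):
 *	lkdtm_trigger("/sys/kernel/debug/provoke-crash/DIRECT",
 *		      "WRITE_WR_AFTER_INIT");
 */
```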
When a page is used for virtual memory, it is often necessary to obtain a handle to the corresponding vmap_area, which refers to the virtually contiguous area generated when invoking vmalloc. The struct page has a "private" field, which can be re-used to store a pointer to the parent area. Note: in practice, a virtual memory area is characterized by both a struct vmap_area and a struct vm_struct. The reason for referring from a page to its vmap_area, rather than to the vm_struct, is that the vmap_area contains a struct vm_struct *vm field, which can also be used to reach the information stored in the corresponding vm_struct. This link, however, is unidirectional: given a reference to a vm_struct, it is not possible to easily identify the corresponding vmap_area. Furthermore, the struct vmap_area contains a list head node which is normally used only while the area is queued for freeing, and can be put to other use during normal operations. The direct connection between each page and its vmap_area avoids more expensive searches through the red-black tree of vmap_areas. Therefore, find_vmap_area() can become a static function again, while the rest of the code relies on the direct reference from struct page. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Michal Hocko <mhocko@kernel.org> CC: Vlastimil Babka <vbabka@suse.cz> CC: "Kirill A.
Shutemov" <kirill.shutemov@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Pavel Tatashin <pasha.tatashin@oracle.com> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- include/linux/mm_types.h | 25 ++++++++++++++++++------- include/linux/prmem.h | 13 ++++++++----- include/linux/vmalloc.h | 1 - mm/prmem.c | 2 +- mm/test_pmalloc.c | 12 ++++-------- mm/vmalloc.c | 9 ++++++++- 6 files changed, 39 insertions(+), 23 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5ed8f6292a53..8403bdd12d1f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -87,13 +87,24 @@ struct page { /* See page-flags.h for PAGE_MAPPING_FLAGS */ struct address_space *mapping; pgoff_t index; /* Our offset within mapping. */ - /** - * @private: Mapping-private opaque data. - * Usually used for buffer_heads if PagePrivate. - * Used for swp_entry_t if PageSwapCache. - * Indicates order in the buddy system if PageBuddy. - */ - unsigned long private; + union { + /** + * @private: Mapping-private opaque data. + * Usually used for buffer_heads if + * PagePrivate. + * Used for swp_entry_t if PageSwapCache. + * Indicates order in the buddy system if + * PageBuddy. + */ + unsigned long private; + /** + * @area: reference to the containing area + * For pages that are mapped into a virtually + * contiguous area, avoids performing a more + * expensive lookup. 
+ */ + struct vmap_area *area; + }; }; struct { /* slab, slob and slub */ union { diff --git a/include/linux/prmem.h b/include/linux/prmem.h index 26fd48410d97..cf713fc1c8bb 100644 --- a/include/linux/prmem.h +++ b/include/linux/prmem.h @@ -54,14 +54,17 @@ static __always_inline bool __is_wr_after_init(const void *ptr, size_t size) static __always_inline bool __is_wr_pool(const void *ptr, size_t size) { - struct vmap_area *area; + struct vm_struct *vm; + struct page *page; if (!is_vmalloc_addr(ptr)) return false; - area = find_vmap_area((unsigned long)ptr); - return area && area->vm && (area->vm->size >= size) && - ((area->vm->flags & (VM_PMALLOC | VM_PMALLOC_WR)) == - (VM_PMALLOC | VM_PMALLOC_WR)); + page = vmalloc_to_page(ptr); + if (!(page && page->area && page->area->vm)) + return false; + vm = page->area->vm; + return ((vm->size >= size) && + ((vm->flags & VM_PMALLOC_WR_MASK) == VM_PMALLOC_WR_MASK)); } /** diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 4d14a3b8089e..43a444f8b1e9 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -143,7 +143,6 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size, const void *caller); extern struct vm_struct *remove_vm_area(const void *addr); extern struct vm_struct *find_vm_area(const void *addr); -extern struct vmap_area *find_vmap_area(unsigned long addr); extern int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages); diff --git a/mm/prmem.c b/mm/prmem.c index 7dd13ea43304..96abf04909e7 100644 --- a/mm/prmem.c +++ b/mm/prmem.c @@ -150,7 +150,7 @@ static int grow(struct pmalloc_pool *pool, size_t min_size) if (WARN(!addr, "Failed to allocate %zd bytes", PAGE_ALIGN(size))) return -ENOMEM; - new_area = find_vmap_area((uintptr_t)addr); + new_area = vmalloc_to_page(addr)->area; tag_mask = VM_PMALLOC; if (pool->mode & PMALLOC_WR) tag_mask |= VM_PMALLOC_WR; diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c index f9ee8fb29eea..c64872ff05ea 100644 --- 
a/mm/test_pmalloc.c +++ b/mm/test_pmalloc.c @@ -38,15 +38,11 @@ static bool is_address_protected(void *p) if (unlikely(!is_vmalloc_addr(p))) return false; page = vmalloc_to_page(p); - if (unlikely(!page)) + if (unlikely(!(page && page->area && page->area->vm))) return false; - wmb(); /* Flush changes to the page table - is it needed? */ - area = find_vmap_area((uintptr_t)p); - if (unlikely((!area) || (!area->vm) || - ((area->vm->flags & VM_PMALLOC_PROTECTED_MASK) != - VM_PMALLOC_PROTECTED_MASK))) - return false; - return true; + area = page->area; + return (area->vm->flags & VM_PMALLOC_PROTECTED_MASK) == + VM_PMALLOC_PROTECTED_MASK; } static bool create_and_destroy_pool(void) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 15850005fea5..ffef705f0523 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -742,7 +742,7 @@ static void free_unmap_vmap_area(struct vmap_area *va) free_vmap_area_noflush(va); } -struct vmap_area *find_vmap_area(unsigned long addr) +static struct vmap_area *find_vmap_area(unsigned long addr) { struct vmap_area *va; @@ -1523,6 +1523,7 @@ static void __vunmap(const void *addr, int deallocate_pages) struct page *page = area->pages[i]; BUG_ON(!page); + page->area = NULL; __free_pages(page, 0); } @@ -1731,8 +1732,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, const void *caller) { struct vm_struct *area; + struct vmap_area *va; void *addr; unsigned long real_size = size; + unsigned int i; size = PAGE_ALIGN(size); if (!size || (size >> PAGE_SHIFT) > totalram_pages) @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, if (!addr) return NULL; + va = __find_vmap_area((unsigned long)addr); + for (i = 0; i < va->vm->nr_pages; i++) + va->vm->pages[i]->area = va; + /* * In this function, newly allocated vm_struct has VM_UNINITIALIZED * flag. It means that vm_struct is not fully initialized. -- 2.17.1
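The core idea of the patch, caching a back-pointer in an otherwise free field to replace a tree search with a single dereference, can be sketched in isolation. struct area and struct page_model below are illustrative stand-ins, not the kernel's structures:

```c
#include <stddef.h>

/* Minimal model of the pattern: each "page" re-uses its otherwise-free
 * private slot to point back at the area that owns it, turning an
 * expensive lookup into one dereference.
 */
struct area;

struct page_model {
	union {
		unsigned long private;	/* original opaque use */
		struct area *area;	/* back-pointer when vmapped */
	};
};

struct area {
	struct page_model *pages[4];
	unsigned int nr_pages;
};

/* Done once, right after the area's pages are set up (mirrors the
 * loop the patch adds to __vmalloc_node_range()). */
static void area_link_pages(struct area *a)
{
	unsigned int i;

	for (i = 0; i < a->nr_pages; i++)
		a->pages[i]->area = a;
}

/* O(1) replacement for searching the tree of areas. */
static struct area *page_to_area(struct page_model *p)
{
	return p->area;
}
```

On teardown the back-pointer must be cleared before the page is freed, which is what the `page->area = NULL;` line in __vunmap() does.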
Prevent leaks of protected memory to userspace. Protection against overwriting from userspace is already in place, once the memory is write protected. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Kees Cook <keescook@chromium.org> CC: Chris von Recklinghausen <crecklin@redhat.com> CC: linux-mm@kvack.org CC: linux-kernel@vger.kernel.org --- include/linux/prmem.h | 24 ++++++++++++++++++++++++ mm/usercopy.c | 5 +++++ 2 files changed, 29 insertions(+) diff --git a/include/linux/prmem.h b/include/linux/prmem.h index cf713fc1c8bb..919d853ddc15 100644 --- a/include/linux/prmem.h +++ b/include/linux/prmem.h @@ -273,6 +273,30 @@ struct pmalloc_pool { uint8_t mode; }; +void __noreturn usercopy_abort(const char *name, const char *detail, + bool to_user, unsigned long offset, + unsigned long len); + +/** + * check_pmalloc_object() - helper for hardened usercopy + * @ptr: the beginning of the memory to check + * @n: the size of the memory to check + * @to_user: copy to userspace or from userspace + * + * If the check passes, it falls through; otherwise it aborts. + * The function is inlined, to minimize the performance impact of the + * extra check that can end up on a hot path. + * Non-exhaustive micro benchmarking with QEMU x86_64 shows a reduction of + * the time spent in this fragment by 60%, when inlined.
+ */ +static inline +void check_pmalloc_object(const void *ptr, unsigned long n, bool to_user) +{ + if (unlikely(__is_wr_after_init(ptr, n) || __is_wr_pool(ptr, n))) + usercopy_abort("pmalloc", "accessing pmalloc obj", to_user, + (const unsigned long)ptr, n); +} + /* * The write rare functionality is fully implemented as __always_inline, * to prevent having an internal function call that is capable of modifying diff --git a/mm/usercopy.c b/mm/usercopy.c index 852eb4e53f06..a080dd37b684 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -22,8 +22,10 @@ #include <linux/thread_info.h> #include <linux/atomic.h> #include <linux/jump_label.h> +#include <linux/prmem.h> #include <asm/sections.h> + /* * Checks if a given pointer and length is contained by the current * stack frame (if possible). @@ -284,6 +286,9 @@ void __check_object_size(const void *ptr, unsigned long n, bool to_user) /* Check for object in kernel to avoid text exposure. */ check_kernel_text_object((const unsigned long)ptr, n, to_user); + + /* Check if object is from a pmalloc chunk. */ + check_pmalloc_object(ptr, n, to_user); } EXPORT_SYMBOL(__check_object_size); -- 2.17.1
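The essence of the check added to __check_object_size() is an overlap test between the copy window and a protected range; when it hits, the real code calls usercopy_abort(). A minimal standalone model, with struct protected_range standing in for the __wr_after_init section or a protected vmap_area:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of the hardened-usercopy check: flag any copy
 * window [ptr, ptr + n) that overlaps a protected range [start, end).
 * In the kernel, a hit causes usercopy_abort() instead of returning.
 */
struct protected_range {
	uintptr_t start;
	uintptr_t end;		/* exclusive */
};

static bool overlaps_protected(const struct protected_range *r,
			       uintptr_t ptr, size_t n)
{
	/* An empty copy can never expose or clobber anything. */
	if (!n)
		return false;
	return ptr < r->end && ptr + n > r->start;
}
```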
Documentation for protected memory. Topics covered: * static memory allocation * dynamic memory allocation * write-rare Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Jonathan Corbet <corbet@lwn.net> CC: Randy Dunlap <rdunlap@infradead.org> CC: Mike Rapoport <rppt@linux.vnet.ibm.com> CC: linux-doc@vger.kernel.org CC: linux-kernel@vger.kernel.org --- Documentation/core-api/index.rst | 1 + Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++ MAINTAINERS | 1 + 3 files changed, 174 insertions(+) create mode 100644 Documentation/core-api/prmem.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 26b735cefb93..1a90fa878d8d 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -31,6 +31,7 @@ Core utilities gfp_mask-from-fs-io timekeeping boot-time-mm + prmem Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst new file mode 100644 index 000000000000..16d7edfe327a --- /dev/null +++ b/Documentation/core-api/prmem.rst @@ -0,0 +1,172 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _prmem: + +Memory Protection +================= + +:Date: October 2018 +:Author: Igor Stoppa <igor.stoppa@huawei.com> + +Foreword +-------- +- In a typical system using some sort of RAM as execution environment, + **all** memory is initially writable. + +- It must be initialized with the appropriate content, be it code or data. + +- Said content typically undergoes modifications, e.g. relocations or + relocation-induced changes. + +- The present document doesn't address such transient states. + +- Kernel code is protected at system level and, unlike data, it doesn't + require special attention. + +Protection mechanism +-------------------- + +- When available, the MMU can write protect memory pages that would be + otherwise writable. + +- The protection has page-level granularity.
+ +- An attempt to overwrite a protected page will trigger an exception. +- **Write protected data must go exclusively to write protected pages** +- **Writable data must go exclusively to writable pages** + +Available protections for kernel data +------------------------------------- + +- **constant** + Labelled as **const**, the data is never supposed to be altered. + It is statically allocated - if it has any memory footprint at all. + The compiler can even optimize it away, where possible, by replacing + references to a **const** with its actual value. + +- **read only after init** + By tagging an otherwise ordinary statically allocated variable with + **__ro_after_init**, it is placed in a special segment that will + become write protected at the end of the kernel init phase. + The compiler has no notion of this restriction and it will treat any + write operation on such a variable as legal. However, assignments + attempted after the write protection is in place will cause + exceptions. + +- **write rare after init** + This can be seen as a variant of read only after init, which uses the + tag **__wr_after_init**. It is also limited to statically allocated + memory. It is still possible to alter this type of variable after + the kernel init phase is complete; however, it can be done exclusively + with special functions, instead of the assignment operator. Using the + assignment operator after conclusion of the init phase will still + trigger an exception. It is not possible to transition a certain + variable from __wr_after_init to a permanent read-only status, at + runtime. + +- **dynamically allocated write-rare / read-only** + After defining a pool, memory can be obtained through it, primarily + through the **pmalloc()** allocator. The exact writability state of the + memory obtained from **pmalloc()** and friends can be configured when + creating the pool.
At any point it is possible to transition the memory currently + associated with the pool to a less permissive write status. + Once memory has become read-only, the only valid operation, besides + reading, is to release it, by destroying the pool it belongs to. + + +Protecting dynamically allocated memory +--------------------------------------- + +When dealing with dynamically allocated memory, three options are + available for configuring its writability state: + +- **Options selected when creating a pool** + When creating the pool, it is possible to choose one of the following: + - **PMALLOC_MODE_RO** + - Writability at allocation time: *WRITABLE* + - Writability at protection time: *NONE* + - **PMALLOC_MODE_WR** + - Writability at allocation time: *WRITABLE* + - Writability at protection time: *WRITE-RARE* + - **PMALLOC_MODE_AUTO_RO** + - Writability at allocation time: + - the latest allocation: *WRITABLE* + - every other allocation: *NONE* + - Writability at protection time: *NONE* + - **PMALLOC_MODE_AUTO_WR** + - Writability at allocation time: + - the latest allocation: *WRITABLE* + - every other allocation: *WRITE-RARE* + - Writability at protection time: *WRITE-RARE* + - **PMALLOC_MODE_START_WR** + - Writability at allocation time: *WRITE-RARE* + - Writability at protection time: *WRITE-RARE* + + **Remarks:** + - The "AUTO" modes perform automatic protection of the content, whenever + the current vmap_area is used up and a new one is allocated. + - At that point, the vmap_area being phased out is protected. + - The size of the vmap_area depends on various parameters. + - It might not be possible to know for sure *when* certain data will + be protected. + - The functionality is provided as a tradeoff between hardening and speed. + - Its usefulness depends on the specific use case at hand. + - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed.
+ +- **Protecting the pool** + This is achieved with **pmalloc_protect_pool()** + - Any vmap_area currently in the pool is write-protected according to its initial configuration. + - Any residual space still available from the current vmap_area is lost, as the area is protected. + - **protecting a pool after every allocation will likely be very wasteful** + - Using PMALLOC_MODE_START_WR is likely a better choice. + +- **Upgrading the protection level** + This is achieved with **pmalloc_make_pool_ro()** + - it turns the present content of a write-rare pool into read-only + - can be useful when the content of the memory has settled + + +Caveats +------- +- Freeing of memory is not supported. Pages will be returned to the + system upon destruction of their memory pool. + +- The address range available for vmalloc (and thus for pmalloc too) is + limited, on 32-bit systems. However, it shouldn't be an issue, since not + much data is expected to be dynamically allocated and then + write-protected. + +- Regarding SMP systems, changing the state of pages and altering mappings + requires performing cross-processor synchronizations of page tables. + This is an additional reason for limiting the use of write rare. + +- Not only must the pmalloc memory be protected, but also any reference to + it that might become the target of an attack. The attack would replace + a reference to the protected memory with a reference to some other, + unprotected, memory. + +- The users of rare write must take care of ensuring the atomicity of the + action, with respect to the way they use the data being altered; for + example, take a lock before making a copy of the value to modify (if it's + relevant), then alter it, issue the call to rare write and finally + release the lock. Some special scenarios might be exempt from the need + for locking, but in general rare-write must be treated as an operation + that can incur races.
+ +- pmalloc relies on virtual memory areas and will therefore use more + TLB entries. It still does a better job of it, compared to invoking + vmalloc for each allocation, but it is undeniably less optimized with + respect to TLB use than using the physmap directly, through kmalloc or similar. + + +Utilization +----------- + +**add examples here** + +API +--- + +.. kernel-doc:: include/linux/prmem.h +.. kernel-doc:: mm/prmem.c +.. kernel-doc:: include/linux/prmemextra.h diff --git a/MAINTAINERS b/MAINTAINERS index ea979a5a9ec9..246b1a1cc8bb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9463,6 +9463,7 @@ F: include/linux/prmemextra.h F: mm/prmem.c F: mm/test_write_rare.c F: mm/test_pmalloc.c +F: Documentation/core-api/prmem.rst MEMORY MANAGEMENT L: linux-mm@kvack.org -- 2.17.1
Using a list_head in an unnamed union poses a problem with the current implementation of the initializer, since it doesn't specify the names of the fields it is initializing. This patch makes it use designated initializers. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Kate Stewart <kstewart@linuxfoundation.org> CC: "David S. Miller" <davem@davemloft.net> CC: Edward Cree <ecree@solarflare.com> CC: Philippe Ombredanne <pombredanne@nexb.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: linux-kernel@vger.kernel.org --- include/linux/list.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/list.h b/include/linux/list.h index de04cc5ed536..184a7b60436f 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -18,7 +18,10 @@ * using the generic single-entry routines. */ -#define LIST_HEAD_INIT(name) { &(name), &(name) } +#define LIST_HEAD_INIT(name) { \ + .next = &(name), \ + .prev = &(name), \ +} #define LIST_HEAD(name) \ struct list_head name = LIST_HEAD_INIT(name) -- 2.17.1
As preparation for using write rare on the nodes of various types of lists, specify that the fields in the basic data structures must be aligned to sizeof(void *). This is meant to ensure that any static allocation will not cross a page boundary, to allow pointers to be updated in one step. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Andrew Morton <akpm@linux-foundation.org> CC: Masahiro Yamada <yamada.masahiro@socionext.com> CC: Alexey Dobriyan <adobriyan@gmail.com> CC: Pekka Enberg <penberg@kernel.org> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: Lihao Liang <lianglihao@huawei.com> CC: linux-kernel@vger.kernel.org --- include/linux/types.h | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/include/linux/types.h b/include/linux/types.h index 9834e90aa010..53609bbdcf0f 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -183,17 +183,29 @@ typedef struct { } atomic64_t; #endif +#ifdef CONFIG_PRMEM struct list_head { - struct list_head *next, *prev; -}; + struct list_head *next __aligned(sizeof(void *)); + struct list_head *prev __aligned(sizeof(void *)); +} __aligned(sizeof(void *)); -struct hlist_head { - struct hlist_node *first; +struct hlist_node { + struct hlist_node *next __aligned(sizeof(void *)); + struct hlist_node **pprev __aligned(sizeof(void *)); +} __aligned(sizeof(void *)); +#else +struct list_head { + struct list_head *next, *prev; }; struct hlist_node { struct hlist_node *next, **pprev; }; +#endif + +struct hlist_head { + struct hlist_node *first; +}; struct ustat { __kernel_daddr_t f_tfree; -- 2.17.1
Some of the data structures used in list management are composed of two pointers. Since the kernel is now configured by default to randomize the layout of data structures solely composed of pointers, this might prevent correct type punning between these structures and their write rare counterpart. Anyway, it shouldn't be a big loss in terms of security: with only two fields, there is a 50% chance of guessing the layout correctly. The randomization is disabled only when write rare is enabled. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Kees Cook <keescook@chromium.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Andrew Morton <akpm@linux-foundation.org> CC: Masahiro Yamada <yamada.masahiro@socionext.com> CC: Alexey Dobriyan <adobriyan@gmail.com> CC: Pekka Enberg <penberg@kernel.org> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: Lihao Liang <lianglihao@huawei.com> CC: linux-kernel@vger.kernel.org --- include/linux/types.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/types.h b/include/linux/types.h index 53609bbdcf0f..a9f6f6515fdc 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -187,12 +187,12 @@ typedef struct { struct list_head { struct list_head *next __aligned(sizeof(void *)); struct list_head *prev __aligned(sizeof(void *)); -} __aligned(sizeof(void *)); +} __no_randomize_layout __aligned(sizeof(void *)); struct hlist_node { struct hlist_node *next __aligned(sizeof(void *)); struct hlist_node **pprev __aligned(sizeof(void *)); -} __aligned(sizeof(void *)); +} __no_randomize_layout __aligned(sizeof(void *)); #else struct list_head { struct list_head *next, *prev; -- 2.17.1
In some cases, all the data needing protection can be allocated from a pool in one go, as directly writable, then initialized and protected. The sequence is relatively short and it's acceptable to leave the entire data set unprotected. In other cases, this is not possible, because the data will trickle in over a relatively long period of time, in an unpredictable way, possibly for the entire duration of the operations. For these cases, the safe approach is to have the memory already write protected, when allocated. However, this will require replacing any direct assignment with calls to functions that can perform write rare. Since lists are one of the most commonly used data structures in the kernel, they are the first candidate for receiving write rare extensions. This patch implements basic functionality for altering said lists. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Kate Stewart <kstewart@linuxfoundation.org> CC: "David S. Miller" <davem@davemloft.net> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Philippe Ombredanne <pombredanne@nexb.com> CC: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> CC: Josh Triplett <josh@joshtriplett.org> CC: Steven Rostedt <rostedt@goodmis.org> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: Lai Jiangshan <jiangshanlai@gmail.com> CC: linux-kernel@vger.kernel.org --- MAINTAINERS | 1 + include/linux/prlist.h | 934 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 935 insertions(+) create mode 100644 include/linux/prlist.h diff --git a/MAINTAINERS b/MAINTAINERS index 246b1a1cc8bb..f5689c014e07 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9464,6 +9464,7 @@ F: mm/prmem.c F: mm/test_write_rare.c F: mm/test_pmalloc.c F: Documentation/core-api/prmem.rst +F: include/linux/prlist.h MEMORY MANAGEMENT L: linux-mm@kvack.org diff --git a/include/linux/prlist.h b/include/linux/prlist.h new file mode 100644 index 000000000000..0387c78f8be8 --- /dev/null +++ b/include/linux/prlist.h @@ -0,0 +1,934 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * prlist.h: Header for Protected Lists + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + * + * Code from <linux/list.h> and <linux/rculist.h>, adapted to perform + * writes on write-rare data. + * These functions and macros rely on data structures that allow the reuse + * of what is already provided for reading the content of their non-write + * rare variant. 
+ */ + +#ifndef _LINUX_PRLIST_H +#define _LINUX_PRLIST_H + +#include <linux/list.h> +#include <linux/kernel.h> +#include <linux/prmemextra.h> + +/* --------------- Circular Protected Doubly Linked List --------------- */ +union prlist_head { + struct list_head list __aligned(sizeof(void *)); + struct { + union prlist_head *next __aligned(sizeof(void *)); + union prlist_head *prev __aligned(sizeof(void *)); + } __no_randomize_layout; +} __aligned(sizeof(void *)); + +static __always_inline +union prlist_head *to_prlist_head(struct list_head *list) +{ + return container_of(list, union prlist_head, list); +} + +#define PRLIST_HEAD_INIT(name) { \ + .list = LIST_HEAD_INIT(name), \ +} + +#define PRLIST_HEAD(name) \ + union prlist_head name __wr_after_init = PRLIST_HEAD_INIT(name.list) + +static __always_inline +struct pmalloc_pool *prlist_create_custom_pool(size_t refill, + unsigned short align_order) +{ + return pmalloc_create_custom_pool(refill, align_order, + PMALLOC_MODE_START_WR); +} + +static __always_inline struct pmalloc_pool *prlist_create_pool(void) +{ + return prlist_create_custom_pool(PMALLOC_REFILL_DEFAULT, + PMALLOC_ALIGN_ORDER_DEFAULT); +} + +static __always_inline +void prlist_set_prev(union prlist_head *head, + const union prlist_head *prev) +{ + wr_ptr(&head->prev, prev); +} + +static __always_inline +void prlist_set_next(union prlist_head *head, + const union prlist_head *next) +{ + wr_ptr(&head->next, next); +} + +static __always_inline void INIT_PRLIST_HEAD(union prlist_head *head) +{ + prlist_set_prev(head, head); + prlist_set_next(head, head); +} + +/* + * Insert a new entry between two known consecutive entries. + * + * This is only for internal list manipulation where we know + * the prev/next entries already! 
+ */ +static __always_inline +void __prlist_add(union prlist_head *new, union prlist_head *prev, + union prlist_head *next) +{ + if (!__list_add_valid(&new->list, &prev->list, &next->list)) + return; + + prlist_set_prev(next, new); + prlist_set_next(new, next); + prlist_set_prev(new, prev); + prlist_set_next(prev, new); +} + +/** + * prlist_add - add a new entry + * @new: new entry to be added + * @head: prlist head to add it after + * + * Insert a new entry after the specified head. + * This is good for implementing stacks. + */ +static __always_inline +void prlist_add(union prlist_head *new, union prlist_head *head) +{ + __prlist_add(new, head, head->next); +} + +/** + * prlist_add_tail - add a new entry + * @new: new entry to be added + * @head: list head to add it before + * + * Insert a new entry before the specified head. + * This is useful for implementing queues. + */ +static __always_inline +void prlist_add_tail(union prlist_head *new, union prlist_head *head) +{ + __prlist_add(new, head->prev, head); +} + +/* + * Delete a prlist entry by making the prev/next entries + * point to each other. + * + * This is only for internal list manipulation where we know + * the prev/next entries already! + */ +static __always_inline +void __prlist_del(union prlist_head *prev, union prlist_head *next) +{ + prlist_set_prev(next, prev); + prlist_set_next(prev, next); +} + +/** + * prlist_del - deletes entry from list. + * @entry: the element to delete from the list. + * Note: list_empty() on entry does not return true after this, the entry is + * in an undefined state. 
+ */ +static inline void __prlist_del_entry(union prlist_head *entry) +{ + if (!__list_del_entry_valid(&entry->list)) + return; + __prlist_del(entry->prev, entry->next); +} + +static __always_inline void prlist_del(union prlist_head *entry) +{ + __prlist_del_entry(entry); + prlist_set_next(entry, LIST_POISON1); + prlist_set_prev(entry, LIST_POISON2); +} + +/** + * prlist_replace - replace old entry by new one + * @old : the element to be replaced + * @new : the new element to insert + * + * If @old was empty, it will be overwritten. + */ +static __always_inline +void prlist_replace(union prlist_head *old, union prlist_head *new) +{ + prlist_set_next(new, old->next); + prlist_set_prev(new->next, new); + prlist_set_prev(new, old->prev); + prlist_set_next(new->prev, new); +} + +static __always_inline +void prlist_replace_init(union prlist_head *old, union prlist_head *new) +{ + prlist_replace(old, new); + INIT_PRLIST_HEAD(old); +} + +/** + * prlist_del_init - deletes entry from list and reinitialize it. + * @entry: the element to delete from the list. 
+ */ +static __always_inline void prlist_del_init(union prlist_head *entry) +{ + __prlist_del_entry(entry); + INIT_PRLIST_HEAD(entry); +} + +/** + * prlist_move - delete from one list and add as another's head + * @list: the entry to move + * @head: the head that will precede our entry + */ +static __always_inline +void prlist_move(union prlist_head *list, union prlist_head *head) +{ + __prlist_del_entry(list); + prlist_add(list, head); +} + +/** + * prlist_move_tail - delete from one list and add as another's tail + * @list: the entry to move + * @head: the head that will follow our entry + */ +static __always_inline +void prlist_move_tail(union prlist_head *list, union prlist_head *head) +{ + __prlist_del_entry(list); + prlist_add_tail(list, head); +} + +/** + * prlist_rotate_left - rotate the list to the left + * @head: the head of the list + */ +static __always_inline void prlist_rotate_left(union prlist_head *head) +{ + union prlist_head *first; + + if (!list_empty(&head->list)) { + first = head->next; + prlist_move_tail(first, head); + } +} + +static __always_inline +void __prlist_cut_position(union prlist_head *list, union prlist_head *head, + union prlist_head *entry) +{ + union prlist_head *new_first = entry->next; + + prlist_set_next(list, head->next); + prlist_set_prev(list->next, list); + prlist_set_prev(list, entry); + prlist_set_next(entry, list); + prlist_set_next(head, new_first); + prlist_set_prev(new_first, head); +} + +/** + * prlist_cut_position - cut a list into two + * @list: a new list to add all removed entries + * @head: a list with entries + * @entry: an entry within head, could be the head itself + * and if so we won't cut the list + * + * This helper moves the initial part of @head, up to and + * including @entry, from @head to @list. You should + * pass on @entry an element you know is on @head. @list + * should be an empty list or a list you do not care about + * losing its data. 
+ * + */ +static __always_inline +void prlist_cut_position(union prlist_head *list, union prlist_head *head, + union prlist_head *entry) +{ + if (list_empty(&head->list)) + return; + if (list_is_singular(&head->list) && + (head->next != entry && head != entry)) + return; + if (entry == head) + INIT_PRLIST_HEAD(list); + else + __prlist_cut_position(list, head, entry); +} + +/** + * prlist_cut_before - cut a list into two, before given entry + * @list: a new list to add all removed entries + * @head: a list with entries + * @entry: an entry within head, could be the head itself + * + * This helper moves the initial part of @head, up to but + * excluding @entry, from @head to @list. You should pass + * in @entry an element you know is on @head. @list should + * be an empty list or a list you do not care about losing + * its data. + * If @entry == @head, all entries on @head are moved to + * @list. + */ +static __always_inline +void prlist_cut_before(union prlist_head *list, union prlist_head *head, + union prlist_head *entry) +{ + if (head->next == entry) { + INIT_PRLIST_HEAD(list); + return; + } + prlist_set_next(list, head->next); + prlist_set_prev(list->next, list); + prlist_set_prev(list, entry->prev); + prlist_set_next(list->prev, list); + prlist_set_next(head, entry); + prlist_set_prev(entry, head); +} + +static __always_inline +void __prlist_splice(const union prlist_head *list, union prlist_head *prev, + union prlist_head *next) +{ + union prlist_head *first = list->next; + union prlist_head *last = list->prev; + + prlist_set_prev(first, prev); + prlist_set_next(prev, first); + prlist_set_next(last, next); + prlist_set_prev(next, last); +} + +/** + * prlist_splice - join two lists, this is designed for stacks + * @list: the new list to add. + * @head: the place to add it in the first list. 
+ */ +static __always_inline +void prlist_splice(const union prlist_head *list, union prlist_head *head) +{ + if (!list_empty(&list->list)) + __prlist_splice(list, head, head->next); +} + +/** + * prlist_splice_tail - join two lists, each list being a queue + * @list: the new list to add. + * @head: the place to add it in the first list. + */ +static __always_inline +void prlist_splice_tail(union prlist_head *list, union prlist_head *head) +{ + if (!list_empty(&list->list)) + __prlist_splice(list, head->prev, head); +} + +/** + * prlist_splice_init - join two lists and reinitialise the emptied list. + * @list: the new list to add. + * @head: the place to add it in the first list. + * + * The list at @list is reinitialised + */ +static __always_inline +void prlist_splice_init(union prlist_head *list, union prlist_head *head) +{ + if (!list_empty(&list->list)) { + __prlist_splice(list, head, head->next); + INIT_PRLIST_HEAD(list); + } +} + +/** + * prlist_splice_tail_init - join 2 lists and reinitialise the emptied list + * @list: the new list to add. + * @head: the place to add it in the first list. + * + * Each of the lists is a queue. 
+ * The list at @list is reinitialised + */ +static __always_inline +void prlist_splice_tail_init(union prlist_head *list, + union prlist_head *head) +{ + if (!list_empty(&list->list)) { + __prlist_splice(list, head->prev, head); + INIT_PRLIST_HEAD(list); + } +} + +/* ---- Protected Doubly Linked List with single pointer list head ---- */ +union prhlist_head { + struct hlist_head head __aligned(sizeof(void *)); + union prhlist_node *first __aligned(sizeof(void *)); +} __aligned(sizeof(void *)); + +union prhlist_node { + struct hlist_node node __aligned(sizeof(void *)) + ; + struct { + union prhlist_node *next __aligned(sizeof(void *)); + union prhlist_node **pprev __aligned(sizeof(void *)); + } __no_randomize_layout; +} __aligned(sizeof(void *)); + +#define PRHLIST_HEAD_INIT { \ + .head = HLIST_HEAD_INIT, \ +} + +#define PRHLIST_HEAD(name) \ + union prhlist_head name __wr_after_init = PRHLIST_HEAD_INIT + + +#define is_static(object) \ + unlikely(wr_check_boundaries(object, sizeof(*object))) + +static __always_inline +struct pmalloc_pool *prhlist_create_custom_pool(size_t refill, + unsigned short align_order) +{ + return pmalloc_create_custom_pool(refill, align_order, + PMALLOC_MODE_AUTO_WR); +} + +static __always_inline +struct pmalloc_pool *prhlist_create_pool(void) +{ + return prhlist_create_custom_pool(PMALLOC_REFILL_DEFAULT, + PMALLOC_ALIGN_ORDER_DEFAULT); +} + +static __always_inline +void prhlist_set_first(union prhlist_head *head, union prhlist_node *first) +{ + wr_ptr(&head->first, first); +} + +static __always_inline +void prhlist_set_next(union prhlist_node *node, union prhlist_node *next) +{ + wr_ptr(&node->next, next); +} + +static __always_inline +void prhlist_set_pprev(union prhlist_node *node, union prhlist_node **pprev) +{ + wr_ptr(&node->pprev, pprev); +} + +static __always_inline +void prhlist_set_prev(union prhlist_node *node, union prhlist_node *prev) +{ + wr_ptr(node->pprev, prev); +} + +static __always_inline void INIT_PRHLIST_HEAD(union 
prhlist_head *head) +{ + prhlist_set_first(head, NULL); +} + +static __always_inline void INIT_PRHLIST_NODE(union prhlist_node *node) +{ + prhlist_set_next(node, NULL); + prhlist_set_pprev(node, NULL); +} + +static __always_inline void __prhlist_del(union prhlist_node *n) +{ + union prhlist_node *next = n->next; + union prhlist_node **pprev = n->pprev; + + wr_ptr(pprev, next); + if (next) + prhlist_set_pprev(next, pprev); +} + +static __always_inline void prhlist_del(union prhlist_node *n) +{ + __prhlist_del(n); + prhlist_set_next(n, LIST_POISON1); + prhlist_set_pprev(n, LIST_POISON2); +} + +static __always_inline void prhlist_del_init(union prhlist_node *n) +{ + if (!hlist_unhashed(&n->node)) { + __prhlist_del(n); + INIT_PRHLIST_NODE(n); + } +} + +static __always_inline +void prhlist_add_head(union prhlist_node *n, union prhlist_head *h) +{ + union prhlist_node *first = h->first; + + prhlist_set_next(n, first); + if (first) + prhlist_set_pprev(first, &n->next); + prhlist_set_first(h, n); + prhlist_set_pprev(n, &h->first); +} + +/* next must be != NULL */ +static __always_inline +void prhlist_add_before(union prhlist_node *n, union prhlist_node *next) +{ + prhlist_set_pprev(n, next->pprev); + prhlist_set_next(n, next); + prhlist_set_pprev(next, &n->next); + prhlist_set_prev(n, n); +} + +static __always_inline +void prhlist_add_behind(union prhlist_node *n, union prhlist_node *prev) +{ + prhlist_set_next(n, prev->next); + prhlist_set_next(prev, n); + prhlist_set_pprev(n, &prev->next); + if (n->next) + prhlist_set_pprev(n->next, &n->next); +} + +/* after that we'll appear to be on some hlist and hlist_del will work */ +static __always_inline void prhlist_add_fake(union prhlist_node *n) +{ + prhlist_set_pprev(n, &n->next); +} + +/* + * Move a list from one list head to another. Fixup the pprev + * reference of the first entry if it exists. 
+ */ +static __always_inline +void prhlist_move_list(union prhlist_head *old, union prhlist_head *new) +{ + prhlist_set_first(new, old->first); + if (new->first) + prhlist_set_pprev(new->first, &new->first); + prhlist_set_first(old, NULL); +} + +/* ------------------------ RCU list and hlist ------------------------ */ + +/* + * INIT_LIST_HEAD_RCU - Initialize a list_head visible to RCU readers + * @head: list to be initialized + * + * It is exactly equivalent to INIT_LIST_HEAD() + */ +static __always_inline void INIT_PRLIST_HEAD_RCU(union prlist_head *head) +{ + INIT_PRLIST_HEAD(head); +} + +/* + * Insert a new entry between two known consecutive entries. + * + * This is only for internal list manipulation where we know + * the prev/next entries already! + */ +static __always_inline +void __prlist_add_rcu(union prlist_head *new, union prlist_head *prev, + union prlist_head *next) +{ + if (!__list_add_valid(&new->list, &prev->list, &next->list)) + return; + prlist_set_next(new, next); + prlist_set_prev(new, prev); + wr_rcu_assign_pointer(list_next_rcu(&prev->list), new); + prlist_set_prev(next, new); +} + +/** + * prlist_add_rcu - add a new entry to rcu-protected prlist + * @new: new entry to be added + * @head: prlist head to add it after + * + * Insert a new entry after the specified head. + * This is good for implementing stacks. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another prlist-mutation primitive, such as prlist_add_rcu() + * or prlist_del_rcu(), running on this same list. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * list_for_each_entry_rcu(). 
+ */ +static __always_inline +void prlist_add_rcu(union prlist_head *new, union prlist_head *head) +{ + __prlist_add_rcu(new, head, head->next); +} + +/** + * prlist_add_tail_rcu - add a new entry to rcu-protected prlist + * @new: new entry to be added + * @head: prlist head to add it before + * + * Insert a new entry before the specified head. + * This is useful for implementing queues. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another prlist-mutation primitive, such as prlist_add_tail_rcu() + * or prlist_del_rcu(), running on this same list. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * list_for_each_entry_rcu(). + */ +static __always_inline +void prlist_add_tail_rcu(union prlist_head *new, union prlist_head *head) +{ + __prlist_add_rcu(new, head->prev, head); +} + +/** + * prlist_del_rcu - deletes entry from prlist without re-initialization + * @entry: the element to delete from the prlist. + * + * Note: list_empty() on entry does not return true after this, + * the entry is in an undefined state. It is useful for RCU based + * lockfree traversal. + * + * In particular, it means that we can not poison the forward + * pointers that may still be used for walking the list. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another list-mutation primitive, such as prlist_del_rcu() + * or prlist_add_rcu(), running on this same prlist. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * list_for_each_entry_rcu(). + * + * Note that the caller is not permitted to immediately free + * the newly deleted entry. Instead, either synchronize_rcu() + * or call_rcu() must be used to defer freeing until an RCU + * grace period has elapsed. 
+ */ +static __always_inline void prlist_del_rcu(union prlist_head *entry) +{ + __prlist_del_entry(entry); + prlist_set_prev(entry, LIST_POISON2); +} + +/** + * prhlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on the node return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ +static __always_inline void prhlist_del_init_rcu(union prhlist_node *n) +{ + if (!hlist_unhashed(&n->node)) { + __prhlist_del(n); + prhlist_set_pprev(n, NULL); + } +} + +/** + * prlist_replace_rcu - replace old entry by new one + * @old : the element to be replaced + * @new : the new element to insert + * + * The @old entry will be replaced with the @new entry atomically. + * Note: @old should not be empty. + */ +static __always_inline +void prlist_replace_rcu(union prlist_head *old, union prlist_head *new) +{ + prlist_set_next(new, old->next); + prlist_set_prev(new, old->prev); + wr_rcu_assign_pointer(list_next_rcu(&new->prev->list), new); + prlist_set_prev(new->next, new); + prlist_set_prev(old, LIST_POISON2); +} + +/** + * __prlist_splice_init_rcu - join an RCU-protected list into an existing list. 
+ * @list: the RCU-protected list to splice + * @prev: points to the last element of the existing list + * @next: points to the first element of the existing list + * @sync: function to sync: synchronize_rcu(), synchronize_sched(), ... + * + * The list pointed to by @prev and @next can be RCU-read traversed + * concurrently with this function. + * + * Note that this function blocks. + * + * Important note: the caller must take whatever action is necessary to prevent + * any other updates to the existing list. In principle, it is possible to + * modify the list as soon as sync() begins execution. If this sort of thing + * becomes necessary, an alternative version based on call_rcu() could be + * created. But only if -really- needed -- there is no shortage of RCU API + * members. + */ +static __always_inline +void __prlist_splice_init_rcu(union prlist_head *list, + union prlist_head *prev, + union prlist_head *next, void (*sync)(void)) +{ + union prlist_head *first = list->next; + union prlist_head *last = list->prev; + + /* + * "first" and "last" tracking list, so initialize it. RCU readers + * have access to this list, so we must use INIT_LIST_HEAD_RCU() + * instead of INIT_LIST_HEAD(). + */ + + INIT_PRLIST_HEAD_RCU(list); + + /* + * At this point, the list body still points to the source list. + * Wait for any readers to finish using the list before splicing + * the list body into the new list. Any new readers will see + * an empty list. + */ + + sync(); + + /* + * Readers are finished with the source list, so perform splice. + * The order is important if the new list is global and accessible + * to concurrent RCU readers. Note that RCU readers are not + * permitted to traverse the prev pointers without excluding + * this function. 
+ */ + + prlist_set_next(last, next); + wr_rcu_assign_pointer(list_next_rcu(&prev->list), first); + prlist_set_prev(first, prev); + prlist_set_prev(next, last); +} + +/** + * prlist_splice_init_rcu - splice an RCU-protected list into an existing + * list, designed for stacks. + * @list: the RCU-protected list to splice + * @head: the place in the existing list to splice the first list into + * @sync: function to sync: synchronize_rcu(), synchronize_sched(), ... + */ +static __always_inline +void prlist_splice_init_rcu(union prlist_head *list, + union prlist_head *head, + void (*sync)(void)) +{ + if (!list_empty(&list->list)) + __prlist_splice_init_rcu(list, head, head->next, sync); +} + +/** + * prlist_splice_tail_init_rcu - splice an RCU-protected list into an + * existing list, designed for queues. + * @list: the RCU-protected list to splice + * @head: the place in the existing list to splice the first list into + * @sync: function to sync: synchronize_rcu(), synchronize_sched(), ... + */ +static __always_inline +void prlist_splice_tail_init_rcu(union prlist_head *list, + union prlist_head *head, + void (*sync)(void)) +{ + if (!list_empty(&list->list)) + __prlist_splice_init_rcu(list, head->prev, head, sync); +} + +/** + * prhlist_del_rcu - deletes entry from hash list without re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does not return true after this, + * the entry is in an undefined state. It is useful for RCU based + * lockfree traversal. + * + * In particular, it means that we can not poison the forward + * pointers that may still be used for walking the hash list. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another list-mutation primitive, such as hlist_add_head_rcu() + * or hlist_del_rcu(), running on this same list. 
+ * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * hlist_for_each_entry(). + */ +static __always_inline void prhlist_del_rcu(union prhlist_node *n) +{ + __prhlist_del(n); + prhlist_set_pprev(n, LIST_POISON2); +} + +/** + * prhlist_replace_rcu - replace old entry by new one + * @old : the element to be replaced + * @new : the new element to insert + * + * The @old entry will be replaced with the @new entry atomically. + */ +static __always_inline +void prhlist_replace_rcu(union prhlist_node *old, union prhlist_node *new) +{ + union prhlist_node *next = old->next; + + prhlist_set_next(new, next); + prhlist_set_pprev(new, old->pprev); + wr_rcu_assign_pointer(*(union prhlist_node __rcu **)new->pprev, new); + if (next) + prhlist_set_pprev(new->next, &new->next); + prhlist_set_pprev(old, LIST_POISON2); +} + +/** + * prhlist_add_head_rcu + * @n: the element to add to the hash list. + * @h: the list to add to. + * + * Description: + * Adds the specified element to the specified hlist, + * while permitting racing traversals. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another list-mutation primitive, such as hlist_add_head_rcu() + * or hlist_del_rcu(), running on this same list. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * hlist_for_each_entry_rcu(), used to prevent memory-consistency + * problems on Alpha CPUs. Regardless of the type of CPU, the + * list-traversal primitive must be guarded by rcu_read_lock(). 
+ */ +static __always_inline +void prhlist_add_head_rcu(union prhlist_node *n, union prhlist_head *h) +{ + union prhlist_node *first = h->first; + + prhlist_set_next(n, first); + prhlist_set_pprev(n, &h->first); + wr_rcu_assign_pointer(hlist_first_rcu(&h->head), n); + if (first) + prhlist_set_pprev(first, &n->next); +} + +/** + * prhlist_add_tail_rcu + * @n: the element to add to the hash list. + * @h: the list to add to. + * + * Description: + * Adds the specified element to the specified hlist, + * while permitting racing traversals. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another list-mutation primitive, such as prhlist_add_head_rcu() + * or prhlist_del_rcu(), running on this same list. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * hlist_for_each_entry_rcu(), used to prevent memory-consistency + * problems on Alpha CPUs. Regardless of the type of CPU, the + * list-traversal primitive must be guarded by rcu_read_lock(). + */ +static __always_inline +void prhlist_add_tail_rcu(union prhlist_node *n, union prhlist_head *h) +{ + union prhlist_node *i, *last = NULL; + + /* Note: write side code, so rcu accessors are not needed. */ + for (i = h->first; i; i = i->next) + last = i; + + if (last) { + prhlist_set_next(n, last->next); + prhlist_set_pprev(n, &last->next); + wr_rcu_assign_pointer(hlist_next_rcu(&last->node), n); + } else { + prhlist_add_head_rcu(n, h); + } +} + +/** + * prhlist_add_before_rcu + * @n: the new element to add to the hash list. + * @next: the existing element to add the new element before. + * + * Description: + * Adds the specified element to the specified hlist + * before the specified node while permitting racing traversals. 
+ * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another list-mutation primitive, such as prhlist_add_head_rcu() + * or prhlist_del_rcu(), running on this same list. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * hlist_for_each_entry_rcu(), used to prevent memory-consistency + * problems on Alpha CPUs. + */ +static __always_inline +void prhlist_add_before_rcu(union prhlist_node *n, union prhlist_node *next) +{ + prhlist_set_pprev(n, next->pprev); + prhlist_set_next(n, next); + wr_rcu_assign_pointer(hlist_pprev_rcu(&n->node), n); + prhlist_set_pprev(next, &n->next); +} + +/** + * prhlist_add_behind_rcu + * @n: the new element to add to the hash list. + * @prev: the existing element to add the new element after. + * + * Description: + * Adds the specified element to the specified hlist + * after the specified node while permitting racing traversals. + * + * The caller must take whatever precautions are necessary + * (such as holding appropriate locks) to avoid racing + * with another list-mutation primitive, such as prhlist_add_head_rcu() + * or prhlist_del_rcu(), running on this same list. + * However, it is perfectly legal to run concurrently with + * the _rcu list-traversal primitives, such as + * hlist_for_each_entry_rcu(), used to prevent memory-consistency + * problems on Alpha CPUs. + */ +static __always_inline +void prhlist_add_behind_rcu(union prhlist_node *n, union prhlist_node *prev) +{ + prhlist_set_next(n, prev->next); + prhlist_set_pprev(n, &prev->next); + wr_rcu_assign_pointer(hlist_next_rcu(&prev->node), n); + if (n->next) + prhlist_set_pprev(n->next, &n->next); +} +#endif -- 2.17.1
These test cases cover the basic operations needed to handle both prlist and prhlist data: creating, growing, shrinking and destroying lists. They can also serve as a reference for practical use of write-rare lists. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Kate Stewart <kstewart@linuxfoundation.org> CC: Philippe Ombredanne <pombredanne@nexb.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Edward Cree <ecree@solarflare.com> CC: linux-kernel@vger.kernel.org --- MAINTAINERS | 1 + lib/Kconfig.debug | 9 ++ lib/Makefile | 1 + lib/test_prlist.c | 252 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 263 insertions(+) create mode 100644 lib/test_prlist.c diff --git a/MAINTAINERS b/MAINTAINERS index f5689c014e07..e7f7cb1682a6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9465,6 +9465,7 @@ F: mm/test_write_rare.c F: mm/test_pmalloc.c F: Documentation/core-api/prmem.rst F: include/linux/prlist.h +F: lib/test_prlist.c MEMORY MANAGEMENT L: linux-mm@kvack.org diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 4966c4fbe7f7..40039992f05f 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2034,6 +2034,15 @@ config IO_STRICT_DEVMEM If in doubt, say Y. +config DEBUG_PRLIST_TEST + bool "Testcase for Protected Linked List" + depends on STRICT_KERNEL_RWX && PRMEM + help + This option enables testing of a linked list implementation + based on write-rare memory. + The test cases can also be used as examples of how to use the + prlist data structure(s). 
+ source "arch/$(SRCARCH)/Kconfig.debug" endmenu # Kernel hacking diff --git a/lib/Makefile b/lib/Makefile index 423876446810..fe7200e84c5f 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -270,3 +270,4 @@ obj-$(CONFIG_GENERIC_LIB_LSHRDI3) += lshrdi3.o obj-$(CONFIG_GENERIC_LIB_MULDI3) += muldi3.o obj-$(CONFIG_GENERIC_LIB_CMPDI2) += cmpdi2.o obj-$(CONFIG_GENERIC_LIB_UCMPDI2) += ucmpdi2.o +obj-$(CONFIG_DEBUG_PRLIST_TEST) += test_prlist.o diff --git a/lib/test_prlist.c b/lib/test_prlist.c new file mode 100644 index 000000000000..8ee46795d72a --- /dev/null +++ b/lib/test_prlist.c @@ -0,0 +1,252 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * test_prlist.c: Test cases for protected doubly linked list + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa <igor.stoppa@huawei.com> + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/bug.h> +#include <linux/prlist.h> + + +#ifdef pr_fmt +#undef pr_fmt +#endif + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +static struct pmalloc_pool *pool; + +static PRLIST_HEAD(test_prlist_head); + +/* ---------------------- prlist test functions ---------------------- */ +static bool test_init_prlist_head(void) +{ + if (WARN(test_prlist_head.prev != &test_prlist_head || + test_prlist_head.next != &test_prlist_head, + "static initialization of static prlist_head failed")) + return false; + wr_ptr(&test_prlist_head.next, NULL); + wr_ptr(&test_prlist_head.prev, NULL); + if (WARN(test_prlist_head.prev || test_prlist_head.next, + "resetting of static prlist_head failed")) + return false; + INIT_PRLIST_HEAD(&test_prlist_head); + if (WARN(test_prlist_head.prev != &test_prlist_head || + test_prlist_head.next != &test_prlist_head, + "initialization of static prlist_head failed")) + return false; + pr_info("initialization of static prlist_head passed"); + return true; +} + +struct prlist_data { + int d_int; + union prlist_head node; + unsigned long long d_ulonglong; +}; + + 
+#define LIST_INTERVAL 5 +#define LIST_INTERVALS 3 +#define LIST_NODES (LIST_INTERVALS * LIST_INTERVAL) +static bool test_build_prlist(void) +{ + short i; + struct prlist_data *data; + int delta; + + pool = prlist_create_pool(); + if (WARN(!pool, "could not create pool")) + return false; + + for (i = 0; i < LIST_NODES; i++) { + data = (struct prlist_data *)pmalloc(pool, sizeof(*data)); + if (WARN(!data, "Failed to allocate prlist node")) + goto out; + wr_int(&data->d_int, i); + wr_ulonglong(&data->d_ulonglong, i); + prlist_add_tail(&data->node, &test_prlist_head); + } + for (i = 1; i < LIST_NODES; i++) { + data = (struct prlist_data *)pmalloc(pool, sizeof(*data)); + if (WARN(!data, "Failed to allocate prlist node")) + goto out; + wr_int(&data->d_int, i); + wr_ulonglong(&data->d_ulonglong, i); + prlist_add(&data->node, &test_prlist_head); + } + i = LIST_NODES; + delta = -1; + list_for_each_entry(data, &test_prlist_head, node) { + i += delta; + if (!i) + delta = 1; + if (WARN(data->d_int != i || data->d_ulonglong != i, + "unexpected value in prlist, build test failed")) + goto out; + } + pr_info("build prlist test passed"); + return true; +out: + pmalloc_destroy_pool(pool); + return false; +} + +static bool test_teardown_prlist(void) +{ + short i; + + for (i = 0; !list_empty(&test_prlist_head.list); i++) + prlist_del(test_prlist_head.next); + if (WARN(i != LIST_NODES * 2 - 1, "teardown prlist test failed")) + return false; + pmalloc_destroy_pool(pool); + pr_info("teardown prlist test passed"); + return true; +} + +static bool test_prlist(void) +{ + if (WARN(!(test_init_prlist_head() && + test_build_prlist() && + test_teardown_prlist()), + "prlist test failed")) + return false; + pr_info("prlist test passed"); + return true; +} + +/* ---------------------- prhlist test functions ---------------------- */ +static PRHLIST_HEAD(test_prhlist_head); + +static bool test_init_prhlist_head(void) +{ + if (WARN(test_prhlist_head.first, + "static initialization of static 
prhlist_head failed")) + return false; + wr_ptr(&test_prhlist_head.first, (void *)-1); + if (WARN(!test_prhlist_head.first, + "resetting of static prhlist_head failed")) + return false; + INIT_PRHLIST_HEAD(&test_prhlist_head); + if (WARN(test_prhlist_head.first, + "initialization of static prhlist_head failed")) + return false; + pr_info("initialization of static prhlist_head passed"); + return true; +} + +struct prhlist_data { + int d_int; + union prhlist_node node; + unsigned long long d_ulonglong; +}; + +static bool test_build_prhlist(void) +{ + short i; + struct prhlist_data *data; + union prhlist_node *anchor; + + pool = prhlist_create_pool(); + if (WARN(!pool, "could not create pool")) + return false; + + for (i = 2 * LIST_INTERVAL - 1; i >= LIST_INTERVAL; i--) { + data = (struct prhlist_data *)pmalloc(pool, sizeof(*data)); + if (WARN(!data, "Failed to allocate prhlist node")) + goto out; + wr_int(&data->d_int, i); + wr_ulonglong(&data->d_ulonglong, i); + prhlist_add_head(&data->node, &test_prhlist_head); + } + anchor = test_prhlist_head.first; + for (i = 0; i < LIST_INTERVAL; i++) { + data = (struct prhlist_data *)pmalloc(pool, sizeof(*data)); + if (WARN(!data, "Failed to allocate prhlist node")) + goto out; + wr_int(&data->d_int, i); + wr_ulonglong(&data->d_ulonglong, i); + prhlist_add_before(&data->node, anchor); + } + hlist_for_each_entry(data, &test_prhlist_head, node) + if (!data->node.next) + anchor = &data->node; + for (i = 3 * LIST_INTERVAL - 1; i >= 2 * LIST_INTERVAL; i--) { + data = (struct prhlist_data *)pmalloc(pool, sizeof(*data)); + if (WARN(!data, "Failed to allocate prhlist node")) + goto out; + wr_int(&data->d_int, i); + wr_ulonglong(&data->d_ulonglong, i); + prhlist_add_behind(&data->node, anchor); + } + i = 0; + hlist_for_each_entry(data, &test_prhlist_head, node) { + if (WARN(data->d_int != i || data->d_ulonglong != i, + "unexpected value in prhlist, build test failed")) + goto out; + i++; + } + if (WARN(i != LIST_NODES, + "wrong number of 
nodes: %d, expected %d", i, LIST_NODES)) + goto out; + pr_info("build prhlist test passed"); + return true; +out: + pmalloc_destroy_pool(pool); + return false; +} + +static bool test_teardown_prhlist(void) +{ + union prhlist_node **pnode; + bool retval = false; + + for (pnode = &test_prhlist_head.first->next; *pnode;) { + if (WARN(*(*pnode)->pprev != *pnode, + "inconsistent pprev value, delete test failed")) + goto err; + prhlist_del(*pnode); + } + prhlist_del(test_prhlist_head.first); + if (WARN(!hlist_empty(&test_prhlist_head.head), + "prhlist is not empty, delete test failed")) + goto err; + pr_info("deletion of prhlist passed"); + retval = true; +err: + pmalloc_destroy_pool(pool); + return retval; +} + +static bool test_prhlist(void) +{ + if (WARN(!(test_init_prhlist_head() && + test_build_prhlist() && + test_teardown_prhlist()), + "prhlist test failed")) + return false; + pr_info("prhlist test passed"); + return true; +} + +static int __init test_prlists_init_module(void) +{ + if (WARN(!(test_prlist() && + test_prhlist()), + "protected lists test failed")) + return -EFAULT; + pr_info("protected lists test passed"); + return 0; +} + +module_init(test_prlists_init_module); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Igor Stoppa <igor.stoppa@huawei.com>"); +MODULE_DESCRIPTION("Test module for protected doubly linked list."); -- 2.17.1
Minimal functionality providing a write-rare version of atomic_long_t data. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Will Deacon <will.deacon@arm.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Boqun Feng <boqun.feng@gmail.com> CC: Arnd Bergmann <arnd@arndb.de> CC: linux-arch@vger.kernel.org CC: linux-kernel@vger.kernel.org --- MAINTAINERS | 1 + include/linux/pratomic-long.h | 73 +++++++++++++++++++++++++++++++++++ 2 files changed, 74 insertions(+) create mode 100644 include/linux/pratomic-long.h diff --git a/MAINTAINERS b/MAINTAINERS index e7f7cb1682a6..9d72688d00a3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9466,6 +9466,7 @@ F: mm/test_pmalloc.c F: Documentation/core-api/prmem.rst F: include/linux/prlist.h F: lib/test_prlist.c +F: include/linux/pratomic-long.h MEMORY MANAGEMENT L: linux-mm@kvack.org diff --git a/include/linux/pratomic-long.h b/include/linux/pratomic-long.h new file mode 100644 index 000000000000..8f1408593733 --- /dev/null +++ b/include/linux/pratomic-long.h @@ -0,0 +1,73 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Atomic operations for write rare memory */ +#ifndef _LINUX_PRATOMIC_LONG_H +#define _LINUX_PRATOMIC_LONG_H +#include <linux/prmem.h> +#include <linux/compiler.h> +#include <asm-generic/atomic-long.h> + +struct pratomic_long_t { + atomic_long_t l __aligned(sizeof(atomic_long_t)); +} __aligned(sizeof(atomic_long_t)); + +#define PRATOMIC_LONG_INIT(i) { \ + .l = ATOMIC_LONG_INIT((i)), \ +} + +static __always_inline +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l) +{ + struct page *page; + uintptr_t base; + uintptr_t offset; + unsigned long flags; + size_t size = sizeof(*l); + bool is_virt = __is_wr_after_init(l, size); + + if (WARN(!(is_virt || likely(__is_wr_pool(l, size))), + WR_ERR_RANGE_MSG)) + return false; + local_irq_save(flags); + if (is_virt) + page = virt_to_page(l); + else + page = vmalloc_to_page(l); + offset = (~PAGE_MASK) & (uintptr_t)l; + base = (uintptr_t)vmap(&page, 1, VM_MAP, 
PAGE_KERNEL); + if (WARN(!base, WR_ERR_PAGE_MSG)) { + local_irq_restore(flags); + return false; + } + if (inc) + atomic_long_inc((atomic_long_t *)(base + offset)); + else + atomic_long_dec((atomic_long_t *)(base + offset)); + vunmap((void *)base); + local_irq_restore(flags); + return true; + +} + +/** + * pratomic_long_inc - atomic increment of rare write long + * @l: address of the variable of type struct pratomic_long_t + * + * Return: true on success, false otherwise + */ +static __always_inline bool pratomic_long_inc(struct pratomic_long_t *l) +{ + return __pratomic_long_op(true, l); +} + +/** + * pratomic_long_dec - atomic decrement of rare write long + * @l: address of the variable of type struct pratomic_long_t + * + * Return: true on success, false otherwise + */ +static __always_inline bool pratomic_long_dec(struct pratomic_long_t *l) +{ + return __pratomic_long_op(false, l); +} + +#endif -- 2.17.1
The measurement list is moved to write rare memory, including related data structures. Various boilerplate linux data structures and related functions are replaced by their write-rare counterpart. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> CC: Mimi Zohar <zohar@linux.vnet.ibm.com> CC: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> CC: James Morris <jmorris@namei.org> CC: "Serge E. Hallyn" <serge@hallyn.com> CC: linux-integrity@vger.kernel.org CC: linux-kernel@vger.kernel.org --- security/integrity/ima/ima.h | 18 ++++++++------ security/integrity/ima/ima_api.c | 29 +++++++++++++---------- security/integrity/ima/ima_fs.c | 12 +++++----- security/integrity/ima/ima_main.c | 6 +++++ security/integrity/ima/ima_queue.c | 28 +++++++++++++--------- security/integrity/ima/ima_template.c | 14 ++++++----- security/integrity/ima/ima_template_lib.c | 14 +++++++---- 7 files changed, 74 insertions(+), 47 deletions(-) diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h index 67db9d9454ca..5f5959753bf5 100644 --- a/security/integrity/ima/ima.h +++ b/security/integrity/ima/ima.h @@ -24,6 +24,8 @@ #include <linux/hash.h> #include <linux/tpm.h> #include <linux/audit.h> +#include <linux/prlist.h> +#include <linux/pratomic-long.h> #include <crypto/hash_info.h> #include "../integrity.h" @@ -84,7 +86,7 @@ struct ima_template_field { /* IMA template descriptor definition */ struct ima_template_desc { - struct list_head list; + union prlist_head list; char *name; char *fmt; int num_fields; @@ -100,11 +102,13 @@ struct ima_template_entry { }; struct ima_queue_entry { - struct hlist_node hnext; /* place in hash collision list */ - struct list_head later; /* place in ima_measurements list */ + union prhlist_node hnext; /* place in hash collision list */ + union prlist_head later; /* place in ima_measurements list */ struct ima_template_entry *entry; }; -extern struct list_head ima_measurements; /* list of all measurements */ + +/* list of all measurements */ +extern 
union prlist_head ima_measurements __wr_after_init; /* Some details preceding the binary serialized measurement list */ struct ima_kexec_hdr { @@ -160,9 +164,9 @@ void ima_init_template_list(void); extern spinlock_t ima_queue_lock; struct ima_h_table { - atomic_long_t len; /* number of stored measurements in the list */ - atomic_long_t violations; - struct hlist_head queue[IMA_MEASURE_HTABLE_SIZE]; + struct pratomic_long_t len; /* # of measurements in the list */ + struct pratomic_long_t violations; + union prhlist_head queue[IMA_MEASURE_HTABLE_SIZE]; }; extern struct ima_h_table ima_htable; diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c index a02c5acfd403..4fc28c2478b0 100644 --- a/security/integrity/ima/ima_api.c +++ b/security/integrity/ima/ima_api.c @@ -19,9 +19,12 @@ #include <linux/xattr.h> #include <linux/evm.h> #include <linux/iversion.h> +#include <linux/prmemextra.h> +#include <linux/pratomic-long.h> #include "ima.h" +extern struct pmalloc_pool ima_pool; /* * ima_free_template_entry - free an existing template entry */ @@ -29,10 +32,10 @@ void ima_free_template_entry(struct ima_template_entry *entry) { int i; - for (i = 0; i < entry->template_desc->num_fields; i++) - kfree(entry->template_data[i].data); +// for (i = 0; i < entry->template_desc->num_fields; i++) +// kfree(entry->template_data[i].data); - kfree(entry); +// kfree(entry); } /* @@ -44,12 +47,13 @@ int ima_alloc_init_template(struct ima_event_data *event_data, struct ima_template_desc *template_desc = ima_template_desc_current(); int i, result = 0; - *entry = kzalloc(sizeof(**entry) + template_desc->num_fields * - sizeof(struct ima_field_data), GFP_NOFS); + *entry = pzalloc(&ima_pool, + sizeof(**entry) + template_desc->num_fields * + sizeof(struct ima_field_data)); if (!*entry) return -ENOMEM; - (*entry)->template_desc = template_desc; + wr_ptr(&((*entry)->template_desc), template_desc); for (i = 0; i < template_desc->num_fields; i++) { struct 
ima_template_field *field = template_desc->fields[i]; u32 len; @@ -59,9 +63,10 @@ int ima_alloc_init_template(struct ima_event_data *event_data, if (result != 0) goto out; - len = (*entry)->template_data[i].len; - (*entry)->template_data_len += sizeof(len); - (*entry)->template_data_len += len; + len = (*entry)->template_data_len + sizeof(len) + + (*entry)->template_data[i].len; + wr_memcpy(&(*entry)->template_data_len, &len, + sizeof(len)); } return 0; out: @@ -113,9 +118,9 @@ int ima_store_template(struct ima_template_entry *entry, audit_cause, result, 0); return result; } - memcpy(entry->digest, hash.hdr.digest, hash.hdr.length); + wr_memcpy(entry->digest, hash.hdr.digest, hash.hdr.length); } - entry->pcr = pcr; + wr_int(&entry->pcr, pcr); result = ima_add_template_entry(entry, violation, op, inode, filename); return result; } @@ -139,7 +144,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename, int result; /* can overflow, only indicator */ - atomic_long_inc(&ima_htable.violations); + pratomic_long_inc(&ima_htable.violations); result = ima_alloc_init_template(&event_data, &entry); if (result < 0) { diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c index ae9d5c766a3c..ab20da1161c7 100644 --- a/security/integrity/ima/ima_fs.c +++ b/security/integrity/ima/ima_fs.c @@ -57,7 +57,8 @@ static ssize_t ima_show_htable_violations(struct file *filp, char __user *buf, size_t count, loff_t *ppos) { - return ima_show_htable_value(buf, count, ppos, &ima_htable.violations); + return ima_show_htable_value(buf, count, ppos, + &ima_htable.violations.l); } static const struct file_operations ima_htable_violations_ops = { @@ -69,8 +70,7 @@ static ssize_t ima_show_measurements_count(struct file *filp, char __user *buf, size_t count, loff_t *ppos) { - return ima_show_htable_value(buf, count, ppos, &ima_htable.len); - + return ima_show_htable_value(buf, count, ppos, &ima_htable.len.l); } static const struct file_operations 
ima_measurements_count_ops = { @@ -86,7 +86,7 @@ static void *ima_measurements_start(struct seq_file *m, loff_t *pos) /* we need a lock since pos could point beyond last element */ rcu_read_lock(); - list_for_each_entry_rcu(qe, &ima_measurements, later) { + list_for_each_entry_rcu(qe, &ima_measurements.list, later.list) { if (!l--) { rcu_read_unlock(); return qe; @@ -303,7 +303,7 @@ static ssize_t ima_read_policy(char *path) size -= rc; } - vfree(data); +// vfree(data); if (rc < 0) return rc; else if (size) @@ -350,7 +350,7 @@ static ssize_t ima_write_policy(struct file *file, const char __user *buf, } mutex_unlock(&ima_write_mutex); out_free: - kfree(data); +// kfree(data); out: if (result < 0) valid_policy = 0; diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c index 2d31921fbda4..d52e59006781 100644 --- a/security/integrity/ima/ima_main.c +++ b/security/integrity/ima/ima_main.c @@ -29,6 +29,7 @@ #include <linux/ima.h> #include <linux/iversion.h> #include <linux/fs.h> +#include <linux/prlist.h> #include "ima.h" @@ -536,10 +537,15 @@ int ima_load_data(enum kernel_load_data_id id) return 0; } +struct pmalloc_pool ima_pool; + +#define IMA_POOL_ALLOC_CHUNK (16 * PAGE_SIZE) static int __init init_ima(void) { int error; + pmalloc_init_custom_pool(&ima_pool, IMA_POOL_ALLOC_CHUNK, 3, + PMALLOC_MODE_START_WR); ima_init_template_list(); hash_setup(CONFIG_IMA_DEFAULT_HASH); error = ima_init(); diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c index b186819bd5aa..444c47b745d8 100644 --- a/security/integrity/ima/ima_queue.c +++ b/security/integrity/ima/ima_queue.c @@ -24,11 +24,14 @@ #include <linux/module.h> #include <linux/rculist.h> #include <linux/slab.h> +#include <linux/prmemextra.h> +#include <linux/prlist.h> +#include <linux/pratomic-long.h> #include "ima.h" #define AUDIT_CAUSE_LEN_MAX 32 -LIST_HEAD(ima_measurements); /* list of all measurements */ +PRLIST_HEAD(ima_measurements); /* list of all 
measurements */ #ifdef CONFIG_IMA_KEXEC static unsigned long binary_runtime_size; #else @@ -36,9 +39,9 @@ static unsigned long binary_runtime_size = ULONG_MAX; #endif /* key: inode (before secure-hashing a file) */ -struct ima_h_table ima_htable = { - .len = ATOMIC_LONG_INIT(0), - .violations = ATOMIC_LONG_INIT(0), +struct ima_h_table ima_htable __wr_after_init = { + .len = PRATOMIC_LONG_INIT(0), + .violations = PRATOMIC_LONG_INIT(0), .queue[0 ... IMA_MEASURE_HTABLE_SIZE - 1] = HLIST_HEAD_INIT }; @@ -58,7 +61,7 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 *digest_value, key = ima_hash_key(digest_value); rcu_read_lock(); - hlist_for_each_entry_rcu(qe, &ima_htable.queue[key], hnext) { + hlist_for_each_entry_rcu(qe, &ima_htable.queue[key], hnext.node) { rc = memcmp(qe->entry->digest, digest_value, TPM_DIGEST_SIZE); if ((rc == 0) && (qe->entry->pcr == pcr)) { ret = qe; @@ -87,6 +90,8 @@ static int get_binary_runtime_size(struct ima_template_entry *entry) return size; } +extern struct pmalloc_pool ima_pool; + /* ima_add_template_entry helper function: * - Add template entry to the measurement list and hash table, for * all entries except those carried across kexec. 
@@ -99,20 +104,21 @@ static int ima_add_digest_entry(struct ima_template_entry *entry, struct ima_queue_entry *qe; unsigned int key; - qe = kmalloc(sizeof(*qe), GFP_KERNEL); + qe = pmalloc(&ima_pool, sizeof(*qe)); if (qe == NULL) { pr_err("OUT OF MEMORY ERROR creating queue entry\n"); return -ENOMEM; } - qe->entry = entry; + wr_ptr(&qe->entry, entry); + INIT_PRLIST_HEAD(&qe->later); + prlist_add_tail_rcu(&qe->later, &ima_measurements); + - INIT_LIST_HEAD(&qe->later); - list_add_tail_rcu(&qe->later, &ima_measurements); + pratomic_long_inc(&ima_htable.len); - atomic_long_inc(&ima_htable.len); if (update_htable) { key = ima_hash_key(entry->digest); - hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]); + prhlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]); } if (binary_runtime_size != ULONG_MAX) { diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c index 30db39b23804..40ae57a17d89 100644 --- a/security/integrity/ima/ima_template.c +++ b/security/integrity/ima/ima_template.c @@ -22,14 +22,15 @@ enum header_fields { HDR_PCR, HDR_DIGEST, HDR_TEMPLATE_NAME, HDR_TEMPLATE_DATA, HDR__LAST }; -static struct ima_template_desc builtin_templates[] = { +static struct ima_template_desc builtin_templates[] __wr_after_init = { {.name = IMA_TEMPLATE_IMA_NAME, .fmt = IMA_TEMPLATE_IMA_FMT}, {.name = "ima-ng", .fmt = "d-ng|n-ng"}, {.name = "ima-sig", .fmt = "d-ng|n-ng|sig"}, {.name = "", .fmt = ""}, /* placeholder for a custom format */ }; -static LIST_HEAD(defined_templates); +static PRLIST_HEAD(defined_templates); + static DEFINE_SPINLOCK(template_list); static struct ima_template_field supported_fields[] = { @@ -114,7 +115,8 @@ static struct ima_template_desc *lookup_template_desc(const char *name) int found = 0; rcu_read_lock(); - list_for_each_entry_rcu(template_desc, &defined_templates, list) { + list_for_each_entry_rcu(template_desc, &defined_templates.list, + list.list) { if ((strcmp(template_desc->name, name) == 0) || 
(strcmp(template_desc->fmt, name) == 0)) { found = 1; @@ -207,12 +209,12 @@ void ima_init_template_list(void) { int i; - if (!list_empty(&defined_templates)) + if (!list_empty(&defined_templates.list)) return; spin_lock(&template_list); for (i = 0; i < ARRAY_SIZE(builtin_templates); i++) { - list_add_tail_rcu(&builtin_templates[i].list, + prlist_add_tail_rcu(&builtin_templates[i].list, &defined_templates); } spin_unlock(&template_list); @@ -266,7 +268,7 @@ static struct ima_template_desc *restore_template_fmt(char *template_name) goto out; spin_lock(&template_list); - list_add_tail_rcu(&template_desc->list, &defined_templates); + prlist_add_tail_rcu(&template_desc->list, &defined_templates); spin_unlock(&template_list); out: return template_desc; diff --git a/security/integrity/ima/ima_template_lib.c b/security/integrity/ima/ima_template_lib.c index 43752002c222..a6d10eabf0e5 100644 --- a/security/integrity/ima/ima_template_lib.c +++ b/security/integrity/ima/ima_template_lib.c @@ -15,8 +15,12 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include <linux/printk.h> +#include <linux/prmemextra.h> #include "ima_template_lib.h" +extern struct pmalloc_pool ima_pool; + static bool ima_template_hash_algo_allowed(u8 algo) { if (algo == HASH_ALGO_SHA1 || algo == HASH_ALGO_MD5) @@ -42,11 +46,11 @@ static int ima_write_template_field_data(const void *data, const u32 datalen, if (datafmt == DATA_FMT_STRING) buflen = datalen + 1; - buf = kzalloc(buflen, GFP_KERNEL); + buf = pzalloc(&ima_pool, buflen); if (!buf) return -ENOMEM; - memcpy(buf, data, datalen); + wr_memcpy(buf, data, datalen); /* * Replace all space characters with underscore for event names and @@ -58,11 +62,11 @@ static int ima_write_template_field_data(const void *data, const u32 datalen, if (datafmt == DATA_FMT_STRING) { for (buf_ptr = buf; buf_ptr - buf < datalen; buf_ptr++) if (*buf_ptr == ' ') - *buf_ptr = '_'; + wr_char(buf_ptr, '_'); } - field_data->data = buf; - field_data->len = buflen; + 
wr_ptr(&field_data->data, buf); + wr_memcpy(&field_data->len, &buflen, sizeof(buflen)); return 0; } -- 2.17.1
On Wed, Oct 24, 2018 at 12:34:55AM +0300, Igor Stoppa wrote: > The connection between each page and its vmap_area avoids more expensive > searches through the btree of vmap_areas. Typo -- it's an rbtree. > +++ b/include/linux/mm_types.h > @@ -87,13 +87,24 @@ struct page { > /* See page-flags.h for PAGE_MAPPING_FLAGS */ > struct address_space *mapping; > pgoff_t index; /* Our offset within mapping. */ > - /** > - * @private: Mapping-private opaque data. > - * Usually used for buffer_heads if PagePrivate. > - * Used for swp_entry_t if PageSwapCache. > - * Indicates order in the buddy system if PageBuddy. > - */ > - unsigned long private; > + union { > + /** > + * @private: Mapping-private opaque data. > + * Usually used for buffer_heads if > + * PagePrivate. > + * Used for swp_entry_t if PageSwapCache. > + * Indicates order in the buddy system if > + * PageBuddy. > + */ > + unsigned long private; > + /** > + * @area: reference to the containing area > + * For pages that are mapped into a virtually > + * contiguous area, avoids performing a more > + * expensive lookup. > + */ > + struct vmap_area *area; > + }; Not like this. Make it part of a different struct in the existing union, not a part of the pagecache struct. And there's no need to use ->private explicitly. > @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > if (!addr) > return NULL; > > + va = __find_vmap_area((unsigned long)addr); > + for (i = 0; i < va->vm->nr_pages; i++) > + va->vm->pages[i]->area = va; I don't like it that you're calling this for _every_ vmalloc() caller when most of them will never use this. Perhaps have page->va be initially NULL and then cache the lookup in it when it's accessed for the first time.
On 10/23/18 2:34 PM, Igor Stoppa wrote:
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 9a7b8b049d04..57de5b3c0bae 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -94,3 +94,12 @@ config DEBUG_RODATA_TEST
> depends on STRICT_KERNEL_RWX
> ---help---
> This option enables a testcase for the setting rodata read-only.
> +
> +config DEBUG_PRMEM_TEST
> + tristate "Run self test for protected memory"
> + depends on STRICT_KERNEL_RWX
> + select PRMEM
> + default n
> + help
> + Tries to verify that the memory protection works correctly and that
> + the memory is effectively protected.
Hi,
a. It seems backwards (or upside down) to have a test case select a feature (PRMEM)
instead of depending on that feature.
b. Since PRMEM depends on MMU (in patch 04/17), the "select" here could try to
enable PRMEM even when MMU is not enabled.
Changing this to "depends on PRMEM" would solve both of these issues.
c. Don't use "default n". That is already the default.
thanks,
--
~Randy
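Folding in these review comments, the entry could be rewritten along these lines (sketch; whether the explicit STRICT_KERNEL_RWX dependency is still needed depends on how PRMEM itself ends up being defined):

```kconfig
config DEBUG_PRMEM_TEST
	tristate "Run self test for protected memory"
	depends on STRICT_KERNEL_RWX && PRMEM
	help
	  Tries to verify that the memory protection works correctly and that
	  the memory is effectively protected.
```

With "depends on PRMEM" the test can only be enabled where PRMEM (and therefore MMU) is available, and dropping "default n" removes the redundant line.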
Hi, On 10/23/18 2:34 PM, Igor Stoppa wrote: > Documentation for protected memory. > > Topics covered: > * static memory allocation > * dynamic memory allocation > * write-rare > > Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> > CC: Jonathan Corbet <corbet@lwn.net> > CC: Randy Dunlap <rdunlap@infradead.org> > CC: Mike Rapoport <rppt@linux.vnet.ibm.com> > CC: linux-doc@vger.kernel.org > CC: linux-kernel@vger.kernel.org > --- > Documentation/core-api/index.rst | 1 + > Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++ > MAINTAINERS | 1 + > 3 files changed, 174 insertions(+) > create mode 100644 Documentation/core-api/prmem.rst > diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst > new file mode 100644 > index 000000000000..16d7edfe327a > --- /dev/null > +++ b/Documentation/core-api/prmem.rst > @@ -0,0 +1,172 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +.. _prmem: > + > +Memory Protection > +================= > + > +:Date: October 2018 > +:Author: Igor Stoppa <igor.stoppa@huawei.com> > + > +Foreword > +-------- > +- In a typical system using some sort of RAM as execution environment, > + **all** memory is initially writable. > + > +- It must be initialized with the appropriate content, be it code or data. > + > +- Said content typically undergoes modifications, i.e. relocations or > + relocation-induced changes. > + > +- The present document doesn't address such transient. transience. > + > +- Kernel code is protected at system level and, unlike data, it doesn't > + require special attention. > + > +Protection mechanism > +-------------------- > + > +- When available, the MMU can write protect memory pages that would be > + otherwise writable. > + > +- The protection has page-level granularity. > + > +- An attempt to overwrite a protected page will trigger an exception. 
> +- **Write protected data must go exclusively to write protected pages** > +- **Writable data must go exclusively to writable pages** > + > +Available protections for kernel data > +------------------------------------- > + > +- **constant** > + Labelled as **const**, the data is never supposed to be altered. > + It is statically allocated - if it has any memory footprint at all. > + The compiler can even optimize it away, where possible, by replacing > + references to a **const** with its actual value. > + > +- **read only after init** > + By tagging an otherwise ordinary statically allocated variable with > + **__ro_after_init**, it is placed in a special segment that will > + become write protected, at the end of the kernel init phase. > + The compiler has no notion of this restriction and it will treat any > + write operation on such variable as legal. However, assignments that > + are attempted after the write protection is in place, will cause no comma. > + exceptions. > + > +- **write rare after init** > + This can be seen as variant of read only after init, which uses the > + tag **__wr_after_init**. It is also limited to statically allocated > + memory. It is still possible to alter this type of variables, after > + the kernel init phase is complete, however it can be done exclusively > + with special functions, instead of the assignment operator. Using the > + assignment operator after conclusion of the init phase will still > + trigger an exception. It is not possible to transition a certain > + variable from __wr_ater_init to a permanent read-only status, at > + runtime. > + > +- **dynamically allocated write-rare / read-only** > + After defining a pool, memory can be obtained through it, primarily > + through the **pmalloc()** allocator. The exact writability state of the > + memory obtained from **pmalloc()** and friends can be configured when > + creating the pool. 
At any point it is possible to transition the memory > + currently associated with the pool to a less permissive write status. > + Once memory has become read-only, the only valid operation, besides > + reading, is to release it, by destroying the pool it belongs to. > + > + > +Protecting dynamically allocated memory > +--------------------------------------- > + > +When dealing with dynamically allocated memory, three options are > + available for configuring its writability state: > + > +- **Options selected when creating a pool** > + When creating the pool, it is possible to choose one of the following: > + - **PMALLOC_MODE_RO** > + - Writability at allocation time: *WRITABLE* > + - Writability at protection time: *NONE* > + - **PMALLOC_MODE_WR** > + - Writability at allocation time: *WRITABLE* > + - Writability at protection time: *WRITE-RARE* > + - **PMALLOC_MODE_AUTO_RO** > + - Writability at allocation time: > + - the latest allocation: *WRITABLE* > + - every other allocation: *NONE* > + - Writability at protection time: *NONE* > + - **PMALLOC_MODE_AUTO_WR** > + - Writability at allocation time: > + - the latest allocation: *WRITABLE* > + - every other allocation: *WRITE-RARE* > + - Writability at protection time: *WRITE-RARE* > + - **PMALLOC_MODE_START_WR** > + - Writability at allocation time: *WRITE-RARE* > + - Writability at protection time: *WRITE-RARE* > + > + **Remarks:** > + - The "AUTO" modes perform automatic protection of the content, whenever > + the current vmap_area is used up and a new one is allocated. > + - At that point, the vmap_area being phased out is protected. > + - The size of the vmap_area depends on various parameters. > + - It might not be possible to know for sure *when* certain data will > + be protected. > + - The functionality is provided as a tradeoff between hardening and speed. > + - Its usefulness depends on the specific use case at hand end above sentence with a period, please, like all of the others above it. 
> + - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed. Please try to keep the line above and a few below to < 80 characters in length. (because some of us read rst files as text files, with a text editor, and line wrap is ugly) > + > +- **Protecting the pool** > + This is achieved with **pmalloc_protect_pool()** > + - Any vmap_area currently in the pool is write-protected according to its initial configuration. > + - Any residual space still available from the current vmap_area is lost, as the area is protected. > + - **protecting a pool after every allocation will likely be very wasteful** > + - Using PMALLOC_MODE_START_WR is likely a better choice. > + > +- **Upgrading the protection level** > + This is achieved with **pmalloc_make_pool_ro()** > + - it turns the present content of a write-rare pool into read-only > + - can be useful when the content of the memory has settled > + > + > +Caveats > +------- > +- Freeing of memory is not supported. Pages will be returned to the > + system upon destruction of their memory pool. > + > +- The address range available for vmalloc (and thus for pmalloc too) is > + limited, on 32-bit systems. However it shouldn't be an issue, since not > + much data is expected to be dynamically allocated and made > + write-protected. > + > +- Regarding SMP systems, changing the state of pages and altering mappings > + requires performing cross-processor synchronizations of page tables. > + This is an additional reason for limiting the use of write rare. > + > +- Not only must the pmalloc memory be protected, but also any reference to > + it that might become the target for an attack. The attack would replace > + a reference to the protected memory with a reference to some other, > + unprotected, memory. > + > +- The users of rare write must take care of ensuring the atomicity of the s/rare write/write rare/ ? 
> + action, respect to the way they use the data being altered; for example, This .. "respect to the way" is awkward, but I don't know what to change it to. > + take a lock before making a copy of the value to modify (if it's > + relevant), then alter it, issue the call to rare write and finally > + release the lock. Some special scenario might be exempt from the need > + for locking, but in general rare-write must be treated as an operation It seemed to me that "write-rare" (or write rare) was the going name, but now it's being called "rare write" (or rare-write). Just be consistent, please. > + that can incur into races. > + > +- pmalloc relies on virtual memory areas and will therefore use more > + tlb entries. It still does a better job of it, compared to invoking TLB > + vmalloc for each allocation, but it is undeniably less optimized wrt to s/wrt/with respect to/ > + TLB use than using the physmap directly, through kmalloc or similar. > + > + > +Utilization > +----------- > + > +**add examples here** > + > +API > +--- > + > +.. kernel-doc:: include/linux/prmem.h > +.. kernel-doc:: mm/prmem.c > +.. kernel-doc:: include/linux/prmemextra.h Thanks for the documentation. -- ~Randy
----- On Oct 23, 2018, at 10:35 PM, Igor Stoppa igor.stoppa@gmail.com wrote: > In some cases, all the data needing protection can be allocated from a pool > in one go, as directly writable, then initialized and protected. > The sequence is relatively short and it's acceptable to leave the entire > data set unprotected. > > In other cases, this is not possible, because the data will trickle over > a relatively long period of time, in an unpredictable way, possibly for > the entire duration of the operations. > > For these cases, the safe approach is to have the memory already write > protected, when allocated. However, this will require replacing any > direct assignment with calls to functions that can perform write rare. > > Since lists are one of the most commonly used data structures in the kernel, > they are the first candidate for receiving write rare extensions. > > This patch implements basic functionality for altering said lists. I could not find a description of the overall context of this patch (e.g. a patch 00/17?) that would explain the attack vectors this aims to protect against. This might help figure out whether this added complexity in the core kernel is worth it. Also, is it the right approach to duplicate existing APIs, or should we rather hook into page fault handlers and let the kernel do those "shadow" mappings under the hood? Adding a new GFP flag for dynamic allocation, and a macro mapping to a section attribute might suffice for allocation or definition of such mostly-read-only/seldom-updated data. Thanks, Mathieu > > Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> > CC: Thomas Gleixner <tglx@linutronix.de> > CC: Kate Stewart <kstewart@linuxfoundation.org> > CC: "David S. Miller" <davem@davemloft.net> > CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > CC: Philippe Ombredanne <pombredanne@nexb.com> > CC: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> > CC: Josh Triplett <josh@joshtriplett.org> > CC: Steven Rostedt <rostedt@goodmis.org> > CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > CC: Lai Jiangshan <jiangshanlai@gmail.com> > CC: linux-kernel@vger.kernel.org > --- > MAINTAINERS | 1 + > include/linux/prlist.h | 934 +++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 935 insertions(+) > create mode 100644 include/linux/prlist.h > > diff --git a/MAINTAINERS b/MAINTAINERS > index 246b1a1cc8bb..f5689c014e07 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -9464,6 +9464,7 @@ F: mm/prmem.c > F: mm/test_write_rare.c > F: mm/test_pmalloc.c > F: Documentation/core-api/prmem.rst > +F: include/linux/prlist.h > > MEMORY MANAGEMENT > L: linux-mm@kvack.org > diff --git a/include/linux/prlist.h b/include/linux/prlist.h > new file mode 100644 > index 000000000000..0387c78f8be8 > --- /dev/null > +++ b/include/linux/prlist.h > @@ -0,0 +1,934 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * prlist.h: Header for Protected Lists > + * > + * (C) Copyright 2018 Huawei Technologies Co. Ltd. > + * Author: Igor Stoppa <igor.stoppa@huawei.com> > + * > + * Code from <linux/list.h> and <linux/rculist.h>, adapted to perform > + * writes on write-rare data. > + * These functions and macros rely on data structures that allow the reuse > + * of what is already provided for reading the content of their non-write > + * rare variant. 
> + */ > + > +#ifndef _LINUX_PRLIST_H > +#define _LINUX_PRLIST_H > + > +#include <linux/list.h> > +#include <linux/kernel.h> > +#include <linux/prmemextra.h> > + > +/* --------------- Circular Protected Doubly Linked List --------------- */ > +union prlist_head { > + struct list_head list __aligned(sizeof(void *)); > + struct { > + union prlist_head *next __aligned(sizeof(void *)); > + union prlist_head *prev __aligned(sizeof(void *)); > + } __no_randomize_layout; > +} __aligned(sizeof(void *)); > + > +static __always_inline > +union prlist_head *to_prlist_head(struct list_head *list) > +{ > + return container_of(list, union prlist_head, list); > +} > + > +#define PRLIST_HEAD_INIT(name) { \ > + .list = LIST_HEAD_INIT(name), \ > +} > + > +#define PRLIST_HEAD(name) \ > + union prlist_head name __wr_after_init = PRLIST_HEAD_INIT(name.list) > + > +static __always_inline > +struct pmalloc_pool *prlist_create_custom_pool(size_t refill, > + unsigned short align_order) > +{ > + return pmalloc_create_custom_pool(refill, align_order, > + PMALLOC_MODE_START_WR); > +} > + > +static __always_inline struct pmalloc_pool *prlist_create_pool(void) > +{ > + return prlist_create_custom_pool(PMALLOC_REFILL_DEFAULT, > + PMALLOC_ALIGN_ORDER_DEFAULT); > +} > + > +static __always_inline > +void prlist_set_prev(union prlist_head *head, > + const union prlist_head *prev) > +{ > + wr_ptr(&head->prev, prev); > +} > + > +static __always_inline > +void prlist_set_next(union prlist_head *head, > + const union prlist_head *next) > +{ > + wr_ptr(&head->next, next); > +} > + > +static __always_inline void INIT_PRLIST_HEAD(union prlist_head *head) > +{ > + prlist_set_prev(head, head); > + prlist_set_next(head, head); > +} > + > +/* > + * Insert a new entry between two known consecutive entries. > + * > + * This is only for internal list manipulation where we know > + * the prev/next entries already! 
> + */ > +static __always_inline > +void __prlist_add(union prlist_head *new, union prlist_head *prev, > + union prlist_head *next) > +{ > + if (!__list_add_valid(&new->list, &prev->list, &next->list)) > + return; > + > + prlist_set_prev(next, new); > + prlist_set_next(new, next); > + prlist_set_prev(new, prev); > + prlist_set_next(prev, new); > +} > + > +/** > + * prlist_add - add a new entry > + * @new: new entry to be added > + * @head: prlist head to add it after > + * > + * Insert a new entry after the specified head. > + * This is good for implementing stacks. > + */ > +static __always_inline > +void prlist_add(union prlist_head *new, union prlist_head *head) > +{ > + __prlist_add(new, head, head->next); > +} > + > +/** > + * prlist_add_tail - add a new entry > + * @new: new entry to be added > + * @head: list head to add it before > + * > + * Insert a new entry before the specified head. > + * This is useful for implementing queues. > + */ > +static __always_inline > +void prlist_add_tail(union prlist_head *new, union prlist_head *head) > +{ > + __prlist_add(new, head->prev, head); > +} > + > +/* > + * Delete a prlist entry by making the prev/next entries > + * point to each other. > + * > + * This is only for internal list manipulation where we know > + * the prev/next entries already! > + */ > +static __always_inline > +void __prlist_del(union prlist_head *prev, union prlist_head *next) > +{ > + prlist_set_prev(next, prev); > + prlist_set_next(prev, next); > +} > + > +/** > + * prlist_del - deletes entry from list. > + * @entry: the element to delete from the list. > + * Note: list_empty() on entry does not return true after this, the entry is > + * in an undefined state. 
> + */ > +static inline void __prlist_del_entry(union prlist_head *entry) > +{ > + if (!__list_del_entry_valid(&entry->list)) > + return; > + __prlist_del(entry->prev, entry->next); > +} > + > +static __always_inline void prlist_del(union prlist_head *entry) > +{ > + __prlist_del_entry(entry); > + prlist_set_next(entry, LIST_POISON1); > + prlist_set_prev(entry, LIST_POISON2); > +} > + > +/** > + * prlist_replace - replace old entry by new one > + * @old : the element to be replaced > + * @new : the new element to insert > + * > + * If @old was empty, it will be overwritten. > + */ > +static __always_inline > +void prlist_replace(union prlist_head *old, union prlist_head *new) > +{ > + prlist_set_next(new, old->next); > + prlist_set_prev(new->next, new); > + prlist_set_prev(new, old->prev); > + prlist_set_next(new->prev, new); > +} > + > +static __always_inline > +void prlist_replace_init(union prlist_head *old, union prlist_head *new) > +{ > + prlist_replace(old, new); > + INIT_PRLIST_HEAD(old); > +} > + > +/** > + * prlist_del_init - deletes entry from list and reinitialize it. > + * @entry: the element to delete from the list. 
> + */ > +static __always_inline void prlist_del_init(union prlist_head *entry) > +{ > + __prlist_del_entry(entry); > + INIT_PRLIST_HEAD(entry); > +} > + > +/** > + * prlist_move - delete from one list and add as another's head > + * @list: the entry to move > + * @head: the head that will precede our entry > + */ > +static __always_inline > +void prlist_move(union prlist_head *list, union prlist_head *head) > +{ > + __prlist_del_entry(list); > + prlist_add(list, head); > +} > + > +/** > + * prlist_move_tail - delete from one list and add as another's tail > + * @list: the entry to move > + * @head: the head that will follow our entry > + */ > +static __always_inline > +void prlist_move_tail(union prlist_head *list, union prlist_head *head) > +{ > + __prlist_del_entry(list); > + prlist_add_tail(list, head); > +} > + > +/** > + * prlist_rotate_left - rotate the list to the left > + * @head: the head of the list > + */ > +static __always_inline void prlist_rotate_left(union prlist_head *head) > +{ > + union prlist_head *first; > + > + if (!list_empty(&head->list)) { > + first = head->next; > + prlist_move_tail(first, head); > + } > +} > + > +static __always_inline > +void __prlist_cut_position(union prlist_head *list, union prlist_head *head, > + union prlist_head *entry) > +{ > + union prlist_head *new_first = entry->next; > + > + prlist_set_next(list, head->next); > + prlist_set_prev(list->next, list); > + prlist_set_prev(list, entry); > + prlist_set_next(entry, list); > + prlist_set_next(head, new_first); > + prlist_set_prev(new_first, head); > +} > + > +/** > + * prlist_cut_position - cut a list into two > + * @list: a new list to add all removed entries > + * @head: a list with entries > + * @entry: an entry within head, could be the head itself > + * and if so we won't cut the list > + * > + * This helper moves the initial part of @head, up to and > + * including @entry, from @head to @list. You should > + * pass on @entry an element you know is on @head. 
@list > + * should be an empty list or a list you do not care about > + * losing its data. > + * > + */ > +static __always_inline > +void prlist_cut_position(union prlist_head *list, union prlist_head *head, > + union prlist_head *entry) > +{ > + if (list_empty(&head->list)) > + return; > + if (list_is_singular(&head->list) && > + (head->next != entry && head != entry)) > + return; > + if (entry == head) > + INIT_PRLIST_HEAD(list); > + else > + __prlist_cut_position(list, head, entry); > +} > + > +/** > + * prlist_cut_before - cut a list into two, before given entry > + * @list: a new list to add all removed entries > + * @head: a list with entries > + * @entry: an entry within head, could be the head itself > + * > + * This helper moves the initial part of @head, up to but > + * excluding @entry, from @head to @list. You should pass > + * in @entry an element you know is on @head. @list should > + * be an empty list or a list you do not care about losing > + * its data. > + * If @entry == @head, all entries on @head are moved to > + * @list. > + */ > +static __always_inline > +void prlist_cut_before(union prlist_head *list, union prlist_head *head, > + union prlist_head *entry) > +{ > + if (head->next == entry) { > + INIT_PRLIST_HEAD(list); > + return; > + } > + prlist_set_next(list, head->next); > + prlist_set_prev(list->next, list); > + prlist_set_prev(list, entry->prev); > + prlist_set_next(list->prev, list); > + prlist_set_next(head, entry); > + prlist_set_prev(entry, head); > +} > + > +static __always_inline > +void __prlist_splice(const union prlist_head *list, union prlist_head *prev, > + union prlist_head *next) > +{ > + union prlist_head *first = list->next; > + union prlist_head *last = list->prev; > + > + prlist_set_prev(first, prev); > + prlist_set_next(prev, first); > + prlist_set_next(last, next); > + prlist_set_prev(next, last); > +} > + > +/** > + * prlist_splice - join two lists, this is designed for stacks > + * @list: the new list to add. 
> + * @head: the place to add it in the first list. > + */ > +static __always_inline > +void prlist_splice(const union prlist_head *list, union prlist_head *head) > +{ > + if (!list_empty(&list->list)) > + __prlist_splice(list, head, head->next); > +} > + > +/** > + * prlist_splice_tail - join two lists, each list being a queue > + * @list: the new list to add. > + * @head: the place to add it in the first list. > + */ > +static __always_inline > +void prlist_splice_tail(union prlist_head *list, union prlist_head *head) > +{ > + if (!list_empty(&list->list)) > + __prlist_splice(list, head->prev, head); > +} > + > +/** > + * prlist_splice_init - join two lists and reinitialise the emptied list. > + * @list: the new list to add. > + * @head: the place to add it in the first list. > + * > + * The list at @list is reinitialised > + */ > +static __always_inline > +void prlist_splice_init(union prlist_head *list, union prlist_head *head) > +{ > + if (!list_empty(&list->list)) { > + __prlist_splice(list, head, head->next); > + INIT_PRLIST_HEAD(list); > + } > +} > + > +/** > + * prlist_splice_tail_init - join 2 lists and reinitialise the emptied list > + * @list: the new list to add. > + * @head: the place to add it in the first list. > + * > + * Each of the lists is a queue. 
> + * The list at @list is reinitialised > + */ > +static __always_inline > +void prlist_splice_tail_init(union prlist_head *list, > + union prlist_head *head) > +{ > + if (!list_empty(&list->list)) { > + __prlist_splice(list, head->prev, head); > + INIT_PRLIST_HEAD(list); > + } > +} > + > +/* ---- Protected Doubly Linked List with single pointer list head ---- */ > +union prhlist_head { > + struct hlist_head head __aligned(sizeof(void *)); > + union prhlist_node *first __aligned(sizeof(void *)); > +} __aligned(sizeof(void *)); > + > +union prhlist_node { > + struct hlist_node node __aligned(sizeof(void *)) > + ; > + struct { > + union prhlist_node *next __aligned(sizeof(void *)); > + union prhlist_node **pprev __aligned(sizeof(void *)); > + } __no_randomize_layout; > +} __aligned(sizeof(void *)); > + > +#define PRHLIST_HEAD_INIT { \ > + .head = HLIST_HEAD_INIT, \ > +} > + > +#define PRHLIST_HEAD(name) \ > + union prhlist_head name __wr_after_init = PRHLIST_HEAD_INIT > + > + > +#define is_static(object) \ > + unlikely(wr_check_boundaries(object, sizeof(*object))) > + > +static __always_inline > +struct pmalloc_pool *prhlist_create_custom_pool(size_t refill, > + unsigned short align_order) > +{ > + return pmalloc_create_custom_pool(refill, align_order, > + PMALLOC_MODE_AUTO_WR); > +} > + > +static __always_inline > +struct pmalloc_pool *prhlist_create_pool(void) > +{ > + return prhlist_create_custom_pool(PMALLOC_REFILL_DEFAULT, > + PMALLOC_ALIGN_ORDER_DEFAULT); > +} > + > +static __always_inline > +void prhlist_set_first(union prhlist_head *head, union prhlist_node *first) > +{ > + wr_ptr(&head->first, first); > +} > + > +static __always_inline > +void prhlist_set_next(union prhlist_node *node, union prhlist_node *next) > +{ > + wr_ptr(&node->next, next); > +} > + > +static __always_inline > +void prhlist_set_pprev(union prhlist_node *node, union prhlist_node **pprev) > +{ > + wr_ptr(&node->pprev, pprev); > +} > + > +static __always_inline > +void 
prhlist_set_prev(union prhlist_node *node, union prhlist_node *prev) > +{ > + wr_ptr(node->pprev, prev); > +} > + > +static __always_inline void INIT_PRHLIST_HEAD(union prhlist_head *head) > +{ > + prhlist_set_first(head, NULL); > +} > + > +static __always_inline void INIT_PRHLIST_NODE(union prhlist_node *node) > +{ > + prhlist_set_next(node, NULL); > + prhlist_set_pprev(node, NULL); > +} > + > +static __always_inline void __prhlist_del(union prhlist_node *n) > +{ > + union prhlist_node *next = n->next; > + union prhlist_node **pprev = n->pprev; > + > + wr_ptr(pprev, next); > + if (next) > + prhlist_set_pprev(next, pprev); > +} > + > +static __always_inline void prhlist_del(union prhlist_node *n) > +{ > + __prhlist_del(n); > + prhlist_set_next(n, LIST_POISON1); > + prhlist_set_pprev(n, LIST_POISON2); > +} > + > +static __always_inline void prhlist_del_init(union prhlist_node *n) > +{ > + if (!hlist_unhashed(&n->node)) { > + __prhlist_del(n); > + INIT_PRHLIST_NODE(n); > + } > +} > + > +static __always_inline > +void prhlist_add_head(union prhlist_node *n, union prhlist_head *h) > +{ > + union prhlist_node *first = h->first; > + > + prhlist_set_next(n, first); > + if (first) > + prhlist_set_pprev(first, &n->next); > + prhlist_set_first(h, n); > + prhlist_set_pprev(n, &h->first); > +} > + > +/* next must be != NULL */ > +static __always_inline > +void prhlist_add_before(union prhlist_node *n, union prhlist_node *next) > +{ > + prhlist_set_pprev(n, next->pprev); > + prhlist_set_next(n, next); > + prhlist_set_pprev(next, &n->next); > + prhlist_set_prev(n, n); > +} > + > +static __always_inline > +void prhlist_add_behind(union prhlist_node *n, union prhlist_node *prev) > +{ > + prhlist_set_next(n, prev->next); > + prhlist_set_next(prev, n); > + prhlist_set_pprev(n, &prev->next); > + if (n->next) > + prhlist_set_pprev(n->next, &n->next); > +} > + > +/* after that we'll appear to be on some hlist and hlist_del will work */ > +static __always_inline void 
prhlist_add_fake(union prhlist_node *n) > +{ > + prhlist_set_pprev(n, &n->next); > +} > + > +/* > + * Move a list from one list head to another. Fixup the pprev > + * reference of the first entry if it exists. > + */ > +static __always_inline > +void prhlist_move_list(union prhlist_head *old, union prhlist_head *new) > +{ > + prhlist_set_first(new, old->first); > + if (new->first) > + prhlist_set_pprev(new->first, &new->first); > + prhlist_set_first(old, NULL); > +} > + > +/* ------------------------ RCU list and hlist ------------------------ */ > + > +/* > + * INIT_LIST_HEAD_RCU - Initialize a list_head visible to RCU readers > + * @head: list to be initialized > + * > + * It is exactly equivalent to INIT_LIST_HEAD() > + */ > +static __always_inline void INIT_PRLIST_HEAD_RCU(union prlist_head *head) > +{ > + INIT_PRLIST_HEAD(head); > +} > + > +/* > + * Insert a new entry between two known consecutive entries. > + * > + * This is only for internal list manipulation where we know > + * the prev/next entries already! > + */ > +static __always_inline > +void __prlist_add_rcu(union prlist_head *new, union prlist_head *prev, > + union prlist_head *next) > +{ > + if (!__list_add_valid(&new->list, &prev->list, &next->list)) > + return; > + prlist_set_next(new, next); > + prlist_set_prev(new, prev); > + wr_rcu_assign_pointer(list_next_rcu(&prev->list), new); > + prlist_set_prev(next, new); > +} > + > +/** > + * prlist_add_rcu - add a new entry to rcu-protected prlist > + * @new: new entry to be added > + * @head: prlist head to add it after > + * > + * Insert a new entry after the specified head. > + * This is good for implementing stacks. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another prlist-mutation primitive, such as prlist_add_rcu() > + * or prlist_del_rcu(), running on this same list. 
> + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * list_for_each_entry_rcu(). > + */ > +static __always_inline > +void prlist_add_rcu(union prlist_head *new, union prlist_head *head) > +{ > + __prlist_add_rcu(new, head, head->next); > +} > + > +/** > + * prlist_add_tail_rcu - add a new entry to rcu-protected prlist > + * @new: new entry to be added > + * @head: prlist head to add it before > + * > + * Insert a new entry before the specified head. > + * This is useful for implementing queues. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another prlist-mutation primitive, such as prlist_add_tail_rcu() > + * or prlist_del_rcu(), running on this same list. > + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * list_for_each_entry_rcu(). > + */ > +static __always_inline > +void prlist_add_tail_rcu(union prlist_head *new, union prlist_head *head) > +{ > + __prlist_add_rcu(new, head->prev, head); > +} > + > +/** > + * prlist_del_rcu - deletes entry from prlist without re-initialization > + * @entry: the element to delete from the prlist. > + * > + * Note: list_empty() on entry does not return true after this, > + * the entry is in an undefined state. It is useful for RCU based > + * lockfree traversal. > + * > + * In particular, it means that we can not poison the forward > + * pointers that may still be used for walking the list. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another list-mutation primitive, such as prlist_del_rcu() > + * or prlist_add_rcu(), running on this same prlist. > + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * list_for_each_entry_rcu(). 
> + * > + * Note that the caller is not permitted to immediately free > + * the newly deleted entry. Instead, either synchronize_rcu() > + * or call_rcu() must be used to defer freeing until an RCU > + * grace period has elapsed. > + */ > +static __always_inline void prlist_del_rcu(union prlist_head *entry) > +{ > + __prlist_del_entry(entry); > + prlist_set_prev(entry, LIST_POISON2); > +} > + > +/** > + * prhlist_del_init_rcu - deletes entry from hash list with re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on the node return true after this. It is > + * useful for RCU based read lockfree traversal if the writer side > + * must know if the list entry is still hashed or already unhashed. > + * > + * In particular, it means that we can not poison the forward pointers > + * that may still be used for walking the hash list and we can only > + * zero the pprev pointer so list_unhashed() will return true after > + * this. > + * > + * The caller must take whatever precautions are necessary (such as > + * holding appropriate locks) to avoid racing with another > + * list-mutation primitive, such as hlist_add_head_rcu() or > + * hlist_del_rcu(), running on this same list. However, it is > + * perfectly legal to run concurrently with the _rcu list-traversal > + * primitives, such as hlist_for_each_entry_rcu(). > + */ > +static __always_inline void prhlist_del_init_rcu(union prhlist_node *n) > +{ > + if (!hlist_unhashed(&n->node)) { > + __prhlist_del(n); > + prhlist_set_pprev(n, NULL); > + } > +} > + > +/** > + * prlist_replace_rcu - replace old entry by new one > + * @old : the element to be replaced > + * @new : the new element to insert > + * > + * The @old entry will be replaced with the @new entry atomically. > + * Note: @old should not be empty. 
> + */ > +static __always_inline > +void prlist_replace_rcu(union prlist_head *old, union prlist_head *new) > +{ > + prlist_set_next(new, old->next); > + prlist_set_prev(new, old->prev); > + wr_rcu_assign_pointer(list_next_rcu(&new->prev->list), new); > + prlist_set_prev(new->next, new); > + prlist_set_prev(old, LIST_POISON2); > +} > + > +/** > + * __prlist_splice_init_rcu - join an RCU-protected list into an existing list. > + * @list: the RCU-protected list to splice > + * @prev: points to the last element of the existing list > + * @next: points to the first element of the existing list > + * @sync: function to sync: synchronize_rcu(), synchronize_sched(), ... > + * > + * The list pointed to by @prev and @next can be RCU-read traversed > + * concurrently with this function. > + * > + * Note that this function blocks. > + * > + * Important note: the caller must take whatever action is necessary to prevent > + * any other updates to the existing list. In principle, it is possible to > + * modify the list as soon as sync() begins execution. If this sort of thing > + * becomes necessary, an alternative version based on call_rcu() could be > + * created. But only if -really- needed -- there is no shortage of RCU API > + * members. > + */ > +static __always_inline > +void __prlist_splice_init_rcu(union prlist_head *list, > + union prlist_head *prev, > + union prlist_head *next, void (*sync)(void)) > +{ > + union prlist_head *first = list->next; > + union prlist_head *last = list->prev; > + > + /* > + * "first" and "last" tracking list, so initialize it. RCU readers > + * have access to this list, so we must use INIT_LIST_HEAD_RCU() > + * instead of INIT_LIST_HEAD(). > + */ > + > + INIT_PRLIST_HEAD_RCU(list); > + > + /* > + * At this point, the list body still points to the source list. > + * Wait for any readers to finish using the list before splicing > + * the list body into the new list. Any new readers will see > + * an empty list. 
> + */ > + > + sync(); > + > + /* > + * Readers are finished with the source list, so perform splice. > + * The order is important if the new list is global and accessible > + * to concurrent RCU readers. Note that RCU readers are not > + * permitted to traverse the prev pointers without excluding > + * this function. > + */ > + > + prlist_set_next(last, next); > + wr_rcu_assign_pointer(list_next_rcu(&prev->list), first); > + prlist_set_prev(first, prev); > + prlist_set_prev(next, last); > +} > + > +/** > + * prlist_splice_init_rcu - splice an RCU-protected list into an existing > + * list, designed for stacks. > + * @list: the RCU-protected list to splice > + * @head: the place in the existing list to splice the first list into > + * @sync: function to sync: synchronize_rcu(), synchronize_sched(), ... > + */ > +static __always_inline > +void prlist_splice_init_rcu(union prlist_head *list, > + union prlist_head *head, > + void (*sync)(void)) > +{ > + if (!list_empty(&list->list)) > + __prlist_splice_init_rcu(list, head, head->next, sync); > +} > + > +/** > + * prlist_splice_tail_init_rcu - splice an RCU-protected list into an > + * existing list, designed for queues. > + * @list: the RCU-protected list to splice > + * @head: the place in the existing list to splice the first list into > + * @sync: function to sync: synchronize_rcu(), synchronize_sched(), ... > + */ > +static __always_inline > +void prlist_splice_tail_init_rcu(union prlist_head *list, > + union prlist_head *head, > + void (*sync)(void)) > +{ > + if (!list_empty(&list->list)) > + __prlist_splice_init_rcu(list, head->prev, head, sync); > +} > + > +/** > + * prhlist_del_rcu - deletes entry from hash list without re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on entry does not return true after this, > + * the entry is in an undefined state. It is useful for RCU based > + * lockfree traversal. 
> + * > + * In particular, it means that we can not poison the forward > + * pointers that may still be used for walking the hash list. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another list-mutation primitive, such as hlist_add_head_rcu() > + * or hlist_del_rcu(), running on this same list. > + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * hlist_for_each_entry(). > + */ > +static __always_inline void prhlist_del_rcu(union prhlist_node *n) > +{ > + __prhlist_del(n); > + prhlist_set_pprev(n, LIST_POISON2); > +} > + > +/** > + * prhlist_replace_rcu - replace old entry by new one > + * @old : the element to be replaced > + * @new : the new element to insert > + * > + * The @old entry will be replaced with the @new entry atomically. > + */ > +static __always_inline > +void prhlist_replace_rcu(union prhlist_node *old, union prhlist_node *new) > +{ > + union prhlist_node *next = old->next; > + > + prhlist_set_next(new, next); > + prhlist_set_pprev(new, old->pprev); > + wr_rcu_assign_pointer(*(union prhlist_node __rcu **)new->pprev, new); > + if (next) > + prhlist_set_pprev(new->next, &new->next); > + prhlist_set_pprev(old, LIST_POISON2); > +} > + > +/** > + * prhlist_add_head_rcu > + * @n: the element to add to the hash list. > + * @h: the list to add to. > + * > + * Description: > + * Adds the specified element to the specified hlist, > + * while permitting racing traversals. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another list-mutation primitive, such as hlist_add_head_rcu() > + * or hlist_del_rcu(), running on this same list. 
> + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * hlist_for_each_entry_rcu(), used to prevent memory-consistency > + * problems on Alpha CPUs. Regardless of the type of CPU, the > + * list-traversal primitive must be guarded by rcu_read_lock(). > + */ > +static __always_inline > +void prhlist_add_head_rcu(union prhlist_node *n, union prhlist_head *h) > +{ > + union prhlist_node *first = h->first; > + > + prhlist_set_next(n, first); > + prhlist_set_pprev(n, &h->first); > + wr_rcu_assign_pointer(hlist_first_rcu(&h->head), n); > + if (first) > + prhlist_set_pprev(first, &n->next); > +} > + > +/** > + * prhlist_add_tail_rcu > + * @n: the element to add to the hash list. > + * @h: the list to add to. > + * > + * Description: > + * Adds the specified element to the specified hlist, > + * while permitting racing traversals. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another list-mutation primitive, such as prhlist_add_head_rcu() > + * or prhlist_del_rcu(), running on this same list. > + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * hlist_for_each_entry_rcu(), used to prevent memory-consistency > + * problems on Alpha CPUs. Regardless of the type of CPU, the > + * list-traversal primitive must be guarded by rcu_read_lock(). > + */ > +static __always_inline > +void prhlist_add_tail_rcu(union prhlist_node *n, union prhlist_head *h) > +{ > + union prhlist_node *i, *last = NULL; > + > + /* Note: write side code, so rcu accessors are not needed. 
*/ > + for (i = h->first; i; i = i->next) > + last = i; > + > + if (last) { > + prhlist_set_next(n, last->next); > + prhlist_set_pprev(n, &last->next); > + wr_rcu_assign_pointer(hlist_next_rcu(&last->node), n); > + } else { > + prhlist_add_head_rcu(n, h); > + } > +} > + > +/** > + * prhlist_add_before_rcu > + * @n: the new element to add to the hash list. > + * @next: the existing element to add the new element before. > + * > + * Description: > + * Adds the specified element to the specified hlist > + * before the specified node while permitting racing traversals. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another list-mutation primitive, such as prhlist_add_head_rcu() > + * or prhlist_del_rcu(), running on this same list. > + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * hlist_for_each_entry_rcu(), used to prevent memory-consistency > + * problems on Alpha CPUs. > + */ > +static __always_inline > +void prhlist_add_before_rcu(union prhlist_node *n, union prhlist_node *next) > +{ > + prhlist_set_pprev(n, next->pprev); > + prhlist_set_next(n, next); > + wr_rcu_assign_pointer(hlist_pprev_rcu(&n->node), n); > + prhlist_set_pprev(next, &n->next); > +} > + > +/** > + * prhlist_add_behind_rcu > + * @n: the new element to add to the hash list. > + * @prev: the existing element to add the new element after. > + * > + * Description: > + * Adds the specified element to the specified hlist > + * after the specified node while permitting racing traversals. > + * > + * The caller must take whatever precautions are necessary > + * (such as holding appropriate locks) to avoid racing > + * with another list-mutation primitive, such as prhlist_add_head_rcu() > + * or prhlist_del_rcu(), running on this same list. 
> + * However, it is perfectly legal to run concurrently with > + * the _rcu list-traversal primitives, such as > + * hlist_for_each_entry_rcu(), used to prevent memory-consistency > + * problems on Alpha CPUs. > + */ > +static __always_inline > +void prhlist_add_behind_rcu(union prhlist_node *n, union prhlist_node *prev) > +{ > + prhlist_set_next(n, prev->next); > + prhlist_set_pprev(n, &prev->next); > + wr_rcu_assign_pointer(hlist_next_rcu(&prev->node), n); > + if (n->next) > + prhlist_set_pprev(n->next, &n->next); > +} > +#endif > -- > 2.17.1 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> Some of the data structures used in list management are composed by two
> pointers. Since the kernel is now configured by default to randomize the
> layout of data structures solely composed of pointers,
Isn't this only true for structures composed solely of function pointers?
On 24/10/18 14:37, Mathieu Desnoyers wrote: > I could not find a description of the overall context of this patch > (e.g. a patch 00/17 ?) that would explain the attack vectors this aims > to protect against. Apologies, I have to admit I was a bit baffled about what to do: the patchset spans several subsystems and I was reluctant to spam all the mailing lists. I was hoping that by CC-ing kernel.org, the explicit recipients would get both the mail directly intended for them (as subsystem maintainers/supporters) and the rest. The next time I'll err in the opposite direction. In the meantime, please find the whole set here: https://www.openwall.com/lists/kernel-hardening/2018/10/23/3 > This might help figuring out whether this added complexity in the core > kernel is worth it. I hope it will. > Also, is it the right approach to duplicate existing APIs, or should we > rather hook into page fault handlers and let the kernel do those "shadow" > mappings under the hood ? This question is probably a good candidate for the small Q&A section I have in the 00/17. > Adding a new GFP flags for dynamic allocation, and a macro mapping to > a section attribute might suffice for allocation or definition of such > mostly-read-only/seldom-updated data. I think what you are proposing makes sense from a pure hardening standpoint. From a more defensive one, I'd rather minimise the chances of giving a free pass to an attacker. Maybe there is a better implementation of this than what I have in mind. But, based on my current understanding of what you are describing, there would be a few issues: 1) where would the pool go? The pool is a way to manage multiple vmas and express common properties they share. Even before a vma is associated to the pool. 2) there would be more code that can seamlessly deal with both protected and regular data. Based on what? Some parameter, I suppose. That parameter would be the new target. 
If the code is "duplicated", as you say, the actual differences are baked in at compile time. The "duplication" would also allow always-inlined functions for write-rare, while leaving more freedom to the compiler for their non-protected version. Besides, I think the separate wr version also makes it very clear, to the user of the API, that there will be a price to pay in terms of performance. The more seamless alternative might make this price less obvious. -- igor
Hi, On 24/10/18 06:27, Randy Dunlap wrote: > a. It seems backwards (or upside down) to have a test case select a feature (PRMEM) > instead of depending on that feature. > > b. Since PRMEM depends on MMU (in patch 04/17), the "select" here could try to > enable PRMEM even when MMU is not enabled. > > Changing this to "depends on PRMEM" would solve both of these issues. The weird dependency you pointed out is partially caused by the incompleteness of PRMEM. What I have in mind is to have a fallback version of it for systems without an MMU capable of write protection, possibly defaulting to kvmalloc. In that case there would not be any need for a configuration option. > c. Don't use "default n". That is already the default. ok -- igor
Hi, On 24/10/18 06:48, Randy Dunlap wrote: > Hi, > > On 10/23/18 2:34 PM, Igor Stoppa wrote: [...] >> +- The present document doesn't address such transient. > > transience. ok [...] >> + are attempted after the write protection is in place, will cause > > no comma. ok [...] >> + - Its usefulness depends on the specific use case at hand > > end above sentence with a period, please, like all of the others above it. ok >> + - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed. > > Please try to keep the line above and a few below to < 80 characters in length. > (because some of us read rst files as text files, with a text editor, and line > wrap is ugly) ok, I still have to master .rst :-( [...] >> +- The users of rare write must take care of ensuring the atomicity of the > > s/rare write/write rare/ ? thanks >> + action, respect to the way they use the data being altered; for example, > > This .. "respect to the way" is awkward, but I don't know what to > change it to. > >> + take a lock before making a copy of the value to modify (if it's >> + relevant), then alter it, issue the call to rare write and finally >> + release the lock. Some special scenario might be exempt from the need >> + for locking, but in general rare-write must be treated as an operation > > It seemed to me that "write-rare" (or write rare) was the going name, but now > it's being called "rare write" (or rare-write). Just be consistent, please. write-rare it is, because it can be shortened as wr_xxx; rare_write would become rw_xxx, which wrongly hints at read/write, which it definitely is not. >> + tlb entries. It still does a better job of it, compared to invoking > > TLB ok >> + vmalloc for each allocation, but it is undeniably less optimized wrt to > > s/wrt/with respect to/ yes > Thanks for the documentation. thanks for the review :-) -- igor
On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
> On 24/10/18 14:37, Mathieu Desnoyers wrote:
> > Also, is it the right approach to duplicate existing APIs, or should we
> > rather hook into page fault handlers and let the kernel do those "shadow"
> > mappings under the hood ?
>
> This question is probably a good candidate for the small Q&A section I have
> in the 00/17.
>
>
> > Adding a new GFP flags for dynamic allocation, and a macro mapping to
> > a section attribute might suffice for allocation or definition of such
> > mostly-read-only/seldom-updated data.
>
> I think what you are proposing makes sense from a pure hardening standpoint.
> From a more defensive one, I'd rather minimise the chances of giving a free
> pass to an attacker.
>
> Maybe there is a better implementation of this, than what I have in mind.
> But, based on my current understanding of what you are describing, there
> would be few issues:
>
> 1) where would the pool go? The pool is a way to manage multiple vmas and
> express common property they share. Even before a vma is associated to the
> pool.
>
> 2) there would be more code that can seamlessly deal with both protected and
> regular data. Based on what? Some parameter, I suppose.
> That parameter would be the new target.
> If the code is "duplicated", as you say, the actual differences are baked in
> at compile time. The "duplication" would also allow to have always inlined
> functions for write-rare and leave more freedom to the compiler for their
> non-protected version.
>
> Besides, I think the separate wr version also makes it very clear, to the
> user of the API, that there will be a price to pay, in terms of performance.
> The more seamlessly alternative might make this price less obvious.
What about something in the middle, where we move list to list_impl.h,
and add a few macros where you have list_set_prev() in prlist now, so
we could do,
// prlist.h
#define list_set_next(head, next) wr_ptr(&head->next, next)
#define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
#include <linux/list_impl.h>
// list.h
#define list_set_next(head, next) (head->next = next)
#define list_set_prev(head, prev) (head->prev = prev)
#include <linux/list_impl.h>
I wonder then if you can get rid of some of the type punning too? It's
not clear exactly why that's necessary from the series, but perhaps
I'm missing something obvious :)
I also wonder how much difference having the actual behaviour baked in
at compile time makes. Most (all?) of this code is inlined.
Cheers,
Tycho
On 24/10/2018 17:56, Tycho Andersen wrote: > On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote: >> On 24/10/18 14:37, Mathieu Desnoyers wrote: >>> Also, is it the right approach to duplicate existing APIs, or should we >>> rather hook into page fault handlers and let the kernel do those "shadow" >>> mappings under the hood ? >> >> This question is probably a good candidate for the small Q&A section I have >> in the 00/17. >> >> >>> Adding a new GFP flags for dynamic allocation, and a macro mapping to >>> a section attribute might suffice for allocation or definition of such >>> mostly-read-only/seldom-updated data. >> >> I think what you are proposing makes sense from a pure hardening standpoint. >> From a more defensive one, I'd rather minimise the chances of giving a free >> pass to an attacker. >> >> Maybe there is a better implementation of this, than what I have in mind. >> But, based on my current understanding of what you are describing, there >> would be few issues: >> >> 1) where would the pool go? The pool is a way to manage multiple vmas and >> express common property they share. Even before a vma is associated to the >> pool. >> >> 2) there would be more code that can seamlessly deal with both protected and >> regular data. Based on what? Some parameter, I suppose. >> That parameter would be the new target. >> If the code is "duplicated", as you say, the actual differences are baked in >> at compile time. The "duplication" would also allow to have always inlined >> functions for write-rare and leave more freedom to the compiler for their >> non-protected version. >> >> Besides, I think the separate wr version also makes it very clear, to the >> user of the API, that there will be a price to pay, in terms of performance. >> The more seamlessly alternative might make this price less obvious. 
> > What about something in the middle, where we move list to list_impl.h, > and add a few macros where you have list_set_prev() in prlist now, so > we could do, > > // prlist.h > > #define list_set_next(head, next) wr_ptr(&head->next, next) > #define list_set_prev(head, prev) wr_ptr(&head->prev, prev) > > #include <linux/list_impl.h> > > // list.h > > #define list_set_next(next) (head->next = next) > #define list_set_next(prev) (head->prev = prev) > > #include <linux/list_impl.h> > > I wonder then if you can get rid of some of the type punning too? It's > not clear exactly why that's necessary from the series, but perhaps > I'm missing something obvious :) nothing obvious, probably there is only half a reference in the slides I linked to in the cover letter :-) So far I have minimized the number of "intrinsic" write-rare functions, mostly because I want first to reach an agreement on the implementation of the core write-rare. However, once that is done, it might be good to also convert the prlists to be "intrinsics". A list node is 2 pointers. If that were also the alignment, i.e. __align(sizeof(list_head)), it might be possible to speed up the list handling considerably, even as write rare. Taking the insertion operation as an example, it would probably be sufficient, in most cases, to have only two remappings: - one covering the page with the latest two nodes - one covering the page with the list head That is 2 vs 8 remappings, and far fewer memory barriers. This would be incompatible with what you are proposing, yet it would be justifiable, I think, because it would give prlist better performance, potentially widening its adoption where performance is a concern. > I also wonder how much the actual differences being baked in at > compile time makes. Most (all?) of this code is inlined. If the inlined function expects to receive a prlist_head *, instead of a list_head *, doesn't that help turn runtime bugs into build-time bugs? Or maybe I'm missing your point? 
-- igor
On 24/10/2018 06:12, Matthew Wilcox wrote: > On Wed, Oct 24, 2018 at 12:34:55AM +0300, Igor Stoppa wrote: >> The connection between each page and its vmap_area avoids more expensive >> searches through the btree of vmap_areas. > > Typo -- it's an rbtree. ack >> +++ b/include/linux/mm_types.h >> @@ -87,13 +87,24 @@ struct page { >> /* See page-flags.h for PAGE_MAPPING_FLAGS */ >> struct address_space *mapping; >> pgoff_t index; /* Our offset within mapping. */ >> - /** >> - * @private: Mapping-private opaque data. >> - * Usually used for buffer_heads if PagePrivate. >> - * Used for swp_entry_t if PageSwapCache. >> - * Indicates order in the buddy system if PageBuddy. >> - */ >> - unsigned long private; >> + union { >> + /** >> + * @private: Mapping-private opaque data. >> + * Usually used for buffer_heads if >> + * PagePrivate. >> + * Used for swp_entry_t if PageSwapCache. >> + * Indicates order in the buddy system if >> + * PageBuddy. >> + */ >> + unsigned long private; >> + /** >> + * @area: reference to the containing area >> + * For pages that are mapped into a virtually >> + * contiguous area, avoids performing a more >> + * expensive lookup. >> + */ >> + struct vmap_area *area; >> + }; > > Not like this. Make it part of a different struct in the existing union, > not a part of the pagecache struct. And there's no need to use ->private > explicitly. Ok, I'll have a look at the googledoc you made >> @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, >> if (!addr) >> return NULL; >> >> + va = __find_vmap_area((unsigned long)addr); >> + for (i = 0; i < va->vm->nr_pages; i++) >> + va->vm->pages[i]->area = va; > > I don't like it that you're calling this for _every_ vmalloc() caller > when most of them will never use this. Perhaps have page->va be initially > NULL and then cache the lookup in it when it's accessed for the first time. 
> If __find_vmap_area() were part of the API, this loop could be left out of __vmalloc_node_range() and the user of the allocation could initialize the field, if needed. What is the reason for keeping __find_vmap_area() private? -- igor
On Wed, Oct 24, 2018 at 12:34:47AM +0300, Igor Stoppa wrote:
> -- Summary --
>
> Preliminary version of memory protection patchset, including a sample use
> case, turning into write-rare the IMA measurement list.
I haven't looked at the patches yet, but I see a significant issue
from the subject lines. "prmem" is very similar to "pmem"
(persistent memory) and that's going to cause confusion. Especially
if people start talking about prmem and pmem in the context of write
protect pmem with prmem...
Naming is hard :/
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Hi Igor, On Wed, Oct 24, 2018 at 12:34:57AM +0300, Igor Stoppa wrote: > Documentation for protected memory. > > Topics covered: > * static memory allocation > * dynamic memory allocation > * write-rare > > Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> > CC: Jonathan Corbet <corbet@lwn.net> > CC: Randy Dunlap <rdunlap@infradead.org> > CC: Mike Rapoport <rppt@linux.vnet.ibm.com> > CC: linux-doc@vger.kernel.org > CC: linux-kernel@vger.kernel.org > --- > Documentation/core-api/index.rst | 1 + > Documentation/core-api/prmem.rst | 172 +++++++++++++++++++++++++++++++ Thanks for having docs a part of the patchset! > MAINTAINERS | 1 + > 3 files changed, 174 insertions(+) > create mode 100644 Documentation/core-api/prmem.rst > > diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst > index 26b735cefb93..1a90fa878d8d 100644 > --- a/Documentation/core-api/index.rst > +++ b/Documentation/core-api/index.rst > @@ -31,6 +31,7 @@ Core utilities > gfp_mask-from-fs-io > timekeeping > boot-time-mm > + prmem > > Interfaces for kernel debugging > =============================== > diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst > new file mode 100644 > index 000000000000..16d7edfe327a > --- /dev/null > +++ b/Documentation/core-api/prmem.rst > @@ -0,0 +1,172 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +.. _prmem: > + > +Memory Protection > +================= > + > +:Date: October 2018 > +:Author: Igor Stoppa <igor.stoppa@huawei.com> > + > +Foreword > +-------- > +- In a typical system using some sort of RAM as execution environment, > + **all** memory is initially writable. > + > +- It must be initialized with the appropriate content, be it code or data. > + > +- Said content typically undergoes modifications, i.e. relocations or > + relocation-induced changes. > + > +- The present document doesn't address such transient. 
> + > +- Kernel code is protected at system level and, unlike data, it doesn't > + require special attention. > + I feel that foreword should include a sentence or two saying why we need the memory protection and when it can/should be used. > +Protection mechanism > +-------------------- > + > +- When available, the MMU can write protect memory pages that would be > + otherwise writable. > + > +- The protection has page-level granularity. > + > +- An attempt to overwrite a protected page will trigger an exception. > +- **Write protected data must go exclusively to write protected pages** > +- **Writable data must go exclusively to writable pages** > + > +Available protections for kernel data > +------------------------------------- > + > +- **constant** > + Labelled as **const**, the data is never supposed to be altered. > + It is statically allocated - if it has any memory footprint at all. > + The compiler can even optimize it away, where possible, by replacing > + references to a **const** with its actual value. > + > +- **read only after init** > + By tagging an otherwise ordinary statically allocated variable with > + **__ro_after_init**, it is placed in a special segment that will > + become write protected, at the end of the kernel init phase. > + The compiler has no notion of this restriction and it will treat any > + write operation on such variable as legal. However, assignments that > + are attempted after the write protection is in place, will cause > + exceptions. > + > +- **write rare after init** > + This can be seen as variant of read only after init, which uses the > + tag **__wr_after_init**. It is also limited to statically allocated > + memory. It is still possible to alter this type of variables, after no comma ^ > + the kernel init phase is complete, however it can be done exclusively > + with special functions, instead of the assignment operator. 
Using the > + assignment operator after conclusion of the init phase will still > + trigger an exception. It is not possible to transition a certain > + variable from __wr_ater_init to a permanent read-only status, at __wr_aFter_init > + runtime. > + > +- **dynamically allocated write-rare / read-only** > + After defining a pool, memory can be obtained through it, primarily > + through the **pmalloc()** allocator. The exact writability state of the > + memory obtained from **pmalloc()** and friends can be configured when > + creating the pool. At any point it is possible to transition to a less > + permissive write status the memory currently associated to the pool. > + Once memory has become read-only, it the only valid operation, beside ... become read-only, the only valid operation > + reading, is to released it, by destroying the pool it belongs to. > + > + > +Protecting dynamically allocated memory > +--------------------------------------- > + > +When dealing with dynamically allocated memory, three options are > + available for configuring its writability state: > + > +- **Options selected when creating a pool** > + When creating the pool, it is possible to choose one of the following: > + - **PMALLOC_MODE_RO** > + - Writability at allocation time: *WRITABLE* > + - Writability at protection time: *NONE* > + - **PMALLOC_MODE_WR** > + - Writability at allocation time: *WRITABLE* > + - Writability at protection time: *WRITE-RARE* > + - **PMALLOC_MODE_AUTO_RO** > + - Writability at allocation time: > + - the latest allocation: *WRITABLE* > + - every other allocation: *NONE* > + - Writability at protection time: *NONE* > + - **PMALLOC_MODE_AUTO_WR** > + - Writability at allocation time: > + - the latest allocation: *WRITABLE* > + - every other allocation: *WRITE-RARE* > + - Writability at protection time: *WRITE-RARE* > + - **PMALLOC_MODE_START_WR** > + - Writability at allocation time: *WRITE-RARE* > + - Writability at protection time: *WRITE-RARE* For me this 
part is completely blind. Maybe arranging this as a table would make the states more clearly visible. > + > + **Remarks:** > + - The "AUTO" modes perform automatic protection of the content, whenever > + the current vmap_area is used up and a new one is allocated. > + - At that point, the vmap_area being phased out is protected. > + - The size of the vmap_area depends on various parameters. > + - It might not be possible to know for sure *when* certain data will > + be protected. > + - The functionality is provided as tradeoff between hardening and speed. > + - Its usefulness depends on the specific use case at hand > + - The "START_WR" mode is the only one which provides immediate protection, at the cost of speed. > + > +- **Protecting the pool** > + This is achieved with **pmalloc_protect_pool()** > + - Any vmap_area currently in the pool is write-protected according to its initial configuration. > + - Any residual space still available from the current vmap_area is lost, as the area is protected. > + - **protecting a pool after every allocation will likely be very wasteful** > + - Using PMALLOC_MODE_START_WR is likely a better choice. > + > +- **Upgrading the protection level** > + This is achieved with **pmalloc_make_pool_ro()** > + - it turns the present content of a write-rare pool into read-only > + - can be useful when the content of the memory has settled > + > + > +Caveats > +------- > +- Freeing of memory is not supported. Pages will be returned to the > + system upon destruction of their memory pool. > + > +- The address range available for vmalloc (and thus for pmalloc too) is > + limited, on 32-bit systems. However it shouldn't be an issue, since not no comma ^ > + much data is expected to be dynamically allocated and turned into > + write-protected. > + > +- Regarding SMP systems, changing state of pages and altering mappings > + requires performing cross-processor synchronizations of page tables. 
> + This is an additional reason for limiting the use of write rare. > + > +- Not only the pmalloc memory must be protected, but also any reference to > + it that might become the target for an attack. The attack would replace > + a reference to the protected memory with a reference to some other, > + unprotected, memory. > + > +- The users of rare write must take care of ensuring the atomicity of the > + action, respect to the way they use the data being altered; for example, > + take a lock before making a copy of the value to modify (if it's > + relevant), then alter it, issue the call to rare write and finally > + release the lock. Some special scenario might be exempt from the need > + for locking, but in general rare-write must be treated as an operation > + that can incur into races. > + > +- pmalloc relies on virtual memory areas and will therefore use more > + tlb entries. It still does a better job of it, compared to invoking > + vmalloc for each allocation, but it is undeniably less optimized wrt to > + TLB use than using the physmap directly, through kmalloc or similar. > + > + > +Utilization > +----------- > + > +**add examples here** > + > +API > +--- > + > +.. kernel-doc:: include/linux/prmem.h > +.. kernel-doc:: mm/prmem.c > +.. kernel-doc:: include/linux/prmemextra.h > diff --git a/MAINTAINERS b/MAINTAINERS > index ea979a5a9ec9..246b1a1cc8bb 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -9463,6 +9463,7 @@ F: include/linux/prmemextra.h > F: mm/prmem.c > F: mm/test_write_rare.c > F: mm/test_pmalloc.c > +F: Documentation/core-api/prmem.rst I think the MAINTAINERS update can go in one chunk as the last patch in the series. > MEMORY MANAGEMENT > L: linux-mm@kvack.org > -- > 2.17.1 > -- Sincerely yours, Mike.
On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
> +static __always_inline
> +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
> +{
> + struct page *page;
> + uintptr_t base;
> + uintptr_t offset;
> + unsigned long flags;
> + size_t size = sizeof(*l);
> + bool is_virt = __is_wr_after_init(l, size);
> +
> + if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
> + WR_ERR_RANGE_MSG))
> + return false;
> + local_irq_save(flags);
> + if (is_virt)
> + page = virt_to_page(l);
> + else
> + vmalloc_to_page(l);
> + offset = (~PAGE_MASK) & (uintptr_t)l;
> + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> + if (WARN(!base, WR_ERR_PAGE_MSG)) {
> + local_irq_restore(flags);
> + return false;
> + }
> + if (inc)
> + atomic_long_inc((atomic_long_t *)(base + offset));
> + else
> + atomic_long_dec((atomic_long_t *)(base + offset));
> + vunmap((void *)base);
> + local_irq_restore(flags);
> + return true;
> +
> +}
That's just hideously nasty.. and horribly broken.
We're not going to duplicate all these kernel interfaces wrapped in gunk
like that. Also, you _cannot_ call vunmap() with IRQs disabled. Clearly
you've never tested this with debug bits enabled.
> +static __always_inline bool __is_wr_after_init(const void *ptr, size_t size) > +{ > + size_t start = (size_t)&__start_wr_after_init; > + size_t end = (size_t)&__end_wr_after_init; > + size_t low = (size_t)ptr; > + size_t high = (size_t)ptr + size; > + > + return likely(start <= low && low < high && high <= end); > +} size_t is an odd type choice for doing address arithmetic. > +/** > + * wr_memset() - sets n bytes of the destination to the c value > + * @dst: beginning of the memory to write to > + * @c: byte to replicate > + * @size: amount of bytes to copy > + * > + * Returns true on success, false otherwise. > + */ > +static __always_inline > +bool wr_memset(const void *dst, const int c, size_t n_bytes) > +{ > + size_t size; > + unsigned long flags; > + uintptr_t d = (uintptr_t)dst; > + > + if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) > + return false; > + while (n_bytes) { > + struct page *page; > + uintptr_t base; > + uintptr_t offset; > + uintptr_t offset_complement; Again, these are really odd choices for types. vmap() returns a void* pointer, on which you can do arithmetic. Why bother keeping another type to which you have to cast to and from? BTW, our usual "pointer stored in an integer type" is 'unsigned long', if a pointer needs to be manipulated. > + local_irq_save(flags); Why are you doing the local_irq_save()? > + page = virt_to_page(d); > + offset = d & ~PAGE_MASK; > + offset_complement = PAGE_SIZE - offset; > + size = min(n_bytes, offset_complement); > + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL); Can you even call vmap() (which sleeps) with interrupts off? > + if (WARN(!base, WR_ERR_PAGE_MSG)) { > + local_irq_restore(flags); > + return false; > + } You really need some kmap_atomic()-style accessors to wrap this stuff for you. This little pattern is repeated over and over. ... 
> +const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range.";
> +const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page.";

Doesn't the compiler de-duplicate duplicated strings for you? Is there
any reason to declare these like this?
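The kmap_atomic()-style accessors asked for above could factor the repeated map/write/unmap pattern into a pair of helpers, roughly like this (a sketch only; wr_map()/wr_unmap() are hypothetical names, not code from the posted series):

```c
/*
 * Hypothetical helpers wrapping the alternate-mapping pattern;
 * kernel-only sketch, assuming the vmap()/vunmap() usage of the series.
 */
static void *wr_map(unsigned long addr)
{
	struct page *page = virt_to_page(addr);
	void *base = vmap(&page, 1, VM_MAP, PAGE_KERNEL);

	/* return an alias pointing at the same offset within the page */
	return base ? base + (addr & ~PAGE_MASK) : NULL;
}

static void wr_unmap(void *ptr)
{
	/* vunmap() wants the page-aligned base of the alias */
	vunmap((void *)((unsigned long)ptr & PAGE_MASK));
}
```

Each wr_* operation would then reduce to wr_map(), a plain memset()/memcpy() on the alias, and wr_unmap(), instead of open-coding the whole sequence in every function.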
On 10/23/18 2:34 PM, Igor Stoppa wrote:
> +#define VM_PMALLOC 0x00000100 /* pmalloc area - see docs */
> +#define VM_PMALLOC_WR 0x00000200 /* pmalloc write rare area */
> +#define VM_PMALLOC_PROTECTED 0x00000400 /* pmalloc protected area */
Please introduce things as you use them. It's impossible to review a
patch that just says "see docs" that doesn't contain any docs. :)
On 10/23/18 2:34 PM, Igor Stoppa wrote:
> Wrappers around the basic write rare functionality, addressing several
> common data types found in the kernel, allowing to specify the new
> values through immediates, like constants and defines.
I have to wonder whether this is the right way, or whether a
size-agnostic function like put_user() would be a better fit.
put_user() is certainly easier to use.
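A put_user()-style, size-agnostic interface as suggested here might look like the following (hypothetical sketch; wr_assign is an assumed name, built on the series' wr_memcpy() primitive):

```c
/*
 * Hypothetical size-agnostic write-rare assignment, in the style of
 * put_user(): the size comes from the variable, not from the caller.
 */
#define wr_assign(var, val) ({					\
	typeof(var) __wr_tmp = (val);				\
	wr_memcpy(&(var), &__wr_tmp, sizeof(var));		\
})
```

With this, `wr_assign(some_wr_var, 42);` would cover all the per-type wr_int(), wr_long(), wr_ptr() wrappers in one go.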
On Thu, Oct 25, 2018 at 02:01:02AM +0300, Igor Stoppa wrote:
> > > @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > > if (!addr)
> > > return NULL;
> > > + va = __find_vmap_area((unsigned long)addr);
> > > + for (i = 0; i < va->vm->nr_pages; i++)
> > > + va->vm->pages[i]->area = va;
> >
> > I don't like it that you're calling this for _every_ vmalloc() caller
> > when most of them will never use this. Perhaps have page->va be initially
> > NULL and then cache the lookup in it when it's accessed for the first time.
> >
>
> If __find_vmap_area() was part of the API, this loop could be left out from
> __vmalloc_node_range() and the user of the allocation could initialize the
> field, if needed.
>
> What is the reason for keeping __find_vmap_area() private?
Well, for one, you're walking the rbtree without holding the spinlock,
so you're going to get crashes. I don't see why we shouldn't export
find_vmap_area() though.
Another way we could approach this is to embed the vmap_area in the
vm_struct. It'd require a bit of juggling of the alloc/free paths in
vmalloc, but it might be worthwhile.
On Thu, Oct 25, 2018 at 01:52:11AM +0300, Igor Stoppa wrote:
> On 24/10/2018 17:56, Tycho Andersen wrote:
> > On Wed, Oct 24, 2018 at 05:03:01PM +0300, Igor Stoppa wrote:
> > > On 24/10/18 14:37, Mathieu Desnoyers wrote:
> > > > Also, is it the right approach to duplicate existing APIs, or should we
> > > > rather hook into page fault handlers and let the kernel do those "shadow"
> > > > mappings under the hood ?
> > >
> > > This question is probably a good candidate for the small Q&A section I have
> > > in the 00/17.
> > >
> > > > Adding a new GFP flags for dynamic allocation, and a macro mapping to
> > > > a section attribute might suffice for allocation or definition of such
> > > > mostly-read-only/seldom-updated data.
> > >
> > > I think what you are proposing makes sense from a pure hardening standpoint.
> > > From a more defensive one, I'd rather minimise the chances of giving a free
> > > pass to an attacker.
> > >
> > > Maybe there is a better implementation of this, than what I have in mind.
> > > But, based on my current understanding of what you are describing, there
> > > would be a few issues:
> > >
> > > 1) where would the pool go? The pool is a way to manage multiple vmas and
> > > express common properties they share, even before a vma is associated to the
> > > pool.
> > >
> > > 2) there would be more code that can seamlessly deal with both protected and
> > > regular data. Based on what? Some parameter, I suppose.
> > > That parameter would be the new target.
> > > If the code is "duplicated", as you say, the actual differences are baked in
> > > at compile time. The "duplication" would also allow to have always inlined
> > > functions for write-rare and leave more freedom to the compiler for their
> > > non-protected version.
> > >
> > > Besides, I think the separate wr version also makes it very clear, to the
> > > user of the API, that there will be a price to pay, in terms of performance.
> > > The more seamless alternative might make this price less obvious.
> >
> > What about something in the middle, where we move list to list_impl.h,
> > and add a few macros where you have list_set_prev() in prlist now, so
> > we could do,
> >
> > // prlist.h
> >
> > #define list_set_next(head, next) wr_ptr(&head->next, next)
> > #define list_set_prev(head, prev) wr_ptr(&head->prev, prev)
> >
> > #include <linux/list_impl.h>
> >
> > // list.h
> >
> > #define list_set_next(head, next) (head->next = next)
> > #define list_set_prev(head, prev) (head->prev = prev)
> >
> > #include <linux/list_impl.h>
> >
> > I wonder then if you can get rid of some of the type punning too? It's
> > not clear exactly why that's necessary from the series, but perhaps
> > I'm missing something obvious :)
>
> nothing obvious, probably there is only half a reference in the slides I
> linked-to in the cover letter :-)
>
> So far I have minimized the number of "intrinsic" write rare functions,
> mostly because I would want first to reach an agreement on the
> implementation of the core write-rare.
>
> However, once that is done, it might be good to convert also the prlists to
> be "intrinsics". A list node is 2 pointers.
> If that was the alignment, i.e. __align(sizeof(list_head)), it might be
> possible to speed up a lot the list handling even as write rare.
>
> Taking as example the insertion operation, it would probably be sufficient,
> in most cases, to have only two remappings:
> - one covering the page with the latest two nodes
> - one covering the page with the list head
>
> That is 2 vs 8 remappings, and a good deal fewer memory barriers.
>
> This would be incompatible with what you are proposing, yet it would be
> justifiable, I think, because it would provide better performance to prlist,
> potentially widening its adoption, where performance is a concern.

I guess the writes to these are rare, right? So perhaps it's not such a
big deal :)

> > I also wonder how much difference the baking-in of the actual
> > differences at compile time makes. Most (all?) of this code is inlined.
>
> If the inlined function expects to receive a prlist_head *, instead of a
> list_head *, doesn't it help turning runtime bugs into buildtime bugs?

In principle it's not a bug to use the prmem helpers where the regular
ones would do, it's just slower (assuming the types are the same). But
mostly, it's a way to avoid actually copying and pasting most of the
implementations of most of the data structures.

I see some other replies in the thread already, but this seems not so
good to me.

Tycho
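To make the list_impl.h idea concrete, the shared implementation would be written purely in terms of the setter macros, so the same body serves both list flavours (a sketch; linux/list_impl.h is hypothetical and does not exist in the kernel):

```c
/*
 * Hypothetical linux/list_impl.h: included only after the including
 * header has defined list_set_next()/list_set_prev(), so the same
 * insertion helper expands to plain stores in list.h and to
 * write-rare calls in prlist.h.
 */
static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	list_set_prev(next, new);
	list_set_next(new, next);
	list_set_prev(new, prev);
	list_set_next(prev, new);
}
```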
> +static bool is_address_protected(void *p)
> +{
> + struct page *page;
> + struct vmap_area *area;
> +
> + if (unlikely(!is_vmalloc_addr(p)))
> + return false;
> + page = vmalloc_to_page(p);
> + if (unlikely(!page))
> + return false;
> + wmb(); /* Flush changes to the page table - is it needed? */
No.
The rest of this is just pretty verbose and seems to have been very
heavily copied and pasted. I guess that's OK for test code, though.
Jon,

So the below document is a prime example for why I think RST sucks. As
a text document readability is greatly diminished by all the markup
nonsense.

This stuff should not become write-only content like html and other
gunk. The actual text file is still the primary means of reading this.

> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index 26b735cefb93..1a90fa878d8d 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm
> +   prmem
>
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/core-api/prmem.rst b/Documentation/core-api/prmem.rst
> new file mode 100644
> index 000000000000..16d7edfe327a
> --- /dev/null
> +++ b/Documentation/core-api/prmem.rst
> @@ -0,0 +1,172 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _prmem:
> +
> +Memory Protection
> +=================
> +
> +:Date: October 2018
> +:Author: Igor Stoppa <igor.stoppa@huawei.com>
> +
> +Foreword
> +--------
> +- In a typical system using some sort of RAM as execution environment,
> +  **all** memory is initially writable.
> +
> +- It must be initialized with the appropriate content, be it code or data.
> +
> +- Said content typically undergoes modifications, i.e. relocations or
> +  relocation-induced changes.
> +
> +- The present document doesn't address such transients.
> +
> +- Kernel code is protected at system level and, unlike data, it doesn't
> +  require special attention.

What does this even mean?

> +Protection mechanism
> +--------------------
> +
> +- When available, the MMU can write protect memory pages that would be
> +  otherwise writable.

Again; what does this really want to say?

> +- The protection has page-level granularity.

I don't think Linux supports non-paging MMUs.

> +- An attempt to overwrite a protected page will trigger an exception.
> +- **Write protected data must go exclusively to write protected pages**
> +- **Writable data must go exclusively to writable pages**

WTH is with all those ** ?

> +Available protections for kernel data
> +-------------------------------------
> +
> +- **constant**
> +   Labelled as **const**, the data is never supposed to be altered.
> +   It is statically allocated - if it has any memory footprint at all.
> +   The compiler can even optimize it away, where possible, by replacing
> +   references to a **const** with its actual value.
> +
> +- **read only after init**
> +   By tagging an otherwise ordinary statically allocated variable with
> +   **__ro_after_init**, it is placed in a special segment that will
> +   become write protected, at the end of the kernel init phase.
> +   The compiler has no notion of this restriction and it will treat any
> +   write operation on such a variable as legal. However, assignments that
> +   are attempted after the write protection is in place, will cause
> +   exceptions.
> +
> +- **write rare after init**
> +   This can be seen as a variant of read only after init, which uses the
> +   tag **__wr_after_init**. It is also limited to statically allocated
> +   memory. It is still possible to alter this type of variable, after
> +   the kernel init phase is complete, however it can be done exclusively
> +   with special functions, instead of the assignment operator. Using the
> +   assignment operator after conclusion of the init phase will still
> +   trigger an exception. It is not possible to transition a certain
> +   variable from __wr_after_init to a permanent read-only status, at
> +   runtime.
> +
> +- **dynamically allocated write-rare / read-only**
> +   After defining a pool, memory can be obtained through it, primarily
> +   through the **pmalloc()** allocator. The exact writability state of the
> +   memory obtained from **pmalloc()** and friends can be configured when
> +   creating the pool. At any point it is possible to transition the memory
> +   currently associated to the pool to a less permissive write status.
> +   Once memory has become read-only, the only valid operation, besides
> +   reading, is to release it, by destroying the pool it belongs to.

Can we ditch all the ** nonsense and put whitespace in there?
More paragraphs and whitespace are more good.

Also, I really don't like how you differentiate between static and
dynamic wr.

> +Protecting dynamically allocated memory
> +---------------------------------------
> +
> +When dealing with dynamically allocated memory, three options are
> +available for configuring its writability state:
> +
> +- **Options selected when creating a pool**
> +  When creating the pool, it is possible to choose one of the following:
> +  - **PMALLOC_MODE_RO**
> +    - Writability at allocation time: *WRITABLE*
> +    - Writability at protection time: *NONE*
> +  - **PMALLOC_MODE_WR**
> +    - Writability at allocation time: *WRITABLE*
> +    - Writability at protection time: *WRITE-RARE*
> +  - **PMALLOC_MODE_AUTO_RO**
> +    - Writability at allocation time:
> +      - the latest allocation: *WRITABLE*
> +      - every other allocation: *NONE*
> +    - Writability at protection time: *NONE*
> +  - **PMALLOC_MODE_AUTO_WR**
> +    - Writability at allocation time:
> +      - the latest allocation: *WRITABLE*
> +      - every other allocation: *WRITE-RARE*
> +    - Writability at protection time: *WRITE-RARE*
> +  - **PMALLOC_MODE_START_WR**
> +    - Writability at allocation time: *WRITE-RARE*
> +    - Writability at protection time: *WRITE-RARE*

That's just unreadable gibberish from here. Also what? We already have
RO, why do you need more RO?

> +
> +  **Remarks:**
> +  - The "AUTO" modes perform automatic protection of the content, whenever
> +    the current vmap_area is used up and a new one is allocated.
> +    - At that point, the vmap_area being phased out is protected.
> +    - The size of the vmap_area depends on various parameters.
> +    - It might not be possible to know for sure *when* certain data will
> +      be protected.

Surely that is a problem?

> +    - The functionality is provided as a tradeoff between hardening and speed.

Which you fail to explain.

> +    - Its usefulness depends on the specific use case at hand.

How about you write sensible text inside the option descriptions
instead? This is not a presentation; fewer bullets, more content.

> +- Not only the pmalloc memory must be protected, but also any reference to
> +  it that might become the target for an attack. The attack would replace
> +  a reference to the protected memory with a reference to some other,
> +  unprotected, memory.

I still don't really understand the whole write-rare thing; how does it
really help? If we can write in kernel memory, we can write to
page-tables too.

And I don't think this document even begins to explain _why_ you're
doing any of this. How does it help?

> +- The users of rare write must take care of ensuring the atomicity of the
> +  action, with respect to the way they use the data being altered; for
> +  example, take a lock before making a copy of the value to modify (if
> +  it's relevant), then alter it, issue the call to rare write and finally
> +  release the lock. Some special scenarios might be exempt from the need
> +  for locking, but in general rare-write must be treated as an operation
> +  that can incur races.

What?!
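The locking discipline that last quoted paragraph tries to describe boils down to a pattern like this (a sketch; the object and lock names are made up, and a sleeping lock is used because the vmap-based write-rare path can sleep):

```c
/* Sketch: callers serialize the read-modify-rare-write sequence. */
mutex_lock(&obj_lock);
tmp = protected_obj.counter;	/* copy the current value */
tmp++;				/* modify the copy        */
wr_memcpy(&protected_obj.counter, &tmp, sizeof(tmp));
mutex_unlock(&obj_lock);
```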
On Wed, Oct 24, 2018 at 12:34:59AM +0300, Igor Stoppa wrote:
> As preparation to using write rare on the nodes of various types of
> lists, specify that the fields in the basic data structures must be
> aligned to sizeof(void *)
>
> It is meant to ensure that any static allocation will not cross a page
> boundary, to allow pointers to be updated in one step.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Masahiro Yamada <yamada.masahiro@socionext.com>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Pekka Enberg <penberg@kernel.org>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Lihao Liang <lianglihao@huawei.com>
> CC: linux-kernel@vger.kernel.org
> ---
> include/linux/types.h | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 9834e90aa010..53609bbdcf0f 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -183,17 +183,29 @@ typedef struct {
> } atomic64_t;
> #endif
>
> +#ifdef CONFIG_PRMEM
> struct list_head {
> - struct list_head *next, *prev;
> -};
> + struct list_head *next __aligned(sizeof(void *));
> + struct list_head *prev __aligned(sizeof(void *));
> +} __aligned(sizeof(void *));
>
> -struct hlist_head {
> - struct hlist_node *first;
> +struct hlist_node {
> + struct hlist_node *next __aligned(sizeof(void *));
> + struct hlist_node **pprev __aligned(sizeof(void *));
> +} __aligned(sizeof(void *));
Argh.. are we really supporting platforms that do not naturally align
this? If so, which and can't we fix those?
Also, if you force alignment on a member, the structure as a whole
inherits the largest member alignment.
Also, you made something that was simple an unreadable mess without
proper justification (ie. you fail to show need).
On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> Some of the data structures used in list management are composed by two
> pointers. Since the kernel is now configured by default to randomize the
> layout of data structures solely composed of pointers, this might
> prevent correct type punning between these structures and their write
> rare counterpart.
'might' doesn't really work for me. Either it does or it does not.
On Wed, Oct 24, 2018 at 12:35:01AM +0300, Igor Stoppa wrote:
> In some cases, all the data needing protection can be allocated from a pool
> in one go, as directly writable, then initialized and protected.
> The sequence is relatively short and it's acceptable to leave the entire
> data set unprotected.
>
> In other cases, this is not possible, because the data will trickle over
> a relatively long period of time, in a non-predictable way, possibly for
> the entire duration of the operations.
>
> For these cases, the safe approach is to have the memory already write
> protected, when allocated. However, this will require replacing any
> direct assignment with calls to functions that can perform write rare.
>
> Since lists are one of the most commonly used data structures in kernel,
> they are the first candidate for receiving write rare extensions.
>
> This patch implements basic functionality for altering said lists.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Kate Stewart <kstewart@linuxfoundation.org>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> CC: Philippe Ombredanne <pombredanne@nexb.com>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Josh Triplett <josh@joshtriplett.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Lai Jiangshan <jiangshanlai@gmail.com>
> CC: linux-kernel@vger.kernel.org
> ---
> MAINTAINERS | 1 +
> include/linux/prlist.h | 934 +++++++++++++++++++++++++++++++++++++++++
I'm not at all sure I understand the Changelog, or how it justifies
duplicating almost 1k lines of code.
Sure lists aren't the most complicated thing we have, but duplicating
that much is still very _very_ bad form. Why are we doing this?
On Wed, Oct 24, 2018 at 12:34:49AM +0300, Igor Stoppa wrote:

> +static __always_inline

That's far too large for inline.

> +bool wr_memset(const void *dst, const int c, size_t n_bytes)
> +{
> +	size_t size;
> +	unsigned long flags;
> +	uintptr_t d = (uintptr_t)dst;
> +
> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
> +		return false;
> +	while (n_bytes) {
> +		struct page *page;
> +		uintptr_t base;
> +		uintptr_t offset;
> +		uintptr_t offset_complement;
> +
> +		local_irq_save(flags);
> +		page = virt_to_page(d);
> +		offset = d & ~PAGE_MASK;
> +		offset_complement = PAGE_SIZE - offset;
> +		size = min(n_bytes, offset_complement);
> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +			local_irq_restore(flags);
> +			return false;
> +		}
> +		memset((void *)(base + offset), c, size);
> +		vunmap((void *)base);

BUG

> +		d += size;
> +		n_bytes -= size;
> +		local_irq_restore(flags);
> +	}
> +	return true;
> +}
> +
> +static __always_inline

Similarly large

> +bool wr_memcpy(const void *dst, const void *src, size_t n_bytes)
> +{
> +	size_t size;
> +	unsigned long flags;
> +	uintptr_t d = (uintptr_t)dst;
> +	uintptr_t s = (uintptr_t)src;
> +
> +	if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG))
> +		return false;
> +	while (n_bytes) {
> +		struct page *page;
> +		uintptr_t base;
> +		uintptr_t offset;
> +		uintptr_t offset_complement;
> +
> +		local_irq_save(flags);
> +		page = virt_to_page(d);
> +		offset = d & ~PAGE_MASK;
> +		offset_complement = PAGE_SIZE - offset;
> +		size = (size_t)min(n_bytes, offset_complement);
> +		base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +		if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +			local_irq_restore(flags);
> +			return false;
> +		}
> +		__write_once_size((void *)(base + offset), (void *)s, size);
> +		vunmap((void *)base);

Similarly BUG.

> +		d += size;
> +		s += size;
> +		n_bytes -= size;
> +		local_irq_restore(flags);
> +	}
> +	return true;
> +}
> +
> +static __always_inline

Guess what..

> +uintptr_t __wr_rcu_ptr(const void *dst_p_p, const void *src_p)
> +{
> +	unsigned long flags;
> +	struct page *page;
> +	void *base;
> +	uintptr_t offset;
> +	const size_t size = sizeof(void *);
> +
> +	if (WARN(!__is_wr_after_init(dst_p_p, size), WR_ERR_RANGE_MSG))
> +		return (uintptr_t)NULL;
> +	local_irq_save(flags);
> +	page = virt_to_page(dst_p_p);
> +	offset = (uintptr_t)dst_p_p & ~PAGE_MASK;
> +	base = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
> +		local_irq_restore(flags);
> +		return (uintptr_t)NULL;
> +	}
> +	rcu_assign_pointer((*(void **)(offset + (uintptr_t)base)), src_p);
> +	vunmap(base);

Also still bug.

> +	local_irq_restore(flags);
> +	return (uintptr_t)src_p;
> +}

Also, I see an amount of duplication here that shows you're not nearly
lazy enough.
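One lazier shape for the three functions above would be a single worker taking an op code, with thin entry points on top (a sketch; __wr_op and its enum are hypothetical, and the IRQ games are left out since, as noted in the review, vunmap() cannot run with IRQs disabled anyway):

```c
/* Hypothetical common worker for the wr_* family. */
enum wr_op_type { WR_MEMCPY, WR_MEMSET };

static bool __wr_op(unsigned long dst, unsigned long src, size_t len,
		    enum wr_op_type op)
{
	while (len) {
		struct page *page = virt_to_page(dst);
		unsigned long offset = dst & ~PAGE_MASK;
		size_t chunk = min(len, (size_t)(PAGE_SIZE - offset));
		void *base = vmap(&page, 1, VM_MAP, PAGE_KERNEL);

		if (WARN(!base, WR_ERR_PAGE_MSG))
			return false;
		if (op == WR_MEMSET)
			memset(base + offset, (int)src, chunk);
		else
			memcpy(base + offset, (void *)src, chunk);
		vunmap(base);
		dst += chunk;
		if (op == WR_MEMCPY)
			src += chunk;
		len -= chunk;
	}
	return true;
}
```

wr_memset() and wr_memcpy() then become one-line wrappers, and the __is_wr_after_init() range check happens once, in each wrapper.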
On Fri, Oct 26, 2018 at 11:32:05AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> > Some of the data structures used in list management are composed by two
> > pointers. Since the kernel is now configured by default to randomize the
> > layout of data structures solely composed of pointers, this might
> > prevent correct type punning between these structures and their write
> > rare counterpart.
>
> 'might' doesn't really work for me. Either it does or it does not.
He means "Depending on the random number generator, the two pointers
might be AB or BA. If they're of opposite polarity (50% of the time),
it _will_ break, and 50% of the time it _won't_ break."
On Fri, Oct 26, 2018 at 11:26:09AM +0200, Peter Zijlstra wrote:
> Jon,
>
> So the below document is a prime example for why I think RST sucks. As a
> text document readability is greatly diminished by all the markup
> nonsense.
>
> This stuff should not become write-only content like html and other
> gunk. The actual text file is still the primary means of reading this.
I think Igor neglected to read doc-guide/sphinx.rst:
Specific guidelines for the kernel documentation
------------------------------------------------
Here are some specific guidelines for the kernel documentation:
* Please don't go overboard with reStructuredText markup. Keep it
simple. For the most part the documentation should be plain text with
just enough consistency in formatting that it can be converted to
other formats.
I agree that it's overly marked up.
On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I still don't really understand the whole write-rare thing; how does it
> really help? If we can write in kernel memory, we can write to
> page-tables too.

I can speak to the general goal. The specific implementation here may
not fit all the needs, but it's a good starting point for discussion.

One aspect of hardening the kernel against attack is reducing the
internal attack surface. Not all flaws are created equal, so there is
variation in what limitations an attacker may have when exploiting
flaws (not many flaws end up being a fully controlled "write anything,
anywhere, at any time"). By making the more sensitive data structures
of the kernel read-only, we reduce the risk of an attacker finding a
path to manipulating the kernel's behavior in a significant way.

Examples of typical sensitive targets are function pointers, security
policy, and page tables. Having these "read only at rest" makes them
much harder to control by an attacker using memory integrity flaws.

We already have the .rodata section for "const" things. Those are
trivially read-only. For example there is a long history of making sure
that function pointer tables are marked "const".

However, some things need to stay writable. Some of those in .data
aren't changed after __init, so we added the __ro_after_init which made
them read-only for the life of the system after __init. However, we're
still left with a lot of sensitive structures either in .data or
dynamically allocated, that need a way to be made read-only for most of
the time, but get written to during very specific times.

The "write rarely" name itself may not sufficiently describe what is
wanted either (I'll take the blame for the inaccurate name), so I'm
open to new ideas there. The implementation requirements for the
"sensitive data read-only at rest" feature are rather tricky:

- allow writes only from specific places in the kernel
- keep those locations inline to avoid making them trivial ROP targets
- keep the writeability window open only to a single uninterruptable CPU
- fast enough to deal with page table updates

The proposal I made a while back only covered .data things (and used
x86-specific features). Igor's proposal builds on this by including a
way to do this with dynamic allocation too, which greatly expands the
scope of structures that can be protected. Given that the x86-only
method of write-window creation was firmly rejected, this is a new
proposal for how to do it (vmap window). Using switch_mm() has also
been suggested, etc.

We need to find a good way to do the write-windowing that works well
for static and dynamic structures _and_ for the page tables... this
continues to be tricky.

Making it resilient against ROP-style targets makes it difficult to
deal with certain data structures (like list manipulation). In my
earlier RFC, I tried to provide enough examples of where this could get
used to let people see some of the complexity[1]. Igor's series expands
this to even more examples using dynamic allocation.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/write-rarely

-- 
Kees Cook
Am 26.10.18 um 11:26 schrieb Peter Zijlstra:
>> Jon,
>>
>> So the below document is a prime example for why I think RST sucks. As a
>> text document readability is greatly diminished by all the markup
>> nonsense.
>>
>> This stuff should not become write-only content like html and other
>> gunk. The actual text file is still the primary means of reading this.
>
> I think Igor neglected to read doc-guide/sphinx.rst:
>
>     Specific guidelines for the kernel documentation
>     ------------------------------------------------
>
>     Here are some specific guidelines for the kernel documentation:
>
>     * Please don't go overboard with reStructuredText markup. Keep it
>       simple. For the most part the documentation should be plain text with
>       just enough consistency in formatting that it can be converted to
>       other formats.
>
> I agree that it's overly marked up.

To add my two cents ..

> WTH is with all those ** ?

I guess Igor was looking for a definition list ...

>> +Available protections for kernel data
>> +-------------------------------------
>> +
>> +- **constant**
>> +   Labelled as **const**, the data is never supposed to be altered.
>> +   It is statically allocated - if it has any memory footprint at all.
>> +   The compiler can even optimize it away, where possible, by replacing
>> +   references to a **const** with its actual value.

The same section, written as a plain definition list:

+Available protections for kernel data
+-------------------------------------
+
+constant
+   Labelled as const, the data is never supposed to be altered.
+   It is statically allocated - if it has any memory footprint at all.
+   The compiler can even optimize it away, where possible, by replacing
+   references to a const with its actual value.

see "Lists and Quote-like blocks" [1]

[1] http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#lists-and-quote-like-blocks

-- 
Markus
On Fri, 26 Oct 2018 11:26:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Jon,
>
> So the below document is a prime example for why I think RST sucks. As a
> text document readability is greatly diminished by all the markup
> nonsense.

Please don't confuse RST with the uses of it. The fact that one can do
amusing things with the C preprocessor doesn't, on its own, make C
suck...

> This stuff should not become write-only content like html and other
> gunk. The actual text file is still the primary means of reading this.

I agree that the text file comes first, and I agree that the markup in
this particular document is excessive.

Igor, just like #ifdefs make code hard to read, going overboard with
RST markup makes text hard to read. It's a common trap to fall into,
because it lets you make slick-looking web pages, but that is not our
primary goal here.

Many thanks for writing documentation for this feature; that already
puts you way ahead of a discouraging number of contributors. But could
I ask you, please, to make a pass over it and reduce the markup to a
minimum? Using lists as suggested by Markus would help here.

Thanks,

jon
On Wed, 24 Oct 2018 17:03:01 +0300
Igor Stoppa <igor.stoppa@gmail.com> wrote:

> I was hoping that by CC-ing kernel.org, the explicit recipients would
> get both the mail directly intended for them (as subsystem
> maintainers/supporters) and the rest.
>
> The next time I'll err in the opposite direction.

Please don't.

> In the meanwhile, please find the whole set here:
>
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/3

Note, it is critical that every change log stands on its own. You do
not need to Cc the entire patch set to everyone. Each patch should have
enough information in it to know exactly what the patch does. It's OK
if each change log has duplicate information from other patches. The
important part is that one should be able to look at the change log of
a specific patch and understand exactly what the patch is doing.

This is because 5 years from now, if someone does a git blame and comes
up with this commit, they won't have access to the patch series. All
they will have is this single commit log to explain why these changes
were done. If a change log depends on other commits for context, it is
insufficient.

Thanks,

-- Steve
On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote: > On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > I still don't really understand the whole write-rare thing; how does it > > really help? If we can write in kernel memory, we can write to > > page-tables too. > One aspect of hardening the kernel against attack is reducing the > internal attack surface. Not all flaws are created equal, so there is > variation in what limitations an attacker may have when exploiting > flaws (not many flaws end up being a fully controlled "write anything, > anywhere, at any time"). By making the more sensitive data structures > of the kernel read-only, we reduce the risk of an attacker finding a > path to manipulating the kernel's behavior in a significant way. > > Examples of typical sensitive targets are function pointers, security > policy, and page tables. Having these "read only at rest" makes them > much harder to control by an attacker using memory integrity flaws. Because 'write-anywhere' exploits are easier than (and the typical first step to) arbitrary code execution thingies? > The "write rarely" name itself may not sufficiently describe what is > wanted either (I'll take the blame for the inaccurate name), so I'm > open to new ideas there. The implementation requirements for the > "sensitive data read-only at rest" feature are rather tricky: > > - allow writes only from specific places in the kernel > - keep those locations inline to avoid making them trivial ROP targets > - keep the writeability window open only to a single uninterruptable CPU The current patch set does not achieve that because it uses a global address space for the alias mapping (vmap) which is equally accessible from all CPUs. > - fast enough to deal with page table updates The proposed implementation needs page-tables for the alias; I don't see how you could ever do R/O page-tables when you need page-tables to modify your page-tables. 
And this is entirely irrespective of performance. > The proposal I made a while back only covered .data things (and used > x86-specific features). Oh, right, that CR0.WP stuff. > Igor's proposal builds on this by including a > way to do this with dynamic allocation too, which greatly expands the > scope of structures that can be protected. Given that the x86-only > method of write-window creation was firmly rejected, this is a new > proposal for how to do it (vmap window). Using switch_mm() has also > been suggested, etc. Right... /me goes find the patches we did for text_poke. Hmm, those never seem to have made it: https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com like that. That approach will in fact work and not be a completely broken mess like this thing. > We need to find a good way to do the write-windowing that works well > for static and dynamic structures _and_ for the page tables... this > continues to be tricky. > > Making it resilient against ROP-style targets makes it difficult to > deal with certain data structures (like list manipulation). In my > earlier RFC, I tried to provide enough examples of where this could > get used to let people see some of the complexity[1]. Igor's series > expands this to even more examples using dynamic allocation. Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be fast enough for much of anything. And I don't suppose we can take the WP fault and then fix up from there, because if we're doing R/O page-tables, that'll increase the fault depth and we'll double fault all the time, and triple fault where we currently double fault. And we all know how _awesome_ triple faults are. But duplicating (and wrapping in gunk) whole APIs is just not going to work.
On 10/23/2018 05:34 PM, Igor Stoppa wrote:
> Prevent leaks of protected memory to userspace.
> The protection from overwrites from userspace is already available, once
> the memory is write protected.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> CC: Kees Cook <keescook@chromium.org>
> CC: Chris von Recklinghausen <crecklin@redhat.com>
> CC: linux-mm@kvack.org
> CC: linux-kernel@vger.kernel.org
> ---
> include/linux/prmem.h | 24 ++++++++++++++++++++++++
> mm/usercopy.c | 5 +++++
> 2 files changed, 29 insertions(+)
>
> diff --git a/include/linux/prmem.h b/include/linux/prmem.h
> index cf713fc1c8bb..919d853ddc15 100644
> --- a/include/linux/prmem.h
> +++ b/include/linux/prmem.h
> @@ -273,6 +273,30 @@ struct pmalloc_pool {
> uint8_t mode;
> };
>
> +void __noreturn usercopy_abort(const char *name, const char *detail,
> + bool to_user, unsigned long offset,
> + unsigned long len);
> +
> +/**
> + * check_pmalloc_object() - helper for hardened usercopy
> + * @ptr: the beginning of the memory to check
> + * @n: the size of the memory to check
> + * @to_user: copy to userspace or from userspace
> + *
> + * If the check is ok, it will fall-through, otherwise it will abort.
> + * The function is inlined, to minimize the performance impact of the
> + * extra check that can end up on a hot path.
> + * Non-exhaustive micro benchmarking with QEMU x86_64 shows a reduction of
> + * the time spent in this fragment by 60%, when inlined.
> + */
> +static inline
> +void check_pmalloc_object(const void *ptr, unsigned long n, bool to_user)
> +{
> + if (unlikely(__is_wr_after_init(ptr, n) || __is_wr_pool(ptr, n)))
> + usercopy_abort("pmalloc", "accessing pmalloc obj", to_user,
> + (const unsigned long)ptr, n);
> +}
> +
> /*
> * The write rare functionality is fully implemented as __always_inline,
> * to prevent having an internal function call that is capable of modifying
> diff --git a/mm/usercopy.c b/mm/usercopy.c
> index 852eb4e53f06..a080dd37b684 100644
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -22,8 +22,10 @@
> #include <linux/thread_info.h>
> #include <linux/atomic.h>
> #include <linux/jump_label.h>
> +#include <linux/prmem.h>
> #include <asm/sections.h>
>
> +
> /*
> * Checks if a given pointer and length is contained by the current
> * stack frame (if possible).
> @@ -284,6 +286,9 @@ void __check_object_size(const void *ptr, unsigned long n, bool to_user)
>
> /* Check for object in kernel to avoid text exposure. */
> check_kernel_text_object((const unsigned long)ptr, n, to_user);
> +
> + /* Check if object is from a pmalloc chunk. */
> + check_pmalloc_object(ptr, n, to_user);
> }
> EXPORT_SYMBOL(__check_object_size);
>
Could you add code somewhere (lkdtm driver if possible) to demonstrate
the issue and verify the code change?
Thanks,
Chris
On 25/10/2018 01:24, Dave Hansen wrote: >> +static __always_inline bool __is_wr_after_init(const void *ptr, size_t size) >> +{ >> + size_t start = (size_t)&__start_wr_after_init; >> + size_t end = (size_t)&__end_wr_after_init; >> + size_t low = (size_t)ptr; >> + size_t high = (size_t)ptr + size; >> + >> + return likely(start <= low && low < high && high <= end); >> +} > > size_t is an odd type choice for doing address arithmetic. It seemed more portable than unsigned long. >> +/** >> + * wr_memset() - sets n bytes of the destination to the c value >> + * @dst: beginning of the memory to write to >> + * @c: byte to replicate >> + * @size: amount of bytes to copy >> + * >> + * Returns true on success, false otherwise. >> + */ >> +static __always_inline >> +bool wr_memset(const void *dst, const int c, size_t n_bytes) >> +{ >> + size_t size; >> + unsigned long flags; >> + uintptr_t d = (uintptr_t)dst; >> + >> + if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) >> + return false; >> + while (n_bytes) { >> + struct page *page; >> + uintptr_t base; >> + uintptr_t offset; >> + uintptr_t offset_complement > > Again, these are really odd choices for types. vmap() returns a void* > pointer, on which you can do arithmetic. I wasn't sure how much I could rely on the compiler not doing some unwanted optimizations. > Why bother keeping another > type to which you have to cast to and from? For the above reason. If I'm worrying unnecessarily, I can switch back to void *. It certainly is easier to use. > BTW, our usual "pointer stored in an integer type" is 'unsigned long', > if a pointer needs to be manipulated. Yes, I noticed that, but it seemed strange ... size_t corresponds to unsigned long, afaik, but it seems that I have not fully understood where to use it. Anyway, I can stick to the convention and use unsigned long. > >> + local_irq_save(flags); > > Why are you doing the local_irq_save()? 
The idea was to avoid the case where an attack would somehow freeze the core doing the write-rare operation, while the temporary mapping is accessible. I have seen comments about using mappings that are private to the current core (and I will reply to those comments as well), but this approach seems architecture-dependent, while I was looking for a solution that, albeit not 100% reliable, would work on any system with an MMU. This would not prevent each arch from coming up with its own custom implementation that provides better coverage, performance, etc. >> + page = virt_to_page(d); >> + offset = d & ~PAGE_MASK; >> + offset_complement = PAGE_SIZE - offset; >> + size = min(n_bytes, offset_complement); >> + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL); > > Can you even call vmap() (which sleeps) with interrupts off? I accidentally disabled the sleep-while-atomic debugging and I totally missed this problem :-( However, to answer your question, nothing exploded while I was testing (without that type of debugging). I suspect I was just "lucky". Or maybe I was simply not triggering the sleeping sub-case. As I understood the code, sleeping _might_ happen, but it's not going to happen systematically. I wonder if I could split vmap() into two parts: first the sleeping one, with interrupts enabled, then the non-sleeping one, with interrupts disabled. I need to read the code more carefully, but it seems that sleeping might happen when memory for the mapping metadata is not immediately available. BTW, wouldn't the might_sleep() call belong more to the part which really sleeps, rather than to the whole vmap()? >> + if (WARN(!base, WR_ERR_PAGE_MSG)) { >> + local_irq_restore(flags); >> + return false; >> + } > > You really need some kmap_atomic()-style accessors to wrap this stuff > for you. This little pattern is repeated over and over. I really need to learn more about the way the kernel works and is structured. It's a work in progress. Thanks for the advice. > ... 
>> +const char WR_ERR_RANGE_MSG[] = "Write rare on invalid memory range."; >> +const char WR_ERR_PAGE_MSG[] = "Failed to remap write rare page."; > > Doesn't the compiler de-duplicate duplicated strings for you? Is there > any reason to declare these like this? I noticed I had made some accidental modifications in a couple of cases, when replicating the command. So I thought that if I really want to use the same string, why not do it explicitly? It also seemed easier, in case I want to tweak the message: I need to do it only in one place. -- igor
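The remap-write-unmap pattern Dave is commenting on, and the kmap_atomic()-style accessors he suggests, can be sketched as a userspace analogue: a read-only mapping plays the role of the protected memory, and a short-lived writable alias of the same pages (obtained here via memfd_create(), standing in for vmap()) plays the role of the temporary mapping. All names below (wr_init, wr_map, wr_unmap, wr_memcpy, WR_SIZE) are illustrative, not the patchset's API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { WR_SIZE = 4096 };

static int wr_fd = -1;

/* Set up the "protected" view: a read-only mapping of the pages. */
const char *wr_init(void)
{
	wr_fd = memfd_create("wr_demo", 0);
	if (wr_fd < 0 || ftruncate(wr_fd, WR_SIZE) < 0)
		return NULL;
	return mmap(NULL, WR_SIZE, PROT_READ, MAP_SHARED, wr_fd, 0);
}

/* kmap_atomic()-style pair: open and close a short-lived writable
 * alias of the same backing pages. */
static char *wr_map(void)
{
	return mmap(NULL, WR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		    wr_fd, 0);
}

static void wr_unmap(char *alias)
{
	munmap(alias, WR_SIZE);	/* close the write window right away */
}

/* The write-rare primitive: patch the protected area via the alias. */
int wr_memcpy(size_t off, const void *src, size_t n)
{
	char *alias;

	if (off + n > WR_SIZE)	/* refuse writes outside the region */
		return 0;
	alias = wr_map();
	if (alias == MAP_FAILED)
		return 0;
	memcpy(alias + off, src, n);
	wr_unmap(alias);
	return 1;
}
```

With such a pair, every primitive (memset, memcpy, the atomics) becomes a thin wrapper around the same open/close sequence, which is exactly the deduplication the review is asking for.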
On 25/10/2018 01:26, Dave Hansen wrote:
> On 10/23/18 2:34 PM, Igor Stoppa wrote:
>> +#define VM_PMALLOC 0x00000100 /* pmalloc area - see docs */
>> +#define VM_PMALLOC_WR 0x00000200 /* pmalloc write rare area */
>> +#define VM_PMALLOC_PROTECTED 0x00000400 /* pmalloc protected area */
>
> Please introduce things as you use them. It's impossible to review a
> patch that just says "see docs" that doesn't contain any docs. :)
Yes, otoh it's a big pain in the neck to keep the docs split into
smaller patches interleaved with the code, at least while the code is
still in flux.
And since the docs refer to the sources, for the automated documentation
of the API, I cannot just put the documentation at the beginning of the
patchset.
Can I keep the docs as they are, for now, till the code is more stable?
--
igor
On 25/10/2018 01:28, Dave Hansen wrote:
> On 10/23/18 2:34 PM, Igor Stoppa wrote:
>> Wrappers around the basic write rare functionality, addressing several
>> common data types found in the kernel, allowing to specify the new
>> values through immediates, like constants and defines.
>
> I have to wonder whether this is the right way, or whether a
> size-agnostic function like put_user() is the right way. put_user() is
> certainly easier to use.
I definitely did not like it either.
But it was the best that came to my mind ...
The main purpose of this code was to show what I wanted to do.
Once more, thanks for pointing out a better way to do it.
--
igor
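For reference, a put_user()-style, size-agnostic interface along the lines Dave suggests could look roughly like this: a single macro derives the size from the destination and funnels everything through one copy primitive. The macro name wr_assign is hypothetical, and wr_memcpy() here is a plain-memcpy stand-in for the patchset's write-rare copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the write-rare copy primitive; the real one would go
 * through a temporary writable alias instead of a direct memcpy(). */
static int wr_memcpy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 1;
}

/* Size-agnostic assignment: size and type come from the destination,
 * so no per-type wr_char/wr_short/wr_int/wr_long family is needed. */
#define wr_assign(dst, val) ({				\
	__typeof__(dst) __wr_v = (val);			\
	wr_memcpy(&(dst), &__wr_v, sizeof(dst));	\
})
```

The typeof-based temporary also gives the usual macro hygiene: the value expression is evaluated once, and assigning an incompatible type is caught at compile time.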
On 25/10/2018 17:43, Dave Hansen wrote: >> +static bool is_address_protected(void *p) >> +{ >> + struct page *page; >> + struct vmap_area *area; >> + >> + if (unlikely(!is_vmalloc_addr(p))) >> + return false; >> + page = vmalloc_to_page(p); >> + if (unlikely(!page)) >> + return false; >> + wmb(); /* Flush changes to the page table - is it needed? */ > > No. ok > The rest of this is just pretty verbose and seems to have been very > heavily copied and pasted. I guess that's OK for test code, though. I was tempted to play with macros, as templates to generate tests on the fly, according to the type being passed. But I was afraid it might generate an even stronger rejection than the rest of the patchset already has. Would it be acceptable/preferable? -- igor
On 25/10/2018 03:13, Matthew Wilcox wrote: > On Thu, Oct 25, 2018 at 02:01:02AM +0300, Igor Stoppa wrote: >>>> @@ -1747,6 +1750,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, >>>> if (!addr) >>>> return NULL; >>>> + va = __find_vmap_area((unsigned long)addr); >>>> + for (i = 0; i < va->vm->nr_pages; i++) >>>> + va->vm->pages[i]->area = va; >>> >>> I don't like it that you're calling this for _every_ vmalloc() caller >>> when most of them will never use this. Perhaps have page->va be initially >>> NULL and then cache the lookup in it when it's accessed for the first time. >>> >> >> If __find_vmap_area() was part of the API, this loop could be left out from >> __vmalloc_node_range() and the user of the allocation could initialize the >> field, if needed. >> >> What is the reason for keeping __find_vmap_area() private? > > Well, for one, you're walking the rbtree without holding the spinlock, > so you're going to get crashes. I don't see why we shouldn't export > find_vmap_area() though. Argh, yes, sorry. But find_vmap_area() would be enough for what I need. > Another way we could approach this is to embed the vmap_area in the > vm_struct. It'd require a bit of juggling of the alloc/free paths in > vmalloc, but it might be worthwhile. I have a feeling of unease about the whole vmap_area / vm_struct duality. They clearly are different types, with different purposes, but here and there there are functions that are named after some "area", yet they actually refer to vm_struct pointers. I wouldn't mind spending some time understanding the reason for this apparently bizarre choice, but after I have emerged from the prmem swamp. For now I'd stick to find_vmap_area(). -- igor
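Matthew's caching suggestion from the quoted exchange, reduced to its generic pattern: leave the back-pointer NULL at allocation time and fill it in lazily on first access, so the many vmalloc() callers that never need it pay nothing. All types and names below are illustrative stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct area { int id; };	/* stand-in for struct vmap_area */

struct page_s {			/* stand-in for struct page */
	struct area *area;	/* NULL until the first lookup */
};

static int slow_lookups;	/* counts the expensive searches */

/* Models the rbtree walk (find_vmap_area()) done under the spinlock. */
static struct area *find_area_slow(struct page_s *p)
{
	static struct area the_area = { 1 };

	(void)p;
	slow_lookups++;
	return &the_area;
}

/* Lazy caching: pay for the search only on first access, then serve
 * the cached back-pointer. */
struct area *page_area(struct page_s *p)
{
	if (!p->area)
		p->area = find_area_slow(p);
	return p->area;
}
```

In the kernel the caching step would of course need the same locking as the lookup itself; the sketch only shows the cost-shifting idea.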
On 29/10/2018 11:45, Chris von Recklinghausen wrote:
[...]
> Could you add code somewhere (lkdtm driver if possible) to demonstrate
> the issue and verify the code change?
Sure.
Eventually, I'd like to add test cases for each functionality.
I didn't do it right away for those parts that are either not
immediately needed for the main functionality, or that I'm not yet
confident won't change radically.
--
igor
Hello,
On 25/10/2018 00:04, Mike Rapoport wrote:
> I feel that foreword should include a sentence or two saying why we need
> the memory protection and when it can/should be used.
I was somewhat unsure about what this sort of document should actually
contain. In past reviews of older versions of the docs, I got the
impression that it should focus mostly on the API and how to use it.
And it seems that different people expect to find different types of
information here.
What is the best place where to put a more extensive description of the
feature?
--
igor
On 26/10/2018 11:20, Matthew Wilcox wrote: > On Fri, Oct 26, 2018 at 11:26:09AM +0200, Peter Zijlstra wrote: >> Jon, >> >> So the below document is a prime example for why I think RST sucks. As a >> text document readability is greatly diminished by all the markup >> nonsense. >> >> This stuff should not become write-only content like html and other >> gunk. The actual text file is still the primary means of reading this. > > I think Igor neglected to read doc-guide/sphinx.rst: Guilty as charged :-/ > Specific guidelines for the kernel documentation > ------------------------------------------------ > > Here are some specific guidelines for the kernel documentation: > > * Please don't go overboard with reStructuredText markup. Keep it > simple. For the most part the documentation should be plain text with > just enough consistency in formatting that it can be converted to > other formats. > I agree that it's overly marked up. I'll fix it -- igor
On 26/10/2018 12:09, Markus Heiser wrote: > I guess Igor was looking for a definition list ... It was meant to be bold. Even after reading the guideline "keep it simple", what exactly is simple can be subjective. If certain rst constructs are not acceptable and can be detected unequivocally, maybe it would help to teach them to checkpatch.pl [...] > see "Lists and Quote-like blocks" [1] > > [1] > http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#lists-and-quote-like-blocks thank you -- igor
On 26/10/2018 16:05, Jonathan Corbet wrote:
> But
> could I ask you, please, to make a pass over it and reduce the markup to
> a minimum? Using lists as suggested by Markus would help here.
sure, it's even easier for me to maintain the doc :-)
As I just wrote in a related reply, it might be worth teaching
checkpatch to detect formatting constructs that are not welcome.
--
igor
On 24/10/2018 14:43, Alexey Dobriyan wrote:
> On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
>> Some of the data structures used in list management are composed by two
>> pointers. Since the kernel is now configured by default to randomize the
>> layout of data structures soleley composed by pointers,
>
> Isn't this true for function pointers?
Yes, you are right.
Thanks for pointing this out.
I can drop this patch.
--
igor
On 28/10/2018 09:52, Steven Rostedt wrote:
> If a change log depends on other commits for context, it is
> insufficient.
ok, I will adjust the change logs accordingly
--
thanks, igor
On 25/10/2018 00:03, Dave Chinner wrote:
> On Wed, Oct 24, 2018 at 12:34:47AM +0300, Igor Stoppa wrote:
>> -- Summary --
>>
>> Preliminary version of memory protection patchset, including a sample use
>> case, turning into write-rare the IMA measurement list.
>
> I haven't looked at the patches yet, but I see a significant issue
> from the subject lines. "prmem" is very similar to "pmem"
> (persistent memory) and that's going to cause confusion. Especially
> if people start talking about prmem and pmem in the context of write
> protect pmem with prmem...
>
> Naming is hard :/
Yes, at some point I had to go from rare-write to write-rare, because I
realized that the acronym "rw" was likely to be interpreted as "read/write".
I propose to keep prmem for the time being, just to avoid adding extra
changes that are not functional.
Once the code has somewhat settled down, I can proceed with the
renaming. In the meanwhile, a better name could be discussed.
For example, would "prmemory" be sufficiently unambiguous?
--
igor
On 26/10/2018 10:41, Peter Zijlstra wrote: > On Wed, Oct 24, 2018 at 12:34:49AM +0300, Igor Stoppa wrote: >> +static __always_inline > > That's far too large for inline. The reason for it is that it's supposed to minimize the presence of gadgets that might be used in JOP attacks. I am ready to stand corrected, if I'm wrong, but this is the reason why I did it. Regarding the function being too large, yes, I would not normally choose it for inlining. Actually, I would not normally use "__always_inline" and instead I would limit myself to plain "inline", at most. > >> +bool wr_memset(const void *dst, const int c, size_t n_bytes) >> +{ >> + size_t size; >> + unsigned long flags; >> + uintptr_t d = (uintptr_t)dst; >> + >> + if (WARN(!__is_wr_after_init(dst, n_bytes), WR_ERR_RANGE_MSG)) >> + return false; >> + while (n_bytes) { >> + struct page *page; >> + uintptr_t base; >> + uintptr_t offset; >> + uintptr_t offset_complement; >> + >> + local_irq_save(flags); >> + page = virt_to_page(d); >> + offset = d & ~PAGE_MASK; >> + offset_complement = PAGE_SIZE - offset; >> + size = min(n_bytes, offset_complement); >> + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL); >> + if (WARN(!base, WR_ERR_PAGE_MSG)) { >> + local_irq_restore(flags); >> + return false; >> + } >> + memset((void *)(base + offset), c, size); >> + vunmap((void *)base); > > BUG yes, somehow I managed to drop this debug configuration from the debug builds I made. [...] > Also, I see an amount of duplication here that shows you're not nearly > lazy enough. I did notice a certain amount of duplication, but I didn't know how to exploit it. -- igor
On 26/10/2018 10:26, Peter Zijlstra wrote: >> +- Kernel code is protected at system level and, unlike data, it doesn't >> + require special attention. > > What does this even mean? I was trying to convey the notion that the pages containing kernel code do not require any special handling by the author of a generic kernel component, for example a kernel driver. Pages containing either statically or dynamically allocated data, instead, are not automatically protected. But yes, the sentence is far from being clear. > >> +Protection mechanism >> +-------------------- >> + >> +- When available, the MMU can write protect memory pages that would be >> + otherwise writable. > > Again; what does this really want to say? That it is possible to use the MMU also for write-protecting pages containing data which was not declared as constant. > >> +- The protection has page-level granularity. > > I don't think Linux supports non-paging MMUs. This probably came from a model I had in mind where a separate execution environment, such as a hypervisor, could trap and filter writes to certain parts of a page, rejecting some and performing others, effectively emulating sub-page granularity. > >> +- An attempt to overwrite a protected page will trigger an exception. >> +- **Write protected data must go exclusively to write protected pages** >> +- **Writable data must go exclusively to writable pages** > > WTH is with all those ** ? [...] > Can we ditch all the ** nonsense and put whitespace in there? More paragraphs > and whitespace are more good. Yes > Also, I really don't like how you differentiate between static and > dynamic wr. Ok, but why? What would you suggest, instead? [...] > We already have RO, why do you need more RO? I can explain, but I'm at a loss about the best place for it: I was under the impression that this sort of document should focus mostly on the API and its use. I was even considering removing most of the explanation and instead putting it in a separate document. 
> >> + >> + **Remarks:** >> + - The "AUTO" modes perform automatic protection of the content, whenever >> + the current vmap_area is used up and a new one is allocated. >> + - At that point, the vmap_area being phased out is protected. >> + - The size of the vmap_area depends on various parameters. >> + - It might not be possible to know for sure *when* certain data will >> + be protected. > > Surely that is a problem? > >> + - The functionality is provided as tradeoff between hardening and speed. > > Which you fail to explain. > >> + - Its usefulness depends on the specific use case at hand > > How about you write sensible text inside the option descriptions > instead? > > This is not a presentation; less bullets, more content. I tried to say something, without saying too much, but it was clearly a bad choice. Where should I put a thoroughly detailed explanation? Here or in a separate document? > >> +- Not only the pmalloc memory must be protected, but also any reference to >> + it that might become the target for an attack. The attack would replace >> + a reference to the protected memory with a reference to some other, >> + unprotected, memory. > > I still don't really understand the whole write-rare thing; how does it > really help? If we can write in kernel memory, we can write to > page-tables too. It has already been answered by Kees, but I'll also provide my own answer: the exploits used to write to kernel memory are specific to certain products and SW builds, so it's not possible to generalize too much. However, there might be some limitation in the reach of a specific vulnerability. For example, if a memory location is referred to as an offset from a base address, the maximum size of the offset limits the scope of the attack, which might make it impossible to use that specific vulnerability for writing directly to the page table. But something else, say a statically allocated variable, might be within reach. 
That said, there is also the almost orthogonal use case of providing increased robustness, by trapping accidental modifications caused by bugs. > And I don't think this document even begins to explain _why_ you're > doing any of this. How does it help? Ok, point taken >> +- The users of rare write must take care of ensuring the atomicity of the >> + action, respect to the way they use the data being altered; for example, >> + take a lock before making a copy of the value to modify (if it's >> + relevant), then alter it, issue the call to rare write and finally >> + release the lock. Some special scenario might be exempt from the need >> + for locking, but in general rare-write must be treated as an operation >> + that can incur into races. > > What?! Probably something along the lines of: "users of write-rare are responsible for using mechanisms that allow reading/writing data in a consistent way" and if it seems obvious, I can just drop it. One of the problems I have faced is to decide what level of knowledge or understanding I should expect from the reader of such a document: what I can take for granted and what I should explain. -- igor
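The locking discipline described in the quoted doc paragraph (take a lock, copy the value, alter the copy, issue the write-rare call, release the lock) can be sketched as follows; wr_memcpy() is a plain-memcpy stand-in for the write-rare primitive, and the function name wr_counter_add is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Stand-in for the write-rare primitive; the real one would remap,
 * write through the alias, and unmap. */
static void wr_memcpy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

static pthread_mutex_t wr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Consistent update of a protected value: the write-rare call itself
 * is not atomic with respect to other users, so the lock must cover
 * the whole read-modify-write sequence. */
long wr_counter_add(long *protected_ctr, long delta)
{
	long tmp;

	pthread_mutex_lock(&wr_lock);
	tmp = *protected_ctr;	/* copy of the value to modify */
	tmp += delta;		/* alter the private copy */
	wr_memcpy(protected_ctr, &tmp, sizeof(tmp));
	pthread_mutex_unlock(&wr_lock);
	return tmp;
}
```

This is the "some special scenario might be exempt" case in reverse: unless the data is only ever touched by one context, the caller owns the consistency problem, not the write-rare machinery.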
On 28/10/2018 18:31, Peter Zijlstra wrote: > On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote: >> On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote: >>> I still don't really understand the whole write-rare thing; how does it >>> really help? If we can write in kernel memory, we can write to >>> page-tables too. > >> One aspect of hardening the kernel against attack is reducing the >> internal attack surface. Not all flaws are created equal, so there is >> variation in what limitations an attacker may have when exploiting >> flaws (not many flaws end up being a fully controlled "write anything, >> anywhere, at any time"). By making the more sensitive data structures >> of the kernel read-only, we reduce the risk of an attacker finding a >> path to manipulating the kernel's behavior in a significant way. >> >> Examples of typical sensitive targets are function pointers, security >> policy, and page tables. Having these "read only at rest" makes them >> much harder to control by an attacker using memory integrity flaws. > > Because 'write-anywhere' exploits are easier than (and the typical first > step to) arbitrary code execution thingies? > >> The "write rarely" name itself may not sufficiently describe what is >> wanted either (I'll take the blame for the inaccurate name), so I'm >> open to new ideas there. The implementation requirements for the >> "sensitive data read-only at rest" feature are rather tricky: >> >> - allow writes only from specific places in the kernel >> - keep those locations inline to avoid making them trivial ROP targets >> - keep the writeability window open only to a single uninterruptable CPU > > The current patch set does not achieve that because it uses a global > address space for the alias mapping (vmap) which is equally accessible > from all CPUs. I never claimed to achieve 100% resilience to attacks. 
While it's true that the address space is accessible to all the CPUs, it is also true that the access has a limited duration in time, and the location where the access can be performed is not fixed. Iow, assuming that the CPUs not involved in the write-rare operations are compromised and that they are trying to perform a concurrent access to the data in the writable page, they have a limited window of opportunity. That said, what I have posted is just a tentative implementation. My primary intent was to at least give an idea of what I'd like to do: alter some data in a way that is not easily exploitable. >> - fast enough to deal with page table updates > > The proposed implementation needs page-tables for the alias; I don't see > how you could ever do R/O page-tables when you need page-tables to > modify your page-tables. It's not all-or-nothing. I hope we agree at least on the reasoning that having only a limited amount of address space directly attackable, instead of the whole set of pages containing exploitable data, reduces the attack surface. Furthermore, if we think about possible limitations that the attack might have (maximum reach), the level of protection might be even higher. I have to use "might" because I cannot foresee the vulnerability. Furthermore, taking a different angle: your average attacker is not necessarily very social and inclined to share the vulnerability found. It is safe to assume that in most cases each attacker has to identify the attack strategy autonomously. Reducing the number of individuals who can perform an attack, by increasing the expertise required, is also a way of doing damage control. > And this is entirely irrespective of performance. I have not completely given up on performance, but, being write-rare, I see improved performance as just a way of widening the range of possible recipients for the hardening. 
> > Oh, right, that CR0.WP stuff. > >> Igor's proposal builds on this by including a >> way to do this with dynamic allocation too, which greatly expands the >> scope of structures that can be protected. Given that the x86-only >> method of write-window creation was firmly rejected, this is a new >> proposal for how to do it (vmap window). Using switch_mm() has also >> been suggested, etc. > > Right... /me goes find the patches we did for text_poke. Hmm, those > never seem to have made it: > > https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com > > like that. That approach will in fact work and not be a completely > broken mess like this thing. That approach is x86 specific. I preferred to start with something that would work on a broader set of architectures, even if with less resilience. It can be done so that individual architectures can implement their own specific way and obtain better results. >> We need to find a good way to do the write-windowing that works well >> for static and dynamic structures _and_ for the page tables... this >> continues to be tricky. >> >> Making it resilient against ROP-style targets makes it difficult to >> deal with certain data structures (like list manipulation). In my >> earlier RFC, I tried to provide enough examples of where this could >> get used to let people see some of the complexity[1]. Igor's series >> expands this to even more examples using dynamic allocation. > > Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be > fast enough for much of anything. Part of the reason for the duplication is that, even if the wr_API I came up with, can be collapsed with the regular API, the implementation needs to be different, to be faster. 
Ex: * force the list_head structure to be aligned so that it will always be fully contained by a single page * create alternate mappings for all the pages involved (max 1 page for each list_head) * do the list operation on the remapped list_heads * destroy all the mappings I can try to introduce some wrapper similar to kmap_atomic(), as suggested by Dave Hansen, which can improve the coding, but it will not change the actual set of operations performed. > And I don't suppose we can take the WP fault and then fix up from there, > because if we're doing R/O page-tables, that'll increase the fault depth > and we'll double fault all the time, and triple fault where we > currently double fault. And we all know how _awesome_ triple faults > are. > > But duplicating (and wrapping in gunk) whole APIs is just not going to > work. Would something like kmap_atomic() be acceptable? Do you have some better proposal, now that (I hope) it should be more clear what I'm trying to do and why? -- igor
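The first step of the list above (forcing alignment so a list_head can never straddle a page boundary, which is what makes one alias mapping per node sufficient) can be expressed as a small compile-time property. The names and the 4 KiB page size are illustrative assumptions, not the patchset's.

```c
#include <assert.h>
#include <stdint.h>

#define WR_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Aligning the node to its own size (two pointers) guarantees that a
 * properly placed node is always fully contained in a single page. */
struct wr_list_head {
	struct wr_list_head *next, *prev;
} __attribute__((aligned(2 * sizeof(void *))));

/* True iff a node at this address fits entirely within one page. */
int wr_node_in_one_page(uintptr_t addr)
{
	uintptr_t off = addr & (WR_PAGE_SIZE - 1);

	return off + sizeof(struct wr_list_head) <= WR_PAGE_SIZE;
}
```

The negative case is the interesting one: a node placed a single pointer-size short of a page boundary would straddle two pages and need two alias mappings, which the alignment rule makes impossible.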
On 25/10/2018 01:13, Peter Zijlstra wrote:
> On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
>> +static __always_inline
>> +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
>> +{
>> +	struct page *page;
>> +	uintptr_t base;
>> +	uintptr_t offset;
>> +	unsigned long flags;
>> +	size_t size = sizeof(*l);
>> +	bool is_virt = __is_wr_after_init(l, size);
>> +
>> +	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
>> +		 WR_ERR_RANGE_MSG))
>> +		return false;
>> +	local_irq_save(flags);
>> +	if (is_virt)
>> +		page = virt_to_page(l);
>> +	else
>> +		vmalloc_to_page(l);
>> +	offset = (~PAGE_MASK) & (uintptr_t)l;
>> +	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
>> +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
>> +		local_irq_restore(flags);
>> +		return false;
>> +	}
>> +	if (inc)
>> +		atomic_long_inc((atomic_long_t *)(base + offset));
>> +	else
>> +		atomic_long_dec((atomic_long_t *)(base + offset));
>> +	vunmap((void *)base);
>> +	local_irq_restore(flags);
>> +	return true;
>> +
>> +}
>
> That's just hideously nasty.. and horribly broken.
>
> We're not going to duplicate all these kernel interfaces wrapped in gunk
> like that.

One possibility would be to have macros which use typeof() on the
parameter being passed, to decide which implementation to use: regular
or write-rare.

This means that type punning would still be needed, to select the
implementation.

Would this be enough? Is there some better way?

> Also, you _cannot_ call vunmap() with IRQs disabled. Clearly
> you've never tested this with debug bits enabled.

I thought I had them. And I _did_ have them enabled, at some point.
But I must have messed up with the configuration and I failed to notice
this.

I can think of a way it might work, albeit it's not going to be very
pretty:

* for the vmap(): if I understand correctly, it might sleep while
  obtaining memory for creating the mapping. This part could be executed
  before disabling interrupts. The rest of the function, instead, would
  be executed after interrupts are disabled.

* for vunmap(): after the writing is done, change also the alternate
  mapping to read only, then enable interrupts and destroy the alternate
  mapping. Making also the secondary mapping read only makes it equally
  secure as the primary, which means that it can be visible also with
  interrupts enabled.

--
igor
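The core trick being discussed, writing through a short-lived alternate mapping while the primary mapping stays read-only, can be demonstrated in userspace with memfd_create() and two mmap() views of the same page. This is only an illustrative analogue of the kernel mechanism, not code from the patchset; wr_alloc_page() and wr_write() are names invented for this demo, and it needs Linux with glibc >= 2.27 for memfd_create().

```c
/* Userspace sketch of the write-rare idea: the primary mapping is
 * read-only, and each write goes through a temporary writable alias
 * that exists only for the duration of the store. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int wr_fd = -1;

/* Map one page read-only; this stands in for the protected mapping. */
static void *wr_alloc_page(void)
{
	void *p;

	wr_fd = memfd_create("wr-demo", 0);
	if (wr_fd < 0 || ftruncate(wr_fd, getpagesize()) < 0)
		return NULL;
	p = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, wr_fd, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Write-rare store: create a writable alias, copy, tear it down. */
static int wr_write(size_t offset, const void *src, size_t len)
{
	void *alias = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
			   MAP_SHARED, wr_fd, 0);

	if (alias == MAP_FAILED)
		return -1;
	memcpy((char *)alias + offset, src, len);
	munmap(alias, getpagesize());	/* the alias lives only briefly */
	return 0;
}
```

A direct store through the pointer returned by wr_alloc_page() would fault with SIGSEGV, while the same location is updated fine through wr_write(); the vmap()/vunmap() pair in the patch plays the role of the second mmap()/munmap() here.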
On Mon, Oct 29, 2018 at 11:04:01PM +0200, Igor Stoppa wrote:
> On 28/10/2018 18:31, Peter Zijlstra wrote:
> > On Fri, Oct 26, 2018 at 11:46:28AM +0100, Kees Cook wrote:
> > > On Fri, Oct 26, 2018 at 10:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > I still don't really understand the whole write-rare thing; how does it
> > > > really help? If we can write in kernel memory, we can write to
> > > > page-tables too.
> > >
> > > One aspect of hardening the kernel against attack is reducing the
> > > internal attack surface. Not all flaws are created equal, so there is
> > > variation in what limitations an attacker may have when exploiting
> > > flaws (not many flaws end up being a fully controlled "write anything,
> > > anywhere, at any time"). By making the more sensitive data structures
> > > of the kernel read-only, we reduce the risk of an attacker finding a
> > > path to manipulating the kernel's behavior in a significant way.
> > >
> > > Examples of typical sensitive targets are function pointers, security
> > > policy, and page tables. Having these "read only at rest" makes them
> > > much harder to control by an attacker using memory integrity flaws.
> >
> > Because 'write-anywhere' exploits are easier than (and the typical first
> > step to) arbitrary code execution thingies?
> >
> > > The "write rarely" name itself may not sufficiently describe what is
> > > wanted either (I'll take the blame for the inaccurate name), so I'm
> > > open to new ideas there. The implementation requirements for the
> > > "sensitive data read-only at rest" feature are rather tricky:
> > >
> > > - allow writes only from specific places in the kernel
> > > - keep those locations inline to avoid making them trivial ROP targets
> > > - keep the writeability window open only to a single uninterruptable CPU
> >
> > The current patch set does not achieve that because it uses a global
> > address space for the alias mapping (vmap) which is equally accessible
> > from all CPUs.
> I never claimed to achieve 100% resilience to attacks.

You never even began explaining what you were defending against. Let
alone how well it achieved its stated goals.

> While it's true that the address space is accessible to all the CPUs, it is
> also true that the access has a limited duration in time, and the location
> where the access can be performed is not fixed.
>
> Iow, assuming that the CPUs not involved in the write-rare operations are
> compromised and that they are trying to perform a concurrent access to the
> data in the writable page, they have a limited window of opportunity.

That sounds like security through obscurity. Sure it makes it harder,
but it doesn't stop anything.

> That said, what I have posted is just a tentative implementation.
> My primary intent was to at least give an idea of what I'd like to do: alter
> some data in a way that is not easily exploitable.

Since you need to modify page-tables in order to achieve this, the
page-tables are also there for the attacker to write to.

> > > - fast enough to deal with page table updates
> >
> > The proposed implementation needs page-tables for the alias; I don't see
> > how you could ever do R/O page-tables when you need page-tables to
> > modify your page-tables.
>
> It's not all-or-nothing.

I really am still struggling with what it is this thing is supposed to
do. As it stands, I see very little value. Yes it makes a few things a
little harder, but it doesn't really do away with things.

> I hope we agree at least on the reasoning that having only a limited amount
> of address space directly attackable, instead of the whole set of pages
> containing exploitable data, is reducing the attack surface.

I do not in fact agree. Most pages are not interesting for an attack at
all. So simply reducing the set of pages you can write to isn't
sufficient.

IOW, removing all noninteresting pages from the writable set will
satisfy your criteria, but it avoids exactly 0 exploits.
What you need to argue is that we remove common exploit targets, and
you've utterly failed to do so. Kees mentioned: function pointers,
page-tables and a few other things. You mentioned nothing.

> Furthermore, if we think about possible limitations that the attack might
> have (maximum reach), the level of protection might be even higher. I have
> to use "might" because I cannot foresee the vulnerability.

That's just a bunch of words together that don't really say anything
afaict.

> Furthermore, taking a different angle: your average attacker is not
> necessarily very social and inclined to share the vulnerability found.
> It is safe to assume that in most cases each attacker has to identify the
> attack strategy autonomously.
> Reducing the number of individuals who can perform an attack, by
> increasing the expertise required, is also a way of doing damage control.

If you make RO all noninteresting pages, you in fact increase the
density of interesting targets and make things easier.

> > And this is entirely irrespective of performance.
>
> I have not completely given up on performance, but, being write-rare, I see
> improved performance as just a way of widening the range of possible
> recipients for the hardening.

What?! Are you saying you don't care about performance?

> > Right... /me goes find the patches we did for text_poke. Hmm, those
> > never seem to have made it:
> >
> >   https://lkml.kernel.org/r/20180902173224.30606-1-namit@vmware.com
> >
> > like that. That approach will in fact work and not be a completely
> > broken mess like this thing.
>
> That approach is x86 specific.

It is not. Every architecture does switch_mm() with IRQs disabled, as
that is exactly what the scheduler does. And keeping a second init_mm
around also isn't very architecture specific I feel.

> > > We need to find a good way to do the write-windowing that works well
> > > for static and dynamic structures _and_ for the page tables... this
> > > continues to be tricky.
> > > Making it resilient against ROP-style targets makes it difficult to
> > > deal with certain data structures (like list manipulation). In my
> > > earlier RFC, I tried to provide enough examples of where this could
> > > get used to let people see some of the complexity[1]. Igor's series
> > > expands this to even more examples using dynamic allocation.
> >
> > Doing 2 CR3 writes for 'every' WR write doesn't seem like it would be
> > fast enough for much of anything.
>
> Part of the reason for the duplication is that, even if the wr_ API I
> came up with can be collapsed with the regular API, the implementation
> needs to be different, to be faster.
>
> Ex:
> * force the list_head structure to be aligned so that it will always be
>   fully contained by a single page
> * create alternate mapping for all the pages involved (max 1 page for
>   each list_head)
> * do the list operation on the remapped list_heads
> * destroy all the mappings

Those constraints are due to single page aliases.

> I can try to introduce some wrapper similar to kmap_atomic(), as suggested
> by Dave Hansen, which can improve the coding, but it will not change the
> actual set of operations performed.

See below, I don't think kmap_atomic() is either correct or workable.

One thing that might be interesting is teaching objtool about the
write-enable write-disable markers and making it guarantee we reach a
disable after every enable. IOW, ensure we never leak this state.

I think that was part of why we hated on that initial thing Kees did --
well, that and disabling all WP is of course completely insane ;-).

> > And I don't suppose we can take the WP fault and then fix up from there,
> > because if we're doing R/O page-tables, that'll increase the fault depth
> > and we'll double fault all the time, and triple fault where we
> > currently double fault. And we all know how _awesome_ triple faults
> > are.
> >
> > But duplicating (and wrapping in gunk) whole APIs is just not going to
> > work.
> Would something like kmap_atomic() be acceptable?

Don't think so. kmap_atomic() on x86_32 (64bit doesn't have it at all)
only does the TLB invalidate on the one CPU, which we've established is
incorrect.

Also, kmap_atomic is still page-table based, which means not all
page-tables can be read-only.

> Do you have some better proposal, now that (I hope) it should be more clear
> what I'm trying to do and why?

You've still not talked about any actual attack vectors and how they're
mitigated by these patches.

I suppose the 'normal' attack goes like:

 1) find buffer-overrun / bound check failure
 2) use that to write to 'interesting' location
 3) that write results in arbitrary code execution
 4) win

Of course, if the store of 2 is to the current cred structure, and
simply sets the effective uid to 0, we can skip 3.

Which seems to suggest all cred structures should be made r/o like this.
But I'm not sure I remember these patches doing that.

Also, there is an inverse situation with all this. If you make
everything R/O, then you need this allow-write for everything you do,
which then is about to include a case with an overflow / bound check
fail, and we're back to square 1.

What are you doing to avoid that?
On Fri, Oct 26, 2018 at 03:17:07AM -0700, Matthew Wilcox wrote:
> On Fri, Oct 26, 2018 at 11:32:05AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 24, 2018 at 12:35:00AM +0300, Igor Stoppa wrote:
> > > Some of the data structures used in list management are composed by two
> > > pointers. Since the kernel is now configured by default to randomize the
> > > layout of data structures solely composed by pointers, this might
> > > prevent correct type punning between these structures and their write
> > > rare counterpart.
> >
> > 'might' doesn't really work for me. Either it does or it does not.
>
> He means "Depending on the random number generator, the two pointers
> might be AB or BA. If they're of opposite polarity (50% of the time),
> it _will_ break, and 50% of the time it _won't_ break."
So don't do that then. If he were to include struct list_head inside his
prlist_head, then there is only the one randomization and things will
just work.
Also, I really don't see why he needs that second type and all that type
punning crap in the first place.
On Mon, Oct 29, 2018 at 11:17:14PM +0200, Igor Stoppa wrote:
>
> On 25/10/2018 01:13, Peter Zijlstra wrote:
> > On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
> > > +static __always_inline
> > > +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
> > > +{
> > > +	struct page *page;
> > > +	uintptr_t base;
> > > +	uintptr_t offset;
> > > +	unsigned long flags;
> > > +	size_t size = sizeof(*l);
> > > +	bool is_virt = __is_wr_after_init(l, size);
> > > +
> > > +	if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
> > > +		 WR_ERR_RANGE_MSG))
> > > +		return false;
> > > +	local_irq_save(flags);
> > > +	if (is_virt)
> > > +		page = virt_to_page(l);
> > > +	else
> > > +		vmalloc_to_page(l);
> > > +	offset = (~PAGE_MASK) & (uintptr_t)l;
> > > +	base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> > > +	if (WARN(!base, WR_ERR_PAGE_MSG)) {
> > > +		local_irq_restore(flags);
> > > +		return false;
> > > +	}
> > > +	if (inc)
> > > +		atomic_long_inc((atomic_long_t *)(base + offset));
> > > +	else
> > > +		atomic_long_dec((atomic_long_t *)(base + offset));
> > > +	vunmap((void *)base);
> > > +	local_irq_restore(flags);
> > > +	return true;
> > > +
> > > +}
> >
> > That's just hideously nasty.. and horribly broken.
> >
> > We're not going to duplicate all these kernel interfaces wrapped in gunk
> > like that.
>
> One possibility would be to have macros which use typeof() on the
> parameter being passed, to decide which implementation to use: regular
> or write-rare.
>
> This means that type punning would still be needed, to select the
> implementation.
>
> Would this be enough? Is there some better way?

Like mentioned elsewhere; if you do write_enable() + write_disable()
thingies, it all becomes:

	write_enable();
	atomic_foo(&bar);
	write_disable();

No magic gunk infested duplication at all. Of course, ideally you'd then
teach objtool about this (or a GCC plugin I suppose) to ensure any
enable reached a disable.
The alternative is something like:

#define ALLOW_WRITE(stmt) do { write_enable(); do { stmt; } while (0); write_disable(); } while (0)

which then allows you to write:

	ALLOW_WRITE(atomic_foo(&bar));

No duplication.

> > Also, you _cannot_ call vunmap() with IRQs disabled. Clearly
> > you've never tested this with debug bits enabled.
>
> I thought I had them. And I _did_ have them enabled, at some point.
> But I must have messed up with the configuration and I failed to notice
> this.
>
> I can think of a way it might work, albeit it's not going to be very
> pretty:
>
> * for the vmap(): if I understand correctly, it might sleep while
>   obtaining memory for creating the mapping. This part could be executed
>   before disabling interrupts. The rest of the function, instead, would
>   be executed after interrupts are disabled.
>
> * for vunmap(): after the writing is done, change also the alternate
>   mapping to read only, then enable interrupts and destroy the alternate
>   mapping. Making also the secondary mapping read only makes it equally
>   secure as the primary, which means that it can be visible also with
>   interrupts enabled.

That doesn't work if you wanted to do this write while you already have
IRQs disabled for example.
On Tue, Oct 30, 2018 at 04:58:41PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 11:17:14PM +0200, Igor Stoppa wrote:
> >
> >
> > On 25/10/2018 01:13, Peter Zijlstra wrote:
> > > On Wed, Oct 24, 2018 at 12:35:03AM +0300, Igor Stoppa wrote:
> > > > +static __always_inline
> > > > +bool __pratomic_long_op(bool inc, struct pratomic_long_t *l)
> > > > +{
> > > > + struct page *page;
> > > > + uintptr_t base;
> > > > + uintptr_t offset;
> > > > + unsigned long flags;
> > > > + size_t size = sizeof(*l);
> > > > + bool is_virt = __is_wr_after_init(l, size);
> > > > +
> > > > + if (WARN(!(is_virt || likely(__is_wr_pool(l, size))),
> > > > + WR_ERR_RANGE_MSG))
> > > > + return false;
> > > > + local_irq_save(flags);
> > > > + if (is_virt)
> > > > + page = virt_to_page(l);
> > > > + else
> > > > + vmalloc_to_page(l);
> > > > + offset = (~PAGE_MASK) & (uintptr_t)l;
> > > > + base = (uintptr_t)vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> > > > + if (WARN(!base, WR_ERR_PAGE_MSG)) {
> > > > + local_irq_restore(flags);
> > > > + return false;
> > > > + }
> > > > + if (inc)
> > > > + atomic_long_inc((atomic_long_t *)(base + offset));
> > > > + else
> > > > + atomic_long_dec((atomic_long_t *)(base + offset));
> > > > + vunmap((void *)base);
> > > > + local_irq_restore(flags);
> > > > + return true;
> > > > +
> > > > +}
> > >
> > > That's just hideously nasty.. and horribly broken.
> > >
> > > We're not going to duplicate all these kernel interfaces wrapped in gunk
> > > like that.
> >
> > one possibility would be to have macros which use typeof() on the parameter
> > being passed, to decide what implementation to use: regular or write-rare
> >
> > This means that type punning would still be needed, to select the
> > implementation.
> >
> > Would this be enough? Is there some better way?
>
> Like mentioned elsewhere; if you do write_enable() + write_disable()
> thingies, it all becomes:
>
> write_enable();
> atomic_foo(&bar);
> write_disable();
>
> No magic gunk infested duplication at all. Of course, ideally you'd then
> teach objtool about this (or a GCC plugin I suppose) to ensure any
> enable reached a disable.
Isn't the issue here that we don't want to change the page tables for the
mapping of &bar, but instead want to create a temporary writable alias
at a random virtual address?
So you'd want:
wbar = write_enable(&bar);
atomic_foo(wbar);
write_disable(wbar);
which is probably better expressed as a map/unmap API. I suspect this
would also be the only way to do things for cmpxchg() loops, where you
want to create the mapping outside of the loop to minimise your time in
the critical section.
Will
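The map/unmap shape Will describes might look like this as kernel-style pseudocode. It is only a sketch: rare_map()/rare_unmap() and SOME_FLAG are hypothetical names, and the cmpxchg() loop illustrates his point that the alias has to be created once, outside the loop, rather than per store.

```c
/* Hypothetical map/unmap API; not from any posted series. */
atomic_long_t *w = rare_map(&bar);	/* temporary writable alias,
					 * at a random virtual address */
long old, new;

do {					/* alias spans the whole loop */
	old = atomic_long_read(w);
	new = old | SOME_FLAG;
} while (atomic_long_cmpxchg(w, old, new) != old);

rare_unmap(w);				/* tear the alias down */
```

The single-store case collapses to rare_map(), one atomic op, rare_unmap(), which is why a map/unmap API subsumes the plain write_enable()/write_disable() pairing.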
On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> I suppose the 'normal' attack goes like:
>
>  1) find buffer-overrun / bound check failure
>  2) use that to write to 'interesting' location
>  3) that write results in arbitrary code execution
>  4) win
>
> Of course, if the store of 2 is to the current cred structure, and
> simply sets the effective uid to 0, we can skip 3.

In most cases, yes, gaining root is game over. However, I don't want to
discount other threat models: some systems have been designed not to
trust root, so a cred attack doesn't always get an attacker full control
(e.g. lockdown series, signed modules, encrypted VMs, etc).

> Which seems to suggest all cred structures should be made r/o like this.
> But I'm not sure I remember these patches doing that.

There are things that attempt to protect cred (and other things, like
page tables) via hypervisors (see Samsung KNOX) or via syscall boundary
checking (see Linux Kernel Runtime Guard). They're pretty interesting,
but I'm not sure if there is a clear way forward on it working in
upstream, but that's why I think these discussions are useful.

> Also, there is an inverse situation with all this. If you make
> everything R/O, then you need this allow-write for everything you do,
> which then is about to include a case with an overflow / bound check
> fail, and we're back to square 1.

Sure -- this is the fine line in trying to build these defenses. The
point is to narrow the scope of attack. Stupid metaphor follows: right
now we have only a couple walls; if we add walls we can focus on making
sure the doors and windows are safe. If we make the relatively
easy-to-find-in-memory page tables read-only-at-rest then a whole class
of very powerful exploits that depend on page table attacks go away.
Part of all of this is the observation that there are two types of
things clearly worth protecting: that which is updated rarely (no need
to leave it writable for so much of its lifetime), and that which is
especially sensitive (page tables, security policy, function pointers,
etc). Finding a general purpose way to deal with these (like we have for
other data-lifetime cases like const and __ro_after_init) would be very
nice. I don't think there is a slippery slope here.

-Kees

--
Kees Cook
> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>
>> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>>
>> 1) find buffer-overrun / bound check failure
>> 2) use that to write to 'interesting' location
>> 3) that write results in arbitrary code execution
>> 4) win
>>
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
>
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).
>
>> Which seems to suggest all cred structures should be made r/o like this.
>> But I'm not sure I remember these patches doing that.
>
> There are things that attempt to protect cred (and other things, like
> page tables) via hypervisors (see Samsung KNOX) or via syscall
> boundary checking (see Linux Kernel Runtime Guard). They're pretty
> interesting, but I'm not sure if there is a clear way forward on it
> working in upstream, but that's why I think these discussions are
> useful.
>
>> Also, there is an inverse situation with all this. If you make
>> everything R/O, then you need this allow-write for everything you do,
>> which then is about to include a case with an overflow / bound check
>> fail, and we're back to square 1.
>
> Sure -- this is the fine line in trying to build these defenses. The
> point is to narrow the scope of attack. Stupid metaphor follows: right
> now we have only a couple walls; if we add walls we can focus on making
> sure the doors and windows are safe. If we make the relatively
> easy-to-find-in-memory page tables read-only-at-rest then a whole
> class of very powerful exploits that depend on page table attacks go
> away.
>
> As part of all of this is the observation that there are two types of
> things clearly worth protecting: that which is updated rarely (no need
> to leave it writable for so much of its lifetime), and that which is
> especially sensitive (page tables, security policy, function pointers,
> etc). Finding a general purpose way to deal with these (like we have
> for other data-lifetime cases like const and __ro_after_init) would be
> very nice. I don't think there is a slippery slope here.
>
>
Since I wasn’t cc’d on this series:
I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
If anyone wants to use CR0.WP instead, I’ll remind them that they have to fix up the entry code and justify the added complexity. And make performance not suck in a VM (i.e. CR0 reads on entry are probably off the table). None of these will be easy.
If anyone wants to use kmap_atomic-like tricks, I’ll point out that we already have enough problems with dangling TLB entries due to SMP issues. The last thing we need is more of them. If someone proposes a viable solution that doesn’t involve CR3 fiddling, I’ll be surprised.
Keep in mind that switch_mm() is actually decently fast on modern CPUs. It’s probably considerably faster than writing CR0, although I haven’t benchmarked it. It’s certainly faster than writing CR4. It’s also faster than INVPCID, surprisingly, which means that it will be quite hard to get better performance using any sort of trickery.
Nadav’s patch set would be an excellent starting point.
P.S. EFI is sort of grandfathered in as a hackish alternate page table hierarchy. We’re not adding another one.
On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> I support the addition of a rare-write mechanism to the upstream kernel.
> And I think that there is only one sane way to implement it: using an
> mm_struct. That mm_struct, just like any sane mm_struct, should only
> differ from init_mm in that it has extra mappings in the *user* region.
I'd like to understand this approach a little better. In a syscall path,
we run with the user task's mm. What you're proposing is that when we
want to modify rare data, we switch to rare_mm which contains a
writable mapping to all the kernel data which is rare-write.
So the API might look something like this:
void *p = rare_alloc(...); /* writable pointer */
p->a = x;
q = rare_protect(p); /* read-only pointer */
To subsequently modify q,
p = rare_modify(q);
q->a = y;
rare_protect(p);
Under the covers, rare_modify() would switch to the rare_mm and return
(void *)((unsigned long)q + ARCH_RARE_OFFSET). All of the rare data
would then be modifiable, although you don't have any other pointers
to it. rare_protect() would switch back to the previous mm and return
(p - ARCH_RARE_OFFSET).
Does this satisfy Igor's requirements? We wouldn't be able to
copy_to/from_user() while rare_mm was active. I think that's a feature
though! It certainly satisfies my interests (kernel code be able to
mark things as dynamically-allocated-and-read-only-after-initialisation)
On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> Does this satisfy Igor's requirements? We wouldn't be able to
> copy_to/from_user() while rare_mm was active. I think that's a feature
> though! It certainly satisfies my interests (kernel code be able to
> mark things as dynamically-allocated-and-read-only-after-initialisation)
It has to be more than copy_to/from_user(), though, I think.
rare_modify(q) either has to preempt_disable(), or we need to teach the
context-switching code when and how to switch in/out of the rare_mm.
preempt_disable() would also keep us from sleeping.
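Putting Matthew's API together with Dave's caveat, the core of rare_modify()/rare_protect() might be sketched like this. This is kernel-style pseudocode, not code from any posted series: rare_mm, ARCH_RARE_OFFSET and the direct use of switch_mm() here are assumptions, and a real implementation would have to follow what Nadav's text_poke series does around mm switching.

```c
/* Hypothetical sketch of the rare_mm approach. */
static struct mm_struct *rare_mm;	/* init_mm plus a writable alias of
					 * the rare-write data, placed in
					 * the user region of the address
					 * space */

void *rare_modify(const void *q)
{
	/* Keep the writable window on this CPU only, and forbid
	 * sleeping/migration while the alternate mm is live. */
	preempt_disable();
	switch_mm(current->active_mm, rare_mm, current);
	return (void *)((unsigned long)q + ARCH_RARE_OFFSET);
}

const void *rare_protect(void *p)
{
	switch_mm(rare_mm, current->active_mm, current);
	preempt_enable();
	return (const void *)((unsigned long)p - ARCH_RARE_OFFSET);
}
```

Because copy_to/from_user() and scheduling are off-limits between the two calls, the window stays small, which is the property Kees listed as "writeability window open only to a single uninterruptable CPU".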
On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > I support the addition of a rare-write mechanism to the upstream kernel.
> > And I think that there is only one sane way to implement it: using an
> > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > differ from init_mm in that it has extra mappings in the *user* region.
>
> I'd like to understand this approach a little better. In a syscall path,
> we run with the user task's mm. What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
>
> So the API might look something like this:
>
> 	void *p = rare_alloc(...);	/* writable pointer */
> 	p->a = x;
> 	q = rare_protect(p);		/* read-only pointer */
>
> To subsequently modify q,
>
> 	p = rare_modify(q);
> 	q->a = y;

Do you mean

	p->a = y;

here? I assume the intent is that q isn't writable ever, but that's the
one we have in the structure at rest.

Tycho

> 	rare_protect(p);
>
> Under the covers, rare_modify() would switch to the rare_mm and return
> (void *)((unsigned long)q + ARCH_RARE_OFFSET). All of the rare data
> would then be modifiable, although you don't have any other pointers
> to it. rare_protect() would switch back to the previous mm and return
> (p - ARCH_RARE_OFFSET).
>
> Does this satisfy Igor's requirements? We wouldn't be able to
> copy_to/from_user() while rare_mm was active. I think that's a feature
> though! It certainly satisfies my interests (kernel code be able to
> mark things as dynamically-allocated-and-read-only-after-initialisation)
> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>> I support the addition of a rare-write mechanism to the upstream kernel.
>> And I think that there is only one sane way to implement it: using an
>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>> differ from init_mm in that it has extra mappings in the *user* region.
>
> I'd like to understand this approach a little better. In a syscall path,
> we run with the user task's mm. What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
>
> So the API might look something like this:
>
> void *p = rare_alloc(...); /* writable pointer */
> p->a = x;
> q = rare_protect(p); /* read-only pointer */
>
> To subsequently modify q,
>
> p = rare_modify(q);
> q->a = y;
> rare_protect(p);
How about:
rare_write(&q->a, y);
Or, for big writes:
rare_write_copy(&q, local_q);
This avoids a whole ton of issues. In practice, actually running with a special mm requires preemption disabled as well as some other stuff, which Nadav carefully dealt with.
Also, can we maybe focus on getting something merged for statically allocated data first?
Finally, one issue: rare_alloc() is going to utterly suck performance-wise due to the global IPI when the region gets zapped out of the direct map or otherwise made RO. This is the same issue that makes all existing XPO efforts so painful. We need to either optimize the crap out of it somehow or we need to make sure it’s not called except during rare events like device enumeration.
Nadav, want to resubmit your series? IIRC the only thing wrong with it was that it was a big change and we wanted a simpler fix to backport. But that’s all done now, and I, at least, rather liked your code. :)
On Tue, Oct 30, 2018 at 11:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>> And I think that there is only one sane way to implement it: using an
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>> differ from init_mm in that it has extra mappings in the *user* region.
>>
>> I'd like to understand this approach a little better. In a syscall path,
>> we run with the user task's mm. What you're proposing is that when we
>> want to modify rare data, we switch to rare_mm which contains a
>> writable mapping to all the kernel data which is rare-write.
>>
>> So the API might look something like this:
>>
>> 	void *p = rare_alloc(...);	/* writable pointer */
>> 	p->a = x;
>> 	q = rare_protect(p);		/* read-only pointer */
>>
>> To subsequently modify q,
>>
>> 	p = rare_modify(q);
>> 	q->a = y;
>> 	rare_protect(p);
>
> How about:
>
> 	rare_write(&q->a, y);
>
> Or, for big writes:
>
> 	rare_write_copy(&q, local_q);
>
> This avoids a whole ton of issues. In practice, actually running with a
> special mm requires preemption disabled as well as some other stuff,
> which Nadav carefully dealt with.

This is what I had before, yes:

https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=9ab0cb2618ebbc51f830ceaa06b7d2182fe1a52d

It just needs the switch_mm() backend.

--
Kees Cook
On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > > I support the addition of a rare-write mechanism to the upstream kernel.
> > > And I think that there is only one sane way to implement it: using an
> > > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > > differ from init_mm in that it has extra mappings in the *user* region.
> >
> > I'd like to understand this approach a little better. In a syscall path,
> > we run with the user task's mm. What you're proposing is that when we
> > want to modify rare data, we switch to rare_mm which contains a
> > writable mapping to all the kernel data which is rare-write.
> >
> > So the API might look something like this:
> >
> > void *p = rare_alloc(...); /* writable pointer */
> > p->a = x;
> > q = rare_protect(p); /* read-only pointer */
> >
> > To subsequently modify q,
> >
> > p = rare_modify(q);
> > q->a = y;
>
> Do you mean
>
> p->a = y;
>
> here? I assume the intent is that q isn't writable ever, but that's
> the one we have in the structure at rest.
Yes, that was my intent, thanks.
To handle the list case that Igor has pointed out, you might want to
do something like this:
	list_for_each_entry(x, &xs, entry) {
		struct foo *writable = rare_modify(entry);
		kref_get(&writable->ref);
		rare_protect(writable);
	}
but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
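One possible shape for that list_for_each_rare_entry() wrapper, shown purely as a pseudocode sketch: it takes an extra cursor for the writable alias, and leans on the rare_modify()/rare_protect() primitives proposed above (the macro body and its exact semantics are an assumption, not something from the thread).

```c
/* Hypothetical wrapper: 'pos' walks the read-only list, while 'w' is a
 * temporary writable alias valid only inside the loop body. */
#define list_for_each_rare_entry(pos, w, head, member)			\
	for (pos = list_first_entry(head, typeof(*pos), member);	\
	     &pos->member != (head) && ((w) = rare_modify(pos), 1);	\
	     rare_protect(w),						\
	     pos = list_next_entry(pos, member))

/* Usage, matching the open-coded loop above: */
struct foo *x, *wx;

list_for_each_rare_entry(x, wx, &xs, entry)
	kref_get(&wx->ref);
```

One design question such a wrapper leaves open is exactly Igor's: whether the alias may stay mapped, interrupts enabled, across iterations of the loop.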
On 30/10/2018 21:20, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>> And I think that there is only one sane way to implement it: using an
>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>
>>> I'd like to understand this approach a little better. In a syscall path,
>>> we run with the user task's mm. What you're proposing is that when we
>>> want to modify rare data, we switch to rare_mm which contains a
>>> writable mapping to all the kernel data which is rare-write.
>>>
>>> So the API might look something like this:
>>>
>>> 	void *p = rare_alloc(...);	/* writable pointer */
>>> 	p->a = x;
>>> 	q = rare_protect(p);		/* read-only pointer */

With pools and memory allocated from vmap_areas, I was able to say

	protect(pool)

and that would do a sweep over all the pages currently in use.
In the SELinux policyDB, for example, one doesn't really want to
individually protect each allocation.

The loading phase usually happens at boot, when the system can be
assumed to be sane (one might even preload a bare-bones set of rules
from initramfs and then replace it later on, with the full blown set).

There is no need to process each of these tens of thousands of
allocations and initializations as write-rare.

Would it be possible to do the same here?

>>>
>>> To subsequently modify q,
>>>
>>> 	p = rare_modify(q);
>>> 	q->a = y;
>>
>> Do you mean
>>
>> 	p->a = y;
>>
>> here? I assume the intent is that q isn't writable ever, but that's
>> the one we have in the structure at rest.
>
> Yes, that was my intent, thanks.
> > To handle the list case that Igor has pointed out, you might want to > do something like this: > > list_for_each_entry(x, &xs, entry) { > struct foo *writable = rare_modify(entry); Would this mapping be impossible to spoof by other cores? I'm asking this because, from what I understand, local interrupts are enabled here, so an attack could freeze the core performing the write-rare operation, while another scrapes the memory. But blocking interrupts for the entire body of the loop would make RT latency unpredictable. > kref_get(&writable->ref); > rare_protect(writable); > } > > but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer. This seems suspiciously close to the duplication of kernel interfaces that I was roasted for :-) -- igor
> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>> And I think that there is only one sane way to implement it: using an
>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>
>>>> I'd like to understand this approach a little better. In a syscall path,
>>>> we run with the user task's mm. What you're proposing is that when we
>>>> want to modify rare data, we switch to rare_mm which contains a
>>>> writable mapping to all the kernel data which is rare-write.
>>>>
>>>> So the API might look something like this:
>>>>
>>>> void *p = rare_alloc(...); /* writable pointer */
>>>> p->a = x;
>>>> q = rare_protect(p); /* read-only pointer */
>
> With pools and memory allocated from vmap_areas, I was able to say
>
> protect(pool)
>
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to
> individually protect each allocation.
>
> The loading phase happens usually at boot, when the system can be
> assumed to be sane (one might even preload a bare-bone set of rules
> from initramfs and then replace it later on, with the full blown set).
>
> There is no need to process each of these tens of thousands allocations
> and initialization as write-rare.
>
> Would it be possible to do the same here?

I don’t see why not, although getting the API right will be a tad
complicated.

>>>>
>>>> To subsequently modify q,
>>>>
>>>> p = rare_modify(q);
>>>> q->a = y;
>>>
>>> Do you mean
>>>
>>> p->a = y;
>>>
>>> here? I assume the intent is that q isn't writable ever, but that's
>>> the one we have in the structure at rest.
>>
>> Yes, that was my intent, thanks.
>>
>> To handle the list case that Igor has pointed out, you might want to
>> do something like this:
>>
>> list_for_each_entry(x, &xs, entry) {
>>     struct foo *writable = rare_modify(entry);
>
> Would this mapping be impossible to spoof by other cores?

Indeed. Only the core with the special mm loaded could see it.

But I dislike allowing regular writes in the protected region. We really
only need four write primitives:

1. Just write one value. Call at any time (except NMI).

2. Just copy some bytes. Same as (1) but any number of bytes.

3,4: Same as 1 and 2 but must be called inside a special rare write
region. This is purely an optimization.

Actually getting a modifiable pointer should be disallowed for two
reasons:

1. Some architectures may want to use a special
write-different-address-space operation. Heck, x86 could, too: make the
actual offset be a secret and shove the offset into FSBASE or similar.
Then %fs-prefixed writes would do the rare writes.

2. Alternatively, x86 could set the U bit. Then the actual writes would
use the uaccess helpers, giving extra protection via SMAP.

We don’t really want a situation where an unchecked pointer in the rare
write region completely defeats the mechanism.
On Tue, Oct 30, 2018 at 2:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>>> And I think that there is only one sane way to implement it: using an
>>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>>
>>>>> I'd like to understand this approach a little better. In a syscall path,
>>>>> we run with the user task's mm. What you're proposing is that when we
>>>>> want to modify rare data, we switch to rare_mm which contains a
>>>>> writable mapping to all the kernel data which is rare-write.
>>>>>
>>>>> So the API might look something like this:
>>>>>
>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>> p->a = x;
>>>>> q = rare_protect(p); /* read-only pointer */
>>
>> With pools and memory allocated from vmap_areas, I was able to say
>>
>> protect(pool)
>>
>> and that would do a swipe on all the pages currently in use.
>> In the SELinux policyDB, for example, one doesn't really want to
>> individually protect each allocation.
>>
>> The loading phase happens usually at boot, when the system can be
>> assumed to be sane (one might even preload a bare-bone set of rules
>> from initramfs and then replace it later on, with the full blown set).
>>
>> There is no need to process each of these tens of thousands allocations
>> and initialization as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad
> complicated.
>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>> p = rare_modify(q);
>>>>> q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>> p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>> list_for_each_entry(x, &xs, entry) {
>>>     struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really
> only need four write primitives:
>
> 1. Just write one value. Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write
> region. This is purely an optimization.
>
> Actually getting a modifiable pointer should be disallowed for two
> reasons:
>
> 1. Some architectures may want to use a special
> write-different-address-space operation. Heck, x86 could, too: make the
> actual offset be a secret and shove the offset into FSBASE or similar.
> Then %fs-prefixed writes would do the rare writes.
>
> 2. Alternatively, x86 could set the U bit. Then the actual writes would
> use the uaccess helpers, giving extra protection via SMAP.
>
> We don’t really want a situation where an unchecked pointer in the rare
> write region completely defeats the mechanism.

We still have to deal with certain structures under the write-rare
window. For example, see:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
They are wrappers to non-inline functions that have the same
sanity-checking.

--
Kees Cook
On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> Finally, one issue: rare_alloc() is going to utterly suck
> performance-wise due to the global IPI when the region gets zapped out
> of the direct map or otherwise made RO. This is the same issue that
> makes all existing XPO efforts so painful. We need to either optimize
> the crap out of it somehow or we need to make sure it’s not called
> except during rare events like device enumeration.
Batching operations is kind of the whole point of the VM ;-) Either
this rare memory gets used a lot, in which case we'll want to create slab
caches for it, make it a MM zone and the whole nine yards, or it's not
used very much in which case it doesn't matter that performance sucks.
For now, I'd suggest allocating 2MB chunks as needed, and having a
shrinker to hand back any unused pieces.
On 30/10/2018 23:07, Kees Cook wrote:
> We still have to deal with certain structures under the write-rare
> window. For example, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474
> They are wrappers to non-inline functions that have the same sanity-checking.
Even if I also did something similar, it was not intended to be final,
just a stop-gap solution till the write-rare mechanism is settled.

If the whole list_head is aligned to its own size, then the entire
structure can be given a single alternate mapping and the plain list
functions can be used on that mapping. That could halve the overhead,
or better.

The typical case is when not just one list_head but both are contained
in the same page, as when allocating and queuing multiple items from
the same pool. A single temporary mapping would then be enough, but it
becomes tricky to do without generating code that is
almost-but-not-quite identical.
--
igor
On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > So the API might look something like this:
> > > >
> > > > void *p = rare_alloc(...); /* writable pointer */
> > > > p->a = x;
> > > > q = rare_protect(p); /* read-only pointer */
>
> With pools and memory allocated from vmap_areas, I was able to say
>
> protect(pool)
>
> and that would do a swipe on all the pages currently in use.
> In the SELinux policyDB, for example, one doesn't really want to
> individually protect each allocation.
>
> The loading phase happens usually at boot, when the system can be
> assumed to be sane (one might even preload a bare-bone set of rules
> from initramfs and then replace it later on, with the full blown set).
>
> There is no need to process each of these tens of thousands allocations
> and initialization as write-rare.
>
> Would it be possible to do the same here?

What Andy is proposing effectively puts all rare allocations into
one pool. Although I suppose it could be generalised to multiple pools
... one mm_struct per pool. Andy, what do you think to doing that?

> > but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
>
> This seems suspiciously close to the duplication of kernel interfaces
> that I was roasted for :-)

Can you not see the difference between adding one syntactic sugar
function and duplicating the entire infrastructure?
On 30/10/2018 23:35, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>> Would it be possible to do the same here?
>
> What Andy is proposing effectively puts all rare allocations into
> one pool. Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool. Andy, what do you think to doing that?

The reason to have pools is that, continuing the SELinux example, it
supports reloading the policyDB. In this case it seems to me that it
would be faster to drop the entire pool in one go, and create a new one
when re-initializing the rules. Or maybe the pool could be flushed,
without destroying the metadata.

One more reason for having pools is to assign certain properties to the
pool and then rely on them being applied to every subsequent allocation.

I've also been wondering if pools can be expected to have some
well-defined properties. One might be that they do not need to be
created on the fly: the number of pools should be known at compilation
time. At least the metadata could be static, but I do not know how this
would work with modules.

>>> but we'd probably wrap it in list_for_each_rare_entry(), just to be nicer.
>>
>> This seems suspiciously close to the duplication of kernel interfaces
>> that I was roasted for :-)
>
> Can you not see the difference between adding one syntactic sugar
> function and duplicating the entire infrastructure?

And list_add()? Or any of the other list-related functions? Don't they
end up receiving a similar treatment? I might have misunderstood your
proposal, but it seems to me that they too will need the same type of
rare_modify/rare_protect pairs. No?

And the same for hlist, including the _rcu variants.

--
igor
On 30/10/2018 23:25, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
>> Finally, one issue: rare_alloc() is going to utterly suck
>> performance-wise due to the global IPI when the region gets zapped out
>> of the direct map or otherwise made RO. This is the same issue that
>> makes all existing XPO efforts so painful. We need to either optimize
>> the crap out of it somehow or we need to make sure it’s not called
>> except during rare events like device enumeration.
>
> Batching operations is kind of the whole point of the VM ;-) Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yards, or it's not
> used very much in which case it doesn't matter that performance sucks.
>
> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.
One of the prime candidates for this sort of protection is IMA.
In the IMA case, there are ever-growing lists which are populated when
accessing files.
It's something that ends up on the critical path of any usual
performance critical use case, when accessing files for the first time,
like at boot/application startup.
Also the SELinux AVC is based on lists. It uses an object cache, but it
is still something that grows and is on the critical path of evaluating
the callbacks from the LSM hooks. A lot of them.
These are the main two reasons, so far, for me advocating an
optimization of the write-rare version of the (h)list.
--
igor
On Tue, Oct 30, 2018 at 11:55:46PM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:25, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > > Finally, one issue: rare_alloc() is going to utterly suck
> > > performance-wise due to the global IPI when the region gets zapped out
> > > of the direct map or otherwise made RO. This is the same issue that
> > > makes all existing XPO efforts so painful. We need to either optimize
> > > the crap out of it somehow or we need to make sure it’s not called
> > > except during rare events like device enumeration.
> >
> > Batching operations is kind of the whole point of the VM ;-) Either
> > this rare memory gets used a lot, in which case we'll want to create slab
> > caches for it, make it a MM zone and the whole nine yards, or it's not
> > used very much in which case it doesn't matter that performance sucks.
> >
> > For now, I'd suggest allocating 2MB chunks as needed, and having a
> > shrinker to hand back any unused pieces.
>
> One of the prime candidates for this sort of protection is IMA.
> In the IMA case, there are ever-growing lists which are populated when
> accessing files.
> It's something that ends up on the critical path of any usual performance
> critical use case, when accessing files for the first time, like at
> boot/application startup.
>
> Also the SELinux AVC is based on lists. It uses an object cache, but it is
> still something that grows and is on the critical path of evaluating the
> callbacks from the LSM hooks. A lot of them.
>
> These are the main two reasons, so far, for me advocating an optimization of
> the write-rare version of the (h)list.
I think these are both great examples of why doubly-linked lists _suck_.
You have to modify three cachelines to add an entry to a list. Walking a
linked list is an exercise in cache misses. Far better to use an XArray /
IDR for this purpose.
On 30/10/2018 23:02, Andy Lutomirski wrote:
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>> There is no need to process each of these tens of thousands allocations
>> and initialization as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad
> complicated.

yes, I have some first-hand experience with this :-/

>>>>> To subsequently modify q,
>>>>>
>>>>> p = rare_modify(q);
>>>>> q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>> p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>> list_for_each_entry(x, &xs, entry) {
>>>     struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really
> only need four write primitives:
>
> 1. Just write one value. Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write
> region. This is purely an optimization.

Atomic? RCU? Yes, they are technically just memory writes, but shouldn't
the "normal" operation be executed on the writable mapping?

It is possible to sandwich any call between a rare_modify/rare_protect,
however isn't that pretty close to having a write-rare version of these
plain functions?

--
igor
From: Andy Lutomirski
Sent: October 30, 2018 at 6:51:17 PM GMT
> To: Matthew Wilcox <willy@infradead.org>, nadav.amit@gmail.com
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
>
>
>> On Oct 30, 2018, at 10:58 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>> And I think that there is only one sane way to implement it: using an
>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>> differ from init_mm in that it has extra mappings in the *user* region.
>>
>> I'd like to understand this approach a little better. In a syscall path,
>> we run with the user task's mm. What you're proposing is that when we
>> want to modify rare data, we switch to rare_mm which contains a
>> writable mapping to all the kernel data which is rare-write.
>>
>> So the API might look something like this:
>>
>> void *p = rare_alloc(...); /* writable pointer */
>> p->a = x;
>> q = rare_protect(p); /* read-only pointer */
>>
>> To subsequently modify q,
>>
>> p = rare_modify(q);
>> q->a = y;
>> rare_protect(p);
>
> How about:
>
> rare_write(&q->a, y);
>
> Or, for big writes:
>
> rare_write_copy(&q, local_q);
>
> This avoids a whole ton of issues. In practice, actually running with a
> special mm requires preemption disabled as well as some other stuff, which
> Nadav carefully dealt with.
>
> Also, can we maybe focus on getting something merged for statically
> allocated data first?
>
> Finally, one issue: rare_alloc() is going to utterly suck performance-wise
> due to the global IPI when the region gets zapped out of the direct map or
> otherwise made RO. This is the same issue that makes all existing XPO
> efforts so painful. We need to either optimize the crap out of it somehow
> or we need to make sure it’s not called except during rare events like
> device enumeration.
>
> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
> that it was a big change and we wanted a simpler fix to backport. But
> that’s all done now, and I, at least, rather liked your code. :)
I guess since it was based on your ideas…
Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
make kgdb use of poke_text() less disgusting. I still don’t know exactly
how to deal with it.
Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
v2.
On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
> > On 30/10/2018 21:20, Matthew Wilcox wrote:
> > > > > So the API might look something like this:
> > > > >
> > > > > void *p = rare_alloc(...); /* writable pointer */
> > > > > p->a = x;
> > > > > q = rare_protect(p); /* read-only pointer */
> >
> > With pools and memory allocated from vmap_areas, I was able to say
> >
> > protect(pool)
> >
> > and that would do a swipe on all the pages currently in use.
> > In the SELinux policyDB, for example, one doesn't really want to
> > individually protect each allocation.
> >
> > The loading phase happens usually at boot, when the system can be assumed to
> > be sane (one might even preload a bare-bone set of rules from initramfs and
> > then replace it later on, with the full blown set).
> >
> > There is no need to process each of these tens of thousands allocations and
> > initialization as write-rare.
> >
> > Would it be possible to do the same here?
>
> What Andy is proposing effectively puts all rare allocations into
> one pool. Although I suppose it could be generalised to multiple pools
> ... one mm_struct per pool. Andy, what do you think to doing that?
Hmm. Let's see.
To clarify some of this thread, I think that the fact that rare_write
uses an mm_struct and alias mappings under the hood should be
completely invisible to users of the API. No one should ever be
handed a writable pointer to rare_write memory (except perhaps during
bootup or when initializing a large complex data structure that will
be rare_write but isn't yet, e.g. the policy db).
For example, there could easily be architectures where having a
writable alias is problematic. On such architectures, an entirely
different mechanism might work better. And, if a tool like KNOX ever
becomes a *part* of the Linux kernel (hint hint!)
If you have multiple pools and one mm_struct per pool, you'll need a
way to find the mm_struct from a given allocation. Regardless of how
the mm_structs are set up, changing rare_write memory to normal memory
or vice versa will require a global TLB flush (all ASIDs and global
pages) on all CPUs, so having extra mm_structs doesn't seem to buy
much.
(It's just possible that changing rare_write back to normal might be
able to avoid the flush if the spurious faults can be handled
reliably.)
On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
> > Nadav, want to resubmit your series? IIRC the only thing wrong with it was
> > that it was a big change and we wanted a simpler fix to backport. But
> > that’s all done now, and I, at least, rather liked your code. :)
>
> I guess since it was based on your ideas…
>
> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
> make kgdb use of poke_text() less disgusting. I still don’t know exactly
> how to deal with it.
>
> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
> v2.
Oh man, I completely forgot about kgdb :/
Also, would it be possible to entirely remove that kmap fallback path?
Adding SELinux folks and the SELinux ml. I think it's better if they
participate in this discussion.

On 31/10/2018 06:41, Andy Lutomirski wrote:
> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>> So the API might look something like this:
>>>>>>
>>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>>> p->a = x;
>>>>>> q = rare_protect(p); /* read-only pointer */
>>>
>>> With pools and memory allocated from vmap_areas, I was able to say
>>>
>>> protect(pool)
>>>
>>> and that would do a swipe on all the pages currently in use.
>>> In the SELinux policyDB, for example, one doesn't really want to
>>> individually protect each allocation.
>>>
>>> The loading phase happens usually at boot, when the system can be
>>> assumed to be sane (one might even preload a bare-bone set of rules
>>> from initramfs and then replace it later on, with the full blown set).
>>>
>>> There is no need to process each of these tens of thousands allocations
>>> and initialization as write-rare.
>>>
>>> Would it be possible to do the same here?
>>
>> What Andy is proposing effectively puts all rare allocations into
>> one pool. Although I suppose it could be generalised to multiple pools
>> ... one mm_struct per pool. Andy, what do you think to doing that?
>
> Hmm. Let's see.
>
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API.

I agree.

> No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).

The policy db doesn't need to be write-rare. Actually, it really
shouldn't be write-rare.

Maybe it's just a matter of wording, but effectively the policyDB can be
treated with this sequence:

1) allocate various data structures in writable form
2) initialize them
3) go back to 1 as needed
4) lock down everything that has been allocated, as Read-Only

The reason why I stress Read-Only is that differentiating what is really
Read-Only from what is Write-Rare provides an extra edge against
attacks, because attempts to alter Read-Only data through a Write-Rare
API could be detected.

5) read any part of the policyDB during regular operations
6) in case of update, create a temporary new version, using steps 1..3
7) if the update was successful, use the new one and destroy the old one
8) if the update failed, destroy the new one

The destruction at points 7 and 8 is not so much a write operation as a
release of the memory. So we might have somewhat different
interpretations of what write-rare means wrt destroying the memory and
its content.

To clarify: I've been using write-rare to indicate primarily small
operations that one would otherwise achieve with "=", memcpy or memset,
or more complex variants, like atomic ops, RCU pointer assignment, etc.
Tearing down an entire set of allocations like the policyDB doesn't fit
very well with that model.

The only part which _needs_ to be write-rare, in the policyDB, is the
set of pointers which are used to access all of the dynamically
allocated data set. These pointers must be updated when a new policyDB
is allocated.

> For example, there could easily be architectures where having a
> writable alias is problematic. On such architectures, an entirely
> different mechanism might work better. And, if a tool like KNOX ever
> becomes a *part* of the Linux kernel (hint hint!)

Something related, albeit not identical, is going on here [1].
Eventually, it could be expanded to deal also with write-rare.

> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation.

Indeed.

In my patchset, based on vmas, I do the following:

* a private field from the page struct points to the vma using that page
* inside the vma there is a list_head, used only during deletion:
  - one pointer is used to chain vmas from the same pool
  - one pointer points to the pool struct
* the pool struct holds the properties to use for all the associated
  allocations: is it write-rare, read-only, does it auto-protect, etc.

> Regardless of how
> the mm_structs are set up, changing rare_write memory to normal memory
> or vice versa will require a global TLB flush (all ASIDs and global
> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
> much.

1) it supports different levels of protection: temporarily unprotected
   vs read-only vs write-rare

2) the change of write permission should be possible only toward more
   restrictive rules (writable -> write-rare -> read-only) and only to
   the point that was specified while creating the pool, to avoid DoS
   attacks, where a write-rare is flipped into read-only and further
   updates fail (ex: prevent IMA from registering modifications to a
   file, by not letting it store new information - I'm not 100% sure
   this would work, but it gives the idea, I think)

3) being able to track all the allocations related to a pool would
   allow performing mass operations, like reducing the writability or
   destroying all the allocations.

> (It's just possible that changing rare_write back to normal might be
> able to avoid the flush if the spurious faults can be handled
> reliably.)

I do not see the need for such a degrading of the write permissions of
an allocation, unless it refers to the release of a pool of allocations
(see updating the SELinux policyDB).

[1] https://www.openwall.com/lists/kernel-hardening/2018/10/26/11

--
igor
On Tue, Oct 30, 2018 at 04:28:16PM +0000, Will Deacon wrote:
> On Tue, Oct 30, 2018 at 04:58:41PM +0100, Peter Zijlstra wrote:
> > Like mentioned elsewhere; if you do write_enable() + write_disable()
> > thingies, it all becomes:
> >
> > write_enable();
> > atomic_foo(&bar);
> > write_disable();
> >
> > No magic gunk infested duplication at all. Of course, ideally you'd then
> > teach objtool about this (or a GCC plugin I suppose) to ensure any
> > enable reached a disable.
>
> Isn't the issue here that we don't want to change the page tables for the
> mapping of &bar, but instead want to create a temporary writable alias
> at a random virtual address?
>
> So you'd want:
>
> wbar = write_enable(&bar);
> atomic_foo(wbar);
> write_disable(wbar);
>
> which is probably better expressed as a map/unmap API. I suspect this
> would also be the only way to do things for cmpxchg() loops, where you
> want to create the mapping outside of the loop to minimise your time in
> the critical section.
Ah, so I was thinking that the alternative mm would have stuff in the
same location, just RW instead of RO.
But yes, if we, like Andy suggests, use the userspace address range for
the aliases, then we need to do as you suggest.
On Tue, Oct 30, 2018 at 11:03:51AM -0700, Dave Hansen wrote:
> On 10/30/18 10:58 AM, Matthew Wilcox wrote:
> > Does this satisfy Igor's requirements? We wouldn't be able to
> > copy_to/from_user() while rare_mm was active. I think that's a feature
> > though! It certainly satisfies my interests (kernel code be able to
> > mark things as dynamically-allocated-and-read-only-after-initialisation)
>
> It has to be more than copy_to/from_user(), though, I think.
>
> rare_modify(q) either has to preempt_disable(), or we need to teach the
> context-switching code when and how to switch in/out of the rare_mm.
> preempt_disable() would also keep us from sleeping.
Yes, I think we want to have preempt disable at the very least. We could
indeed make the context switch code smart and teach it about this state,
but I think allowing preemption in such a section is a bad idea. We want
to keep these sections short and simple (and bounded), such that they
can be analyzed for correctness.
Once you allow preemption, things tend to grow large and complex.
Ideally we'd even disable interrupts over this, to further limit what
code runs in the rare_mm context. NMIs need special care anyway.
On 30/10/2018 18:37, Kees Cook wrote:
> On Tue, Oct 30, 2018 at 8:26 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I suppose the 'normal' attack goes like:
>>
>> 1) find buffer-overrun / bound check failure
>> 2) use that to write to 'interesting' location
>> 3) that write results arbitrary code execution
>> 4) win
>>
>> Of course, if the store of 2 is to the current cred structure, and
>> simply sets the effective uid to 0, we can skip 3.
>
> In most cases, yes, gaining root is game over. However, I don't want
> to discount other threat models: some systems have been designed not
> to trust root, so a cred attack doesn't always get an attacker full
> control (e.g. lockdown series, signed modules, encrypted VMs, etc).
Reading these points I see where I was not clear.
Maybe I can fix that. SELinux is a good example of safeguard against a
takeover of root, so it is a primary target. Unfortunately it contains
some state variables that can be used to disable it.
Another attack that comes to my mind, for executing code within the
kernel, without sweating too much with ROP, is the following:
1) find the rabbit hole to write kernel data, whatever it might be
2) spray the keyring with own key
3) use the module loading infrastructure to load own module, signed with
the key at point 2)
Just to say that direct arbitrary code execution is becoming harder to
perform, but there are ways around it which rely more on overwriting
unprotected data.
They are lower-hanging fruit for an attacker.
--
igor
On Tue, Oct 30, 2018 at 02:25:51PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 11:51:17AM -0700, Andy Lutomirski wrote:
> > Finally, one issue: rare_alloc() is going to utterly suck
> > performance-wise due to the global IPI when the region gets zapped out
> > of the direct map or otherwise made RO. This is the same issue that
> > makes all existing XPO efforts so painful. We need to either optimize
> > the crap out of it somehow or we need to make sure it’s not called
> > except during rare events like device enumeration.
>
> Batching operations is kind of the whole point of the VM ;-) Either
> this rare memory gets used a lot, in which case we'll want to create slab
> caches for it, make it a MM zone and the whole nine yards, or it's not
> used very much in which case it doesn't matter that performance sucks.

Yes, for the dynamic case something along those lines would be needed.
If we have a single rare zone, we could even have __GFP_RARE or whatever
that manages this. The page allocator would have to grow a rare memblock
type, and every rare alloc would allocate from a rare memblock; when none
is available, creation of a rare block would set up the mappings etc..

> For now, I'd suggest allocating 2MB chunks as needed, and having a
> shrinker to hand back any unused pieces.

Something like the percpu allocator?
On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > I support the addition of a rare-write mechanism to the upstream kernel.
> > And I think that there is only one sane way to implement it: using an
> > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > differ from init_mm in that it has extra mappings in the *user* region.
>
> I'd like to understand this approach a little better. In a syscall path,
> we run with the user task's mm. What you're proposing is that when we
> want to modify rare data, we switch to rare_mm which contains a
> writable mapping to all the kernel data which is rare-write.
>
> So the API might look something like this:
>
> void *p = rare_alloc(...); /* writable pointer */
> p->a = x;
> q = rare_protect(p); /* read-only pointer */
>
> To subsequently modify q,
>
> p = rare_modify(q);
> q->a = y;
> rare_protect(p);
Why would you have rare_alloc() imply rare_modify() ? Would you have the
allocator meta data inside the rare section?
On Tue, Oct 30, 2018 at 02:02:12PM -0700, Andy Lutomirski wrote:
> But I dislike allowing regular writes in the protected region. We
> really only need four write primitives:
>
> 1. Just write one value. Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.

Given the !preempt/!IRQ constraints I'd certainly put an upper limit on
the number of bytes there.

> 3,4: Same as 1 and 2 but must be called inside a special rare write
> region. This is purely an optimization.
>
> Actually getting a modifiable pointer should be disallowed for two
> reasons:
>
> 1. Some architectures may want to use a special
> write-different-address-space operation.

You're thinking of s390 ? :-)

> Heck, x86 could, too: make
> the actual offset be a secret and shove the offset into FSBASE or
> similar. Then %fs-prefixed writes would do the rare writes.
>
> 2. Alternatively, x86 could set the U bit. Then the actual writes
> would use the uaccess helpers, giving extra protection via SMAP.

Cute, and yes, something like that would be nice.

> We don’t really want a situation where an unchecked pointer in the
> rare write region completely defeats the mechanism.

Agreed.
On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> To clarify some of this thread, I think that the fact that rare_write
> uses an mm_struct and alias mappings under the hood should be
> completely invisible to users of the API. No one should ever be
> handed a writable pointer to rare_write memory (except perhaps during
> bootup or when initializing a large complex data structure that will
> be rare_write but isn't yet, e.g. the policy db).

Being able to use pointers would make it far easier to do atomics and
other things though.

> For example, there could easily be architectures where having a
> writable alias is problematic.

Mostly we'd just have to be careful of cache aliases, alignment should
be able to sort that I think.

> If you have multiple pools and one mm_struct per pool, you'll need a
> way to find the mm_struct from a given allocation.

Or keep track of it externally. For example by context. If you modify
page-tables you pick the page-table pool, if you modify selinux state,
you pick the selinux pool.

> Regardless of how the mm_structs are set up, changing rare_write
> memory to normal memory or vice versa will require a global TLB flush
> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
> doesn't seem to buy much.

The way I understand it, the point is that if you stick page-tables and
selinux state in different pools, a stray write in one will never affect
the other.
On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
> On 30/10/2018 23:02, Andy Lutomirski wrote:
> > But I dislike allowing regular writes in the protected region. We
> > really only need four write primitives:
> >
> > 1. Just write one value. Call at any time (except NMI).
> >
> > 2. Just copy some bytes. Same as (1) but any number of bytes.
> >
> > 3,4: Same as 1 and 2 but must be called inside a special rare write
> > region. This is purely an optimization.
>
> Atomic? RCU?

RCU can be done, that's not really a problem. Atomics otoh are a
problem. Having pointers makes them just work.

Andy; I understand your reason for not wanting them, but I really don't
want to duplicate everything. Is there something we can do with static
analysis to make you more comfortable with the pointer thing?
On Wed, Oct 31, 2018 at 10:36:59AM +0100, Peter Zijlstra wrote:
> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
> > On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
> > > > On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@chromium.org> wrote:
> > > I support the addition of a rare-write mechanism to the upstream kernel.
> > > And I think that there is only one sane way to implement it: using an
> > > mm_struct. That mm_struct, just like any sane mm_struct, should only
> > > differ from init_mm in that it has extra mappings in the *user* region.
> >
> > I'd like to understand this approach a little better. In a syscall path,
> > we run with the user task's mm. What you're proposing is that when we
> > want to modify rare data, we switch to rare_mm which contains a
> > writable mapping to all the kernel data which is rare-write.
> >
> > So the API might look something like this:
> >
> > void *p = rare_alloc(...); /* writable pointer */
> > p->a = x;
> > q = rare_protect(p); /* read-only pointer */
> >
> > To subsequently modify q,
> >
> > p = rare_modify(q);
> > q->a = y;
> > rare_protect(p);
>
> Why would you have rare_alloc() imply rare_modify() ? Would you have the
> allocator meta data inside the rare section?
Normally when I allocate some memory I need to initialise it before
doing anything else with it ;-)
I mean, you could do:
ro = rare_alloc(..);
rare = rare_modify(ro);
rare->a = x;
rare_protect(rare);
but that's more typing.
On 31/10/2018 11:08, Igor Stoppa wrote:
> Adding SELinux folks and the SELinux ml
>
> I think it's better if they participate in this discussion.
>
> On 31/10/2018 06:41, Andy Lutomirski wrote:
>> On Tue, Oct 30, 2018 at 2:36 PM Matthew Wilcox <willy@infradead.org>
>> wrote:
>>>
>>> On Tue, Oct 30, 2018 at 10:43:14PM +0200, Igor Stoppa wrote:
>>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>>>>> So the API might look something like this:
>>>>>>>
>>>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>>>> p->a = x;
>>>>>>> q = rare_protect(p); /* read-only pointer */
>>>>
>>>> With pools and memory allocated from vmap_areas, I was able to say
>>>>
>>>> protect(pool)
>>>>
>>>> and that would do a swipe on all the pages currently in use.
>>>> In the SELinux policyDB, for example, one doesn't really want to
>>>> individually protect each allocation.
>>>>
>>>> The loading phase happens usually at boot, when the system can be
>>>> assumed to
>>>> be sane (one might even preload a bare-bone set of rules from
>>>> initramfs and
>>>> then replace it later on, with the full blown set).
>>>>
>>>> There is no need to process each of these tens of thousands
>>>> allocations and
>>>> initialization as write-rare.
>>>>
>>>> Would it be possible to do the same here?
>>>
>>> What Andy is proposing effectively puts all rare allocations into
>>> one pool. Although I suppose it could be generalised to multiple pools
>>> ... one mm_struct per pool. Andy, what do you think to doing that?
>>
>> Hmm. Let's see.
>>
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API.
>
> I agree.
>
>> No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
>
> The policy db doesn't need to be write rare.
> Actually, it really shouldn't be write rare.
>
> Maybe it's just a matter of wording, but effectively the policyDB can be
> trated with this sequence:
>
> 1) allocate various data structures in writable form
>
> 2) initialize them
>
> 3) go back to 1 as needed
>
> 4) lock down everything that has been allocated, as Read-Only
> The reason why I stress ReadOnly is that differentiating what is really
> ReadOnly from what is WriteRare provides an extra edge against attacks,
> because attempts to alter ReadOnly data through a WriteRare API could be
> detected
>
> 5) read any part of the policyDB during regular operations
>
> 6) in case of update, create a temporary new version, using steps 1..3
>
> 7) if update successful, use the new one and destroy the old one
>
> 8) if the update failed, destroy the new one
>
> The destruction at points 7 and 8 is not so much a write operation, as
> it is a release of the memory.
>
> So, we might have a bit different interpretation of what write-rare
> means wrt destroying the memory and its content.
>
> To clarify: I've been using write-rare to indicate primarily small
> operations that one would otherwise achieve with "=", memcpy or memset
> or more complex variants, like atomic ops, rcu pointer assignment, etc.
>
> Tearing down an entire set of allocations like the policyDB doesn't fit
> very well with that model.
>
> The only part which _needs_ to be write rare, in the policyDB, is the
> set of pointers which are used to access all the dynamically allocated
> data set.
>
> These pointers must be updated when a new policyDB is allocated.
>
>> For example, there could easily be architectures where having a
>> writable alias is problematic. On such architectures, an entirely
>> different mechanism might work better. And, if a tool like KNOX ever
>> becomes a *part* of the Linux kernel (hint hint!)
>
> Something related, albeit not identical is going on here [1]
> Eventually, it could be expanded to deal also with write rare.
>
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation.
>
> Indeed. In my patchset, based on vmas, I do the following:
> * a private field from the page struct points to the vma using that page
> * inside the vma there is a list_head used only during deletion
> - one pointer is used to chain vmas from the same pool
> - one pointer points to the pool struct
> * the pool struct has the property to use for all the associated
> allocations: is it write-rare, read-only, does it auto protect, etc.
>
>> Regardless of how
>> the mm_structs are set up, changing rare_write memory to normal memory
>> or vice versa will require a global TLB flush (all ASIDs and global
>> pages) on all CPUs, so having extra mm_structs doesn't seem to buy
>> much.
>
> 1) it supports different levels of protection:
> temporarily unprotected vs read-only vs write-rare
>
> 2) the change of write permission should be possible only toward more
> restrictive rules (writable -> write-rare -> read-only) and only to the
> point that was specified while creating the pool, to avoid DOS attacks,
> where a write-rare is flipped into read-only and further updates fail
> (ex: prevent IMA from registering modifications to a file, by not
> letting it store new information - I'm not 100% sure this would work,
> but it gives the idea, I think)
>
> 3) being able to track all the allocations related to a pool would allow
> performing mass operations, like reducing the writability or destroying
> all the allocations.
>
>> (It's just possible that changing rare_write back to normal might be
>> able to avoid the flush if the spurious faults can be handled
>> reliably.)
>
> I do not see the need for such a case of degrading the write permissions
> of an allocation, unless it refers to the release of a pool of
> allocations (see updating the SELinux policy DB)
A few more thoughts about pools. Not sure if they are all correct.
Note: I stick to "pool", instead of mm_struct, because what I'll say is
mostly independent from the implementation.
- As Peter Zijlstra wrote earlier, protecting a target moves the focus
of the attack to something else. In this case, probably, the "something
else" would be the metadata of the pool(s).
- The number of pools needed should be known at compilation time, so the
metadata used for pools could be statically allocated and any change to
it could be treated as write-rare.
- only certain fields of a pool structure would be writable, even as
write-rare, after the pool is initialized.
In case the pool is a mm_struct or a superset (containing also
additional properties, like the type of writability: RO or WR), the field
struct vm_area_struct *mmap;
is an example of what could be protected. It should be alterable only
when creating/destroying the pool and making the first initialization.
- to speed up and also improve the validation of the target of a
write-rare operation, it would be really desirable if the target had
some intrinsic property which clearly differentiates it from
non-write-rare memory. Its address, for example. The amount of write
rare data needed by a full kernel should not exceed a few tens of
megabytes. On a 64 bit system it shouldn't be so bad to reserve an
address range maybe one order of magnitude larger than that.
It could even become a parameter for the creation of a pool.
SELinux, for example, should fit within 10-20MB. Or it could be a
command line parameter.
- even if a hypervisor were present, it would be preferable to use it
exclusively as extra protection, which triggers an exception only when
something abnormal happens. The hypervisor should not become aware of
the actual meaning of kernel (meta)data. Ideally, it would be mostly
used for trapping unexpected writes to pages which are not supposed to
be modified.
- one more reason for using pools is that, if each pool were also acting
as memory cache for its users, attacks relying on use-after-free would
not have access to possible vulnerabilities, because the memory and
addresses associated with a pool would stay with it.
--
igor
> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>> To clarify some of this thread, I think that the fact that rare_write
>> uses an mm_struct and alias mappings under the hood should be
>> completely invisible to users of the API. No one should ever be
>> handed a writable pointer to rare_write memory (except perhaps during
>> bootup or when initializing a large complex data structure that will
>> be rare_write but isn't yet, e.g. the policy db).
>
> Being able to use pointers would make it far easier to do atomics and
> other things though.

This stuff is called *rare* write for a reason. Do we really want to
allow atomics beyond just store-release? Taking a big lock and then
writing in the right order should cover everything, no?

>> For example, there could easily be architectures where having a
>> writable alias is problematic.
>
> Mostly we'd just have to be careful of cache aliases, alignment should
> be able to sort that I think.
>
>> If you have multiple pools and one mm_struct per pool, you'll need a
>> way to find the mm_struct from a given allocation.
>
> Or keep track of it externally. For example by context. If you modify
> page-tables you pick the page-table pool, if you modify selinux state,
> you pick the selinux pool.
>
>> Regardless of how the mm_structs are set up, changing rare_write
>> memory to normal memory or vice versa will require a global TLB flush
>> (all ASIDs and global pages) on all CPUs, so having extra mm_structs
>> doesn't seem to buy much.
>
> The way I understand it, the point is that if you stick page-tables and
> selinux state in different pools, a stray write in one will never affect
> the other.

Hmm. That’s not totally crazy, but the API would need to be carefully
designed.
And some argument would have to be made as to why it’s better to use a different address space as opposed to checking in software along the lines of the uaccess checking.
> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>> On 30/10/2018 23:02, Andy Lutomirski wrote:
>
>>> But I dislike allowing regular writes in the protected region. We
>>> really only need four write primitives:
>>>
>>> 1. Just write one value. Call at any time (except NMI).
>>>
>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>>
>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>> region. This is purely an optimization.
>>
>> Atomic? RCU?
>
> RCU can be done, that's not really a problem. Atomics otoh are a
> problem. Having pointers makes them just work.
>
> Andy; I understand your reason for not wanting them, but I really don't
> want to duplicate everything. Is there something we can do with static
> analysis to make you more comfortable with the pointer thing?
I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.
And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?
> On Oct 31, 2018, at 1:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
>>> On Oct 31, 2018, at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Oct 31, 2018 at 12:15:46AM +0200, Igor Stoppa wrote:
>>> On 30/10/2018 23:02, Andy Lutomirski wrote:
>>
>>>> But I dislike allowing regular writes in the protected region. We
>>>> really only need four write primitives:
>>>>
>>>> 1. Just write one value. Call at any time (except NMI).
>>>>
>>>> 2. Just copy some bytes. Same as (1) but any number of bytes.
>>>>
>>>> 3,4: Same as 1 and 2 but must be called inside a special rare write
>>>> region. This is purely an optimization.
>>>
>>> Atomic? RCU?
>>
>> RCU can be done, that's not really a problem. Atomics otoh are a
>> problem. Having pointers makes them just work.
>>
>> Andy; I understand your reason for not wanting them, but I really don't
>> want to duplicate everything. Is there something we can do with static
>> analysis to make you more comfortable with the pointer thing?
>
> I’m sure we could do something with static analysis, but I think seeing a real use case where all this fanciness makes sense would be good.
>
> And I don’t know if s390 *can* have an efficient implementation that uses pointers. OTOH they have all kinds of magic stuff, so who knows?
Also, if we’re using a hypervisor, then there are a couple ways it could be done:
1. VMFUNC. Pointers work fine. This is stronger than any amount of CR3 trickery because it can’t be defeated by page table attacks.
2. A hypercall to do the write. No pointers.
Basically, I think that if we can get away without writable pointers, we get more flexibility and less need for fancy static analysis. If we do need pointers, then so be it.
On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
>
> > On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
> >> To clarify some of this thread, I think that the fact that rare_write
> >> uses an mm_struct and alias mappings under the hood should be
> >> completely invisible to users of the API. No one should ever be
> >> handed a writable pointer to rare_write memory (except perhaps during
> >> bootup or when initializing a large complex data structure that will
> >> be rare_write but isn't yet, e.g. the policy db).
> >
> > Being able to use pointers would make it far easier to do atomics and
> > other things though.
>
> This stuff is called *rare* write for a reason. Do we really want to
> allow atomics beyond just store-release? Taking a big lock and then
> writing in the right order should cover everything, no?
Ah, so no. That naming is very misleading.
We modify page-tables a _lot_. The point is that only a few sanctioned
sites are allowed writing to it, not everybody.
I _think_ the use-case for atomics is updating the reference counts of
objects that are in this write-rare domain. But I'm not entirely clear
on that myself either. I just really want to avoid duplicating that
stuff.
> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
>>
>>>> On Oct 31, 2018, at 3:02 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Tue, Oct 30, 2018 at 09:41:13PM -0700, Andy Lutomirski wrote:
>>>> To clarify some of this thread, I think that the fact that rare_write
>>>> uses an mm_struct and alias mappings under the hood should be
>>>> completely invisible to users of the API. No one should ever be
>>>> handed a writable pointer to rare_write memory (except perhaps during
>>>> bootup or when initializing a large complex data structure that will
>>>> be rare_write but isn't yet, e.g. the policy db).
>>>
>>> Being able to use pointers would make it far easier to do atomics and
>>> other things though.
>>
>> This stuff is called *rare* write for a reason. Do we really want to
>> allow atomics beyond just store-release? Taking a big lock and then
>> writing in the right order should cover everything, no?
>
> Ah, so no. That naming is very misleading.
>
> We modify page-tables a _lot_. The point is that only a few sanctioned
> sites are allowed writing to it, not everybody.
>
> I _think_ the use-case for atomics is updating the reference counts of
> objects that are in this write-rare domain. But I'm not entirely clear
> on that myself either. I just really want to avoid duplicating that
> stuff.
Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
Can we see a *real* use case before we over complicate the API?
On 01/11/2018 00:57, Andy Lutomirski wrote:
>
>> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> I _think_ the use-case for atomics is updating the reference counts of
>> objects that are in this write-rare domain. But I'm not entirely clear
>> on that myself either. I just really want to avoid duplicating that
>> stuff.
>
> Sounds nuts. Doing a rare-write is many hundreds of cycles at best.
> Using that for a reference count sounds wacky.
>
> Can we see a *real* use case before we over complicate the API?

Does patch #14 of this set not qualify? ima_htable.len ?

https://www.openwall.com/lists/kernel-hardening/2018/10/23/20

--
igor
On Wed, Oct 31, 2018 at 4:10 PM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
>
>
> On 01/11/2018 00:57, Andy Lutomirski wrote:
> >
> >
> >> On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
>
> >> I _think_ the use-case for atomics is updating the reference counts of
> >> objects that are in this write-rare domain. But I'm not entirely clear
> >> on that myself either. I just really want to avoid duplicating that
> >> stuff.
> >
> > Sounds nuts. Doing a rare-write is many hundreds of cycles at best. Using that for a reference count sounds wacky.
> >
> > Can we see a *real* use case before we over complicate the API?
> >
>
>
> Does patch #14 of this set not qualify? ima_htable.len ?
>
> https://www.openwall.com/lists/kernel-hardening/2018/10/23/20
>
Do you mean this (sorry for whitespace damage):

-	atomic_long_inc(&ima_htable.len);
+	pratomic_long_inc(&ima_htable.len);
 	if (update_htable) {
 		key = ima_hash_key(entry->digest);
-		hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+		prhlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
 	}
ISTM you don't need that atomic operation -- you could take a spinlock
and then just add one directly to the variable.
--Andy
On 01/11/2018 01:19, Andy Lutomirski wrote:
> ISTM you don't need that atomic operation -- you could take a spinlock
> and then just add one directly to the variable.
It was my intention to provide a 1:1 conversion of existing code, as it
should be easier to verify the correctness of the conversion, as long as
there isn't any significant degradation in performance.
The rework could be done afterward.
--
igor
On Wed, Oct 31, 2018 at 2:10 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 30, 2018 at 04:28:16PM +0000, Will Deacon wrote:
>> On Tue, Oct 30, 2018 at 04:58:41PM +0100, Peter Zijlstra wrote:
>> > Like mentioned elsewhere; if you do write_enable() + write_disable()
>> > thingies, it all becomes:
>> >
>> > write_enable();
>> > atomic_foo(&bar);
>> > write_disable();
>> >
>> > No magic gunk infested duplication at all. Of course, ideally you'd then
>> > teach objtool about this (or a GCC plugin I suppose) to ensure any
>> > enable reached a disable.
>>
>> Isn't the issue here that we don't want to change the page tables for the
>> mapping of &bar, but instead want to create a temporary writable alias
>> at a random virtual address?
>>
>> So you'd want:
>>
>> wbar = write_enable(&bar);
>> atomic_foo(wbar);
>> write_disable(wbar);
>>
>> which is probably better expressed as a map/unmap API. I suspect this
>> would also be the only way to do things for cmpxchg() loops, where you
>> want to create the mapping outside of the loop to minimise your time in
>> the critical section.
>
> Ah, so I was thinking that the alternative mm would have stuff in the
> same location, just RW instead of RO.
I was hoping for the same location too. That allows use to use a gcc
plugin to mark, say, function pointer tables, as read-only, and
annotate their rare updates with write_rare() without any
recalculation.
-Kees
--
Kees Cook
Igor,
On Thu, 1 Nov 2018, Igor Stoppa wrote:
> On 01/11/2018 01:19, Andy Lutomirski wrote:
>
> > ISTM you don't need that atomic operation -- you could take a spinlock
> > and then just add one directly to the variable.
>
> It was my intention to provide a 1:1 conversion of existing code, as it should
> be easier to verify the correctness of the conversion, as long as there isn't
> any significant degradation in performance.
>
> The rework could be done afterward.
Please don't go there. The usual approach is to
1) Rework existing code in a way that the new functionality can be added
with minimal effort afterwards and without creating API horrors.
2) Verify correctness of the rework
3) Add the new functionality
That avoids creation of odd functionality and APIs in the first place, so
they won't be used in other places and does not leave half cleaned up code
around which will stick for a long time.
Thanks,
tglx
On 01/11/2018 10:21, Thomas Gleixner wrote:
> On Thu, 1 Nov 2018, Igor Stoppa wrote:
>> The rework could be done afterward.
>
> Please don't go there. The usual approach is to
>
> 1) Rework existing code in a way that the new functionality can be added
>    with minimal effort afterwards and without creating API horrors.

Thanks a lot for the advice. It makes things even easier for me, as I
can start the rework, while the discussion on the actual write-rare
mechanism continues.

--
igor
From: Peter Zijlstra
Sent: October 31, 2018 at 9:08:35 AM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>, Matthew Wilcox <willy@infradead.org>, Kees Cook <keescook@chromium.org>, Igor Stoppa <igor.stoppa@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, linux-security-module <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
> On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
>>> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
>>> that it was a big change and we wanted a simpler fix to backport. But
>>> that’s all done now, and I, at least, rather liked your code. :)
>>
>> I guess since it was based on your ideas…
>>
>> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
>> make kgdb use of poke_text() less disgusting. I still don’t know exactly
>> how to deal with it.
>>
>> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
>> v2.
>
> Oh man, I completely forgot about kgdb :/
>
> Also, would it be possible to entirely remove that kmap fallback path?
Let me see what I can do about it.
On Wed, Oct 31, 2018 at 03:57:21PM -0700, Andy Lutomirski wrote:
> > On Oct 31, 2018, at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Wed, Oct 31, 2018 at 01:36:48PM -0700, Andy Lutomirski wrote:
> >> This stuff is called *rare* write for a reason. Do we really want to
> >> allow atomics beyond just store-release? Taking a big lock and then
> >> writing in the right order should cover everything, no?
> >
> > Ah, so no. That naming is very misleading.
> >
> > We modify page-tables a _lot_. The point is that only a few sanctioned
> > sites are allowed writing to it, not everybody.
> >
> > I _think_ the use-case for atomics is updating the reference counts of
> > objects that are in this write-rare domain. But I'm not entirely clear
> > on that myself either. I just really want to avoid duplicating that
> > stuff.
>
> Sounds nuts. Doing a rare-write is many hundreds of cycles at best.

Yes, which is why I'm somewhat sceptical of the whole endeavour.
From: Nadav Amit
Sent: November 1, 2018 at 4:31:59 PM GMT
> To: Peter Zijlstra <peterz@infradead.org>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
> From: Peter Zijlstra
> Sent: October 31, 2018 at 9:08:35 AM GMT
>> To: Nadav Amit <nadav.amit@gmail.com>
>> Subject: Re: [PATCH 10/17] prmem: documentation
>>
>>
>> On Tue, Oct 30, 2018 at 04:18:39PM -0700, Nadav Amit wrote:
>>>> Nadav, want to resubmit your series? IIRC the only thing wrong with it was
>>>> that it was a big change and we wanted a simpler fix to backport. But
>>>> that’s all done now, and I, at least, rather liked your code. :)
>>>
>>> I guess since it was based on your ideas…
>>>
>>> Anyhow, the only open issue that I have with v2 is Peter’s wish that I would
>>> make kgdb use of poke_text() less disgusting. I still don’t know exactly
>>> how to deal with it.
>>>
>>> Perhaps it (fixing kgdb) can be postponed? In that case I can just resend
>>> v2.
>>
>> Oh man, I completely forgot about kgdb :/
>>
>> Also, would it be possible to entirely remove that kmap fallback path?
>
> Let me see what I can do about it.
My patches had several embarrassing bugs. I’m fixing and will resend later,
hopefully today.
Hello,

I've been studying v4 of the patch-set [1] that Nadav has been working on.
Incidentally, I think it would be useful to cc also the security/hardening
ml. The patch-set seems to be close to final, so I am resuming this
discussion.

On 30/10/2018 19:06, Andy Lutomirski wrote:

> I support the addition of a rare-write mechanism to the upstream kernel.
> And I think that there is only one sane way to implement it: using an
> mm_struct. That mm_struct, just like any sane mm_struct, should only
> differ from init_mm in that it has extra mappings in the *user* region.

After reading the code, I see what you meant. I think I can work with it.

But I have a couple of questions wrt the use of this mechanism, in the
context of write rare.

1) mm_struct.

Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
(live patch?), which seems to happen sequentially and in a relatively
standardized way, like replacing the NOPs specifically placed in the
functions that need patching.

This is a bit different from the more generic write-rare case, applied
to data.

As example, I have in mind a system where both IMA and SELinux are in use.

In this system, a file is accessed for the first time. That would trigger
2 things:
- evaluation of the SELinux rules and probably update of the AVC cache
- IMA measurement and update of the measurements

Both of them could be write protected, meaning that they would both have
to be modified through the write rare mechanism.

While the events, for 1 specific file, would be sequential, it's not
difficult to imagine that multiple files could be accessed at the same
time.

If the update of the data structures in both IMA and SELinux must use
the same mm_struct, that would have to be somehow regulated and it would
introduce an unnecessary (imho) dependency.

How about having one mm_struct for each writer (core or thread)?
2) Iiuc, the purpose of the 2 pages being remapped is that the target of
the patch might spill across the page boundary, however if I deal with
the modification of generic data, I shouldn't (shouldn't I?) assume that
the data will not span across multiple pages.

If the data spans across multiple pages, in unknown amount, I suppose
that I should not keep interrupts disabled for an unknown time, as it
would hurt preemption.

What I thought, in my initial patch-set, was to iterate over each page
that must be written to, in a loop, re-enabling interrupts in-between
iterations, to give pending interrupts a chance to be served.

This would mean that the data being written to would not be consistent,
but it's a problem that would have to be addressed anyways, since it can
be still read by other cores, while the write is ongoing.

Is this a valid concern/approach?

--
igor

[1] https://lkml.org/lkml/2018/11/11/110
On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hello,
> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> Incidentally, I think it would be useful to cc also the
> security/hardening ml.
> The patch-set seems to be close to final, so I am resuming this discussion.
>
> On 30/10/2018 19:06, Andy Lutomirski wrote:
>
> > I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>
> After reading the code, I see what you meant.
> I think I can work with it.
>
> But I have a couple of questions wrt the use of this mechanism, in the
> context of write rare.
>
> 1) mm_struct.
>
> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> (live patch?), which seems to happen sequentially and in a relatively
> standardized way, like replacing the NOPs specifically placed in the
> functions that need patching.
>
> This is a bit different from the more generic write-rare case, applied
> to data.
>
> As example, I have in mind a system where both IMA and SELinux are in use.
>
> In this system, a file is accessed for the first time.
>
> That would trigger 2 things:
> - evaluation of the SELinux rules and probably update of the AVC cache
> - IMA measurement and update of the measurements
>
> Both of them could be write protected, meaning that they would both have
> to be modified through the write rare mechanism.
>
> While the events, for 1 specific file, would be sequential, it's not
> difficult to imagine that multiple files could be accessed at the same time.
>
> If the update of the data structures in both IMA and SELinux must use
> the same mm_struct, that would have to be somehow regulated and it would
> introduce an unnecessary (imho) dependency.
>
> How about having one mm_struct for each writer (core or thread)?

I don't think that helps anything. I think the mm_struct used for
prmem (or rare_write or whatever you want to call it) should be
entirely abstracted away by an appropriate API, so neither SELinux nor
IMA need to be aware that there's an mm_struct involved. It's also
entirely possible that some architectures won't even use an mm_struct
behind the scenes -- x86, for example, could have avoided it if there
were a kernel equivalent of PKRU. Sadly, there isn't.

> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> the patch might spill across the page boundary, however if I deal with
> the modification of generic data, I shouldn't (shouldn't I?) assume that
> the data will not span across multiple pages.

The reason for the particular architecture of text_poke() is to avoid
memory allocation to get it working. I think that prmem/rare_write
should have each rare-writable kernel address map to a unique user
address, possibly just by offsetting everything by a constant. For
rare_write, you don't actually need it to work as such until fairly
late in boot, since the rare_writable data will just be writable early
on.

> If the data spans across multiple pages, in unknown amount, I suppose
> that I should not keep interrupts disabled for an unknown time, as it
> would hurt preemption.
>
> What I thought, in my initial patch-set, was to iterate over each page
> that must be written to, in a loop, re-enabling interrupts in-between
> iterations, to give pending interrupts a chance to be served.
>
> This would mean that the data being written to would not be consistent,
> but it's a problem that would have to be addressed anyways, since it can
> be still read by other cores, while the write is ongoing.

This probably makes sense, except that enabling and disabling
interrupts means you also need to restore the original mm_struct (most
likely), which is slow. I don't think there's a generic way to check
whether an interrupt is pending without turning interrupts on.
From: Andy Lutomirski
Sent: November 13, 2018 at 5:16:09 PM GMT
> To: Igor Stoppa <igor.stoppa@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Nadav Amit <nadav.amit@gmail.com>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>> Hello,
>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>> Incidentally, I think it would be useful to cc also the
>> security/hardening ml.
>> The patch-set seems to be close to final, so I am resuming this discussion.
>>
>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>>
>>> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>>
>> After reading the code, I see what you meant.
>> I think I can work with it.
>>
>> But I have a couple of questions wrt the use of this mechanism, in the
>> context of write rare.
>>
>>
>> 1) mm_struct.
>>
>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>> (live patch?), which seems to happen sequentially and in a relatively
>> standardized way, like replacing the NOPs specifically placed in the
>> functions that need patching.
>>
>> This is a bit different from the more generic write-rare case, applied
>> to data.
>>
>> As example, I have in mind a system where both IMA and SELinux are in use.
>>
>> In this system, a file is accessed for the first time.
>>
>> That would trigger 2 things:
>> - evaluation of the SELinux rules and probably update of the AVC cache
>> - IMA measurement and update of the measurements
>>
>> Both of them could be write protected, meaning that they would both have
>> to be modified through the write rare mechanism.
>>
>> While the events, for 1 specific file, would be sequential, it's not
>> difficult to imagine that multiple files could be accessed at the same time.
>>
>> If the update of the data structures in both IMA and SELinux must use
>> the same mm_struct, that would have to be somehow regulated and it would
>> introduce an unnecessary (imho) dependency.
>>
>> How about having one mm_struct for each writer (core or thread)?
>
> I don't think that helps anything. I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it) should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved. It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU. Sadly, there isn't.
>
>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
>
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working. i think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant. For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.
>
>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
>
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow. I don't think there's a generic way to check
> whether in interrupt is pending without turning interrupts on.
I guess that enabling IRQs might break some hidden assumptions in the code,
but is there a fundamental reason that IRQs need to be disabled? use_mm()
got them enabled, although it is only suitable for kernel threads.
On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Andy Lutomirski
> Sent: November 13, 2018 at 5:16:09 PM GMT
> > To: Igor Stoppa <igor.stoppa@gmail.com>
> > Subject: Re: [PATCH 10/17] prmem: documentation
> >
> >
> > On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
> >> Hello,
> >> I've been studying v4 of the patch-set [1] that Nadav has been working on.
> >> Incidentally, I think it would be useful to cc also the
> >> security/hardening ml.
> >> The patch-set seems to be close to final, so I am resuming this discussion.
> >>
> >> On 30/10/2018 19:06, Andy Lutomirski wrote:
> >>
> >>> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
> >>
> >> After reading the code, I see what you meant.
> >> I think I can work with it.
> >>
> >> But I have a couple of questions wrt the use of this mechanism, in the
> >> context of write rare.
> >>
> >>
> >> 1) mm_struct.
> >>
> >> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
> >> (live patch?), which seems to happen sequentially and in a relatively
> >> standardized way, like replacing the NOPs specifically placed in the
> >> functions that need patching.
> >>
> >> This is a bit different from the more generic write-rare case, applied
> >> to data.
> >>
> >> As example, I have in mind a system where both IMA and SELinux are in use.
> >>
> >> In this system, a file is accessed for the first time.
> >>
> >> That would trigger 2 things:
> >> - evaluation of the SELinux rules and probably update of the AVC cache
> >> - IMA measurement and update of the measurements
> >>
> >> Both of them could be write protected, meaning that they would both have
> >> to be modified through the write rare mechanism.
> >>
> >> While the events, for 1 specific file, would be sequential, it's not
> >> difficult to imagine that multiple files could be accessed at the same time.
> >>
> >> If the update of the data structures in both IMA and SELinux must use
> >> the same mm_struct, that would have to be somehow regulated and it would
> >> introduce an unnecessary (imho) dependency.
> >>
> >> How about having one mm_struct for each writer (core or thread)?
> >
> > I don't think that helps anything. I think the mm_struct used for
> > prmem (or rare_write or whatever you want to call it) should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved. It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU. Sadly, there isn't.
> >
> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working. i think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant. For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
> >
> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow. I don't think there's a generic way to check
> > whether in interrupt is pending without turning interrupts on.
>
> I guess that enabling IRQs might break some hidden assumptions in the code,
> but is there a fundamental reason that IRQs need to be disabled? use_mm()
> got them enabled, although it is only suitable for kernel threads.
>
For general rare-writish stuff, I don't think we want IRQs running
with them mapped anywhere for write. For AVC and IMA, I'm less sure.
--Andy
From: Andy Lutomirski
Sent: November 13, 2018 at 5:47:16 PM GMT
> To: Nadav Amit <nadav.amit@gmail.com>
> Cc: Igor Stoppa <igor.stoppa@gmail.com>, Kees Cook <keescook@chromium.org>, Peter Zijlstra <peterz@infradead.org>, Mimi Zohar <zohar@linux.vnet.ibm.com>, Matthew Wilcox <willy@infradead.org>, Dave Chinner <david@fromorbit.com>, James Morris <jmorris@namei.org>, Michal Hocko <mhocko@kernel.org>, Kernel Hardening <kernel-hardening@lists.openwall.com>, linux-integrity <linux-integrity@vger.kernel.org>, LSM List <linux-security-module@vger.kernel.org>, Igor Stoppa <igor.stoppa@huawei.com>, Dave Hansen <dave.hansen@linux.intel.com>, Jonathan Corbet <corbet@lwn.net>, Laura Abbott <labbott@redhat.com>, Randy Dunlap <rdunlap@infradead.org>, Mike Rapoport <rppt@linux.vnet.ibm.com>, open list:DOCUMENTATION <linux-doc@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: [PATCH 10/17] prmem: documentation
>
>
> On Tue, Nov 13, 2018 at 9:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Andy Lutomirski
>> Sent: November 13, 2018 at 5:16:09 PM GMT
>>> To: Igor Stoppa <igor.stoppa@gmail.com>
>>> Subject: Re: [PATCH 10/17] prmem: documentation
>>>
>>>
>>> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>>> Hello,
>>>> I've been studying v4 of the patch-set [1] that Nadav has been working on.
>>>> Incidentally, I think it would be useful to cc also the
>>>> security/hardening ml.
>>>> The patch-set seems to be close to final, so I am resuming this discussion.
>>>>
>>>> On 30/10/2018 19:06, Andy Lutomirski wrote:
>>>>
>>>>> I support the addition of a rare-write mechanism to the upstream kernel. And I think that there is only one sane way to implement it: using an mm_struct. That mm_struct, just like any sane mm_struct, should only differ from init_mm in that it has extra mappings in the *user* region.
>>>>
>>>> After reading the code, I see what you meant.
>>>> I think I can work with it.
>>>>
>>>> But I have a couple of questions wrt the use of this mechanism, in the
>>>> context of write rare.
>>>>
>>>>
>>>> 1) mm_struct.
>>>>
>>>> Iiuc, the purpose of the patchset is mostly (only?) to patch kernel code
>>>> (live patch?), which seems to happen sequentially and in a relatively
>>>> standardized way, like replacing the NOPs specifically placed in the
>>>> functions that need patching.
>>>>
>>>> This is a bit different from the more generic write-rare case, applied
>>>> to data.
>>>>
>>>> As example, I have in mind a system where both IMA and SELinux are in use.
>>>>
>>>> In this system, a file is accessed for the first time.
>>>>
>>>> That would trigger 2 things:
>>>> - evaluation of the SELinux rules and probably update of the AVC cache
>>>> - IMA measurement and update of the measurements
>>>>
>>>> Both of them could be write protected, meaning that they would both have
>>>> to be modified through the write rare mechanism.
>>>>
>>>> While the events, for 1 specific file, would be sequential, it's not
>>>> difficult to imagine that multiple files could be accessed at the same time.
>>>>
>>>> If the update of the data structures in both IMA and SELinux must use
>>>> the same mm_struct, that would have to be somehow regulated and it would
>>>> introduce an unnecessary (imho) dependency.
>>>>
>>>> How about having one mm_struct for each writer (core or thread)?
>>>
>>> I don't think that helps anything. I think the mm_struct used for
>>> prmem (or rare_write or whatever you want to call it) should be
>>> entirely abstracted away by an appropriate API, so neither SELinux nor
>>> IMA need to be aware that there's an mm_struct involved. It's also
>>> entirely possible that some architectures won't even use an mm_struct
>>> behind the scenes -- x86, for example, could have avoided it if there
>>> were a kernel equivalent of PKRU. Sadly, there isn't.
>>>
>>>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>>>> the patch might spill across the page boundary, however if I deal with
>>>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>>>> the data will not span across multiple pages.
>>>
>>> The reason for the particular architecture of text_poke() is to avoid
>>> memory allocation to get it working. i think that prmem/rare_write
>>> should have each rare-writable kernel address map to a unique user
>>> address, possibly just by offsetting everything by a constant. For
>>> rare_write, you don't actually need it to work as such until fairly
>>> late in boot, since the rare_writable data will just be writable early
>>> on.
>>>
>>>> If the data spans across multiple pages, in unknown amount, I suppose
>>>> that I should not keep interrupts disabled for an unknown time, as it
>>>> would hurt preemption.
>>>>
>>>> What I thought, in my initial patch-set, was to iterate over each page
>>>> that must be written to, in a loop, re-enabling interrupts in-between
>>>> iterations, to give pending interrupts a chance to be served.
>>>>
>>>> This would mean that the data being written to would not be consistent,
>>>> but it's a problem that would have to be addressed anyways, since it can
>>>> be still read by other cores, while the write is ongoing.
>>>
>>> This probably makes sense, except that enabling and disabling
>>> interrupts means you also need to restore the original mm_struct (most
>>> likely), which is slow. I don't think there's a generic way to check
>>> whether in interrupt is pending without turning interrupts on.
>>
>> I guess that enabling IRQs might break some hidden assumptions in the code,
>> but is there a fundamental reason that IRQs need to be disabled? use_mm()
>> got them enabled, although it is only suitable for kernel threads.
>
> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
Oh.. Of course. It is sort of a measure to prevent a single malicious/faulty
write from corrupting the sensitive memory. Doing so limits the code that
can compromise the security by a write to the protected data-structures
(rephrasing for myself).
I think I should add it as a comment in your patch.
On 13/11/2018 19:16, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:

[...]

>> How about having one mm_struct for each writer (core or thread)?
>
> I don't think that helps anything. I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it)

write_rare / rarely can be shortened to wr_, which is kinda less confusing
than rare_write, since that would become rw_ and would be easier to confuse
with R/W.

Any advice for better naming is welcome.

> should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.

Yes, that is fine. In my proposal I was thinking about tying it to the
core/thread that performs the actual write.

The high level API could be something like:

wr_memcpy(void *src, void *dst, uint_t size)

> It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU. Sadly, there isn't.

The mm_struct - or whatever is the means to do the write on that
architecture - can be kept hidden from the API.

But the reason why I was proposing to have one mm_struct per writer is
that, iiuc, the secondary mapping is created in the secondary mm_struct
for each writer using it.

So the updating of IMA measurements would have, theoretically, also
write access to the SELinux AVC. Which I was trying to avoid.
And similarly any other write rare updater. Is this correct?

>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
>
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working. I think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant. For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.

Yes, that is true. I think it's safe to assume, from an attack pattern,
that as long as user space is not started, the system can be considered ok.
Even user-space code run from initrd should be ok, since it can be bundled
(and signed) as a single binary with the kernel.

Modules loaded from a regular filesystem are a bit more risky, because an
attack might inject a rogue key in the key-ring and use it to load
malicious modules.

>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
>
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow. I don't think there's a generic way to check
> whether an interrupt is pending without turning interrupts on.

The only "excuse" I have is that write_rare is opt-in and is "rare".
Maybe the enabling/disabling of interrupts - and the consequent switch of
mm_struct - could be somehow tied to the latency configuration?

If preemption is disabled, the expectations on the system latency are
anyway more relaxed.

But I'm not sure how it would work against I/O.

--
igor
On 13/11/2018 19:47, Andy Lutomirski wrote:
> For general rare-writish stuff, I don't think we want IRQs running
> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
Why would these be less sensitive?
But I see a big difference between my initial implementation and this one.
In my case, by using a shared mapping, visible to all cores, freezing
the core that is performing the write would have exposed the writable
mapping to a potential attack run from another core.
If the mapping is private to the core performing the write, even if it
is frozen, it's much harder to figure out what it had mapped and where,
from another core.
To access that mapping, the attack should be performed from the ISR, I
think.
--
igor
I forgot one sentence :-(
On 13/11/2018 20:31, Igor Stoppa wrote:
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
>> For general rare-writish stuff, I don't think we want IRQs running
>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?
>
> But I see a big difference between my initial implementation and this one.
>
> In my case, by using a shared mapping, visible to all cores, freezing
> the core that is performing the write would have exposed the writable
> mapping to a potential attack run from another core.
>
> If the mapping is private to the core performing the write, even if it
> is frozen, it's much harder to figure out what it had mapped and where,
> from another core.
>
> To access that mapping, the attack should be performed from the ISR, I
> think.
Unless the secondary mapping is also available to other cores, through
the shared mm_struct ?
--
igor
On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:16, Andy Lutomirski wrote:
>
> > should be
> > entirely abstracted away by an appropriate API, so neither SELinux nor
> > IMA need to be aware that there's an mm_struct involved.
>
> Yes, that is fine. In my proposal I was thinking about tying it to the
> core/thread that performs the actual write.
>
> The high level API could be something like:
>
> wr_memcpy(void *src, void *dst, uint_t size)
>
> > It's also
> > entirely possible that some architectures won't even use an mm_struct
> > behind the scenes -- x86, for example, could have avoided it if there
> > were a kernel equivalent of PKRU. Sadly, there isn't.
>
> The mm_struct - or whatever is the means to do the write on that
> architecture - can be kept hidden from the API.
>
> But the reason why I was proposing to have one mm_struct per writer is
> that, iiuc, the secondary mapping is created in the secondary mm_struct
> for each writer using it.
>
> So the updating of IMA measurements would have, theoretically, also
> write access to the SELinux AVC. Which I was trying to avoid.
> And similarly any other write rare updater. Is this correct?

If you call a wr_memcpy() function with the signature you suggested,
then you can overwrite any memory of this type. Having a different
mm_struct under the hood makes no difference. As far as I'm
concerned, for *dynamically allocated* rare-writable memory, you might
as well allocate all the paging structures at the same time, so the
mm_struct will always contain the mappings. If there are serious bugs
in wr_memcpy() that cause it to write to the wrong place, we have
bigger problems.

I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
but that can wait for v2. And it still doesn't obviously need
multiple mm_structs.

> >> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
> >> the patch might spill across the page boundary, however if I deal with
> >> the modification of generic data, I shouldn't (shouldn't I?) assume that
> >> the data will not span across multiple pages.
> >
> > The reason for the particular architecture of text_poke() is to avoid
> > memory allocation to get it working. I think that prmem/rare_write
> > should have each rare-writable kernel address map to a unique user
> > address, possibly just by offsetting everything by a constant. For
> > rare_write, you don't actually need it to work as such until fairly
> > late in boot, since the rare_writable data will just be writable early
> > on.
>
> Yes, that is true. I think it's safe to assume, from an attack pattern,
> that as long as user space is not started, the system can be considered
> ok. Even user-space code run from initrd should be ok, since it can be
> bundled (and signed) as a single binary with the kernel.
>
> Modules loaded from a regular filesystem are a bit more risky, because
> an attack might inject a rogue key in the key-ring and use it to load
> malicious modules.

If a malicious module is loaded, the game is over.

> >> If the data spans across multiple pages, in unknown amount, I suppose
> >> that I should not keep interrupts disabled for an unknown time, as it
> >> would hurt preemption.
> >>
> >> What I thought, in my initial patch-set, was to iterate over each page
> >> that must be written to, in a loop, re-enabling interrupts in-between
> >> iterations, to give pending interrupts a chance to be served.
> >>
> >> This would mean that the data being written to would not be consistent,
> >> but it's a problem that would have to be addressed anyways, since it can
> >> be still read by other cores, while the write is ongoing.
> >
> > This probably makes sense, except that enabling and disabling
> > interrupts means you also need to restore the original mm_struct (most
> > likely), which is slow. I don't think there's a generic way to check
> > whether an interrupt is pending without turning interrupts on.
>
> The only "excuse" I have is that write_rare is opt-in and is "rare".
> Maybe the enabling/disabling of interrupts - and the consequent switch
> of mm_struct - could be somehow tied to the latency configuration?
>
> If preemption is disabled, the expectations on the system latency are
> anyway more relaxed.
>
> But I'm not sure how it would work against I/O.

I think it's entirely reasonable for the API to internally break up
very large memcpys.
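The page-by-page iteration being discussed (briefly re-enabling
interrupts between chunks) reduces to splitting the copy at page
boundaries. A sketch of just that splitting logic, with hypothetical
names (`wr_chunks` is not part of the patchset, and the page size is
illustrative):

```python
PAGE_SIZE = 4096  # illustrative; the real value is architecture-specific

def wr_chunks(dst, size):
    """Yield (address, length) pieces of a rare write, each confined to
    a single page, so interrupts could be re-enabled in between."""
    off = 0
    while off < size:
        # Bytes remaining in the page that contains dst + off.
        in_page = PAGE_SIZE - ((dst + off) % PAGE_SIZE)
        step = min(size - off, in_page)
        yield dst + off, step
        off += step
```

For example, a 20-byte write starting 6 bytes before a page boundary
splits into a 6-byte chunk and a 14-byte chunk.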
On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> I forgot one sentence :-(
>
> On 13/11/2018 20:31, Igor Stoppa wrote:
> > On 13/11/2018 19:47, Andy Lutomirski wrote:
> >
> >> For general rare-writish stuff, I don't think we want IRQs running
> >> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
> >
> > Why would these be less sensitive?
> >
> > But I see a big difference between my initial implementation and this one.
> >
> > In my case, by using a shared mapping, visible to all cores, freezing
> > the core that is performing the write would have exposed the writable
> > mapping to a potential attack run from another core.
> >
> > If the mapping is private to the core performing the write, even if it
> > is frozen, it's much harder to figure out what it had mapped and where,
> > from another core.
> >
> > To access that mapping, the attack should be performed from the ISR, I
> > think.
>
> Unless the secondary mapping is also available to other cores, through
> the shared mm_struct ?
>
I don't think this matters much. The other cores will only be able to
use that mapping when they're doing a rare write.
On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> On 13/11/2018 19:47, Andy Lutomirski wrote:
>
> > For general rare-writish stuff, I don't think we want IRQs running
> > with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>
> Why would these be less sensitive?
I'm not really saying they're less sensitive so much as that the
considerations are different. I think the original rare-write code is
based on ideas from grsecurity, and it was intended to protect static
data like structs full of function pointers. Those targets have some
different properties:
- Static targets are at addresses that are much more guessable, so
they're easier targets for most attacks. (Not spraying attacks like
the ones you're interested in, though.)
- Static targets are higher value. No offense to IMA or AVC, but
outright execution of shellcode, hijacking of control flow, or complete
disablement of core security features is higher impact than bypassing
SELinux or IMA. Why would you bother corrupting the AVC if you could
instead just set enforcing=0? (I suppose that corrupting the AVC is
less likely to be noticed by monitoring tools.)
- Static targets are small. This means that the interrupt latency
would be negligible, especially in comparison to the latency of
replacing the entire SELinux policy object.
Anyway, I'm not all that familiar with SELinux under the hood, but I'm
wondering if a different approach to things like the policy database
might be appropriate. When the policy is changed, rather than
allocating rare-write memory and writing to it, what if we instead
allocated normal memory, wrote to it, write-protected it, and then
used the rare-write infrastructure to do a much smaller write to
replace the pointer?
Admittedly, this creates a window where another core could corrupt the
data as it's being written. That may not matter so much if an
attacker can't force a policy update. Alternatively, the update code
could re-verify the policy after write-protecting it, or there could
be a fancy API to allocate some temporarily-writable memory (by
creating a whole new mm_struct, mapping the memory writable just in
that mm_struct, and activating it) so that only the actual policy
loader could touch the memory. But I'm mostly speculating here, since
I'm not familiar with the code in question.
Anyway, I tend to think that the right way to approach mainlining all
this is to first get the basic rare write support for static data into
place and then to build on that. I think it's great that you're
pushing this effort, but doing this for SELinux and IMA is a bigger
project than doing it for static data, and it might make sense to do
it in bite-sized pieces.
Does any of this make sense?
On 13/11/2018 20:35, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:26 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
[...]
>> The high level API could be something like:
>>
>> wr_memcpy(void *src, void *dst, uint_t size)
[...]
> If you call a wr_memcpy() function with the signature you suggested,
> then you can overwrite any memory of this type. Having a different
> mm_struct under the hood makes no difference. As far as I'm
> concerned, for *dynamically allocated* rare-writable memory, you might
> as well allocate all the paging structures at the same time, so the
> mm_struct will always contain the mappings. If there are serious bugs
> in wr_memcpy() that cause it to write to the wrong place, we have
> bigger problems.

Beside bugs, I'm also thinking about possible vulnerabilities.
It might be overthinking, though.

I do not think it's possible to really protect against control flow
attacks, unless there is some support from the HW and/or the compiler.

What is left are data-based attacks. In this case, it would be an
attacker using one existing wr_ invocation with doctored parameters.

However, there is always the objection that it would be possible to
come up with a "writing kit" for plowing through the page tables and
unprotecting anything that might be of value.

Ideally, that should be the only type of data-based attack left.
In practice, it might just be an excess of paranoia from my side.

> I can imagine that we'd want a *typed* wr_memcpy()-like API some day,
> but that can wait for v2. And it still doesn't obviously need
> multiple mm_structs.

I left that out, for now, but yes, typing would play some role here.

[...]

> I think it's entirely reasonable for the API to internally break up
> very large memcpys.

The criteria for deciding if/how to break it down are not clear to me,
though. The single page was easy, but might be (probably is) too much.

--
igor
On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
[...]
>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>
> I don't think this matters much. The other cores will only be able to
> use that mapping when they're doing a rare write.

Which has now been promoted to target for attacks :-)

--
igor
On 13/11/2018 20:48, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:31 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>
>>> For general rare-writish stuff, I don't think we want IRQs running
>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>
>> Why would these be less sensitive?
>
> I'm not really saying they're less sensitive so much as that the
> considerations are different. I think the original rare-write code is
> based on ideas from grsecurity, and it was intended to protect static
> data like structs full of function pointers. Those targets have some
> different properties:
>
> - Static targets are at addresses that are much more guessable, so
> they're easier targets for most attacks. (Not spraying attacks like
> the ones you're interested in, though.)
>
> - Static targets are higher value. No offense to IMA or AVC, but
> outright execution of shellcode, hijacking of control flow, or complete
> disablement of core security features is higher impact than bypassing
> SELinux or IMA. Why would you bother corrupting the AVC if you could
> instead just set enforcing=0? (I suppose that corrupting the AVC is
> less likely to be noticed by monitoring tools.)
>
> - Static targets are small. This means that the interrupt latency
> would be negligible, especially in comparison to the latency of
> replacing the entire SELinux policy object.

Your analysis is correct. In my case, having already taken care of
those, I was going *also* after the next target in line.

Admittedly, flipping a bit located at a fixed offset is way easier than
spraying dynamically allocated data structures. However, once the bit
is not easily writable, the only options are to either find another way
to flip it (unprotect it or subvert something that can write it) or to
identify another target that is still writable. AVC and policyDB fit
the latter description.

> Anyway, I'm not all that familiar with SELinux under the hood, but I'm
> wondering if a different approach to things like the policy database
> might be appropriate. When the policy is changed, rather than
> allocating rare-write memory and writing to it, what if we instead
> allocated normal memory, wrote to it, write-protected it, and then
> used the rare-write infrastructure to do a much smaller write to
> replace the pointer?

Actually, that's exactly what I did. I did not want to overload this
discussion, but since you brought it up, I'm not sure write rare is
enough:

* write_rare is for stuff that changes, sometimes all the time, ex: AVC
* dynamic read only is for stuff that at some point should not be
  modified anymore, but could still be destroyed. Ex: policyDB

I think it would be good to differentiate, at runtime, between the two,
to minimize the chance that a write_rare function is used against some
read_only data.

Releasing dynamically allocated protected memory is also a big topic.
In some cases it's allocated and released continuously, like in the
AVC. Maybe it can be optimized, or maybe it can be turned into an
object cache of protected objects.

But for releasing, it would be good, I think, to have a mechanism for
freeing all the memory in one loop, like having a pool containing all
the memory that was allocated for a specific use (ex: policyDB)

> Admittedly, this creates a window where another core could corrupt the
> data as it's being written. That may not matter so much if an
> attacker can't force a policy update. Alternatively, the update code
> could re-verify the policy after write-protecting it, or there could
> be a fancy API to allocate some temporarily-writable memory (by
> creating a whole new mm_struct, mapping the memory writable just in
> that mm_struct, and activating it) so that only the actual policy
> loader could touch the memory. But I'm mostly speculating here, since
> I'm not familiar with the code in question.

They are all corner cases ... possible but unlikely.

Another, maybe more critical, one is that the policyDB is not available
at boot. There is a window of opportunity, before it's loaded. But it
could be mitigated by loading a barebone set of rules, either from
initrd or even as "firmware".

> Anyway, I tend to think that the right way to approach mainlining all
> this is to first get the basic rare write support for static data into
> place and then to build on that. I think it's great that you're
> pushing this effort, but doing this for SELinux and IMA is a bigger
> project than doing it for static data, and it might make sense to do
> it in bite-sized pieces.
>
> Does any of this make sense?

Yes, sure. I *have* to do SELinux, but I do not necessarily have to
wait for the final version to be merged upstream. And anyway Android is
on a different kernel.

However, I think both SELinux and IMA have a value in being
sufficiently complex cases to be used for validating the design as it
evolves.

Each of them has static data that could be the first target for
protection, in a smaller patch. Lists of write rare data are probably
the next big thing, in terms of defining the API. But I could start
with introducing __wr_after_init.

--
igor
Hi,
On 13/11/2018 20:36, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>
>> I forgot one sentence :-(
>>
>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>
>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>>
>>> Why would these be less sensitive?
>>>
>>> But I see a big difference between my initial implementation and this one.
>>>
>>> In my case, by using a shared mapping, visible to all cores, freezing
>>> the core that is performing the write would have exposed the writable
>>> mapping to a potential attack run from another core.
>>>
>>> If the mapping is private to the core performing the write, even if it
>>> is frozen, it's much harder to figure out what it had mapped and where,
>>> from another core.
>>>
>>> To access that mapping, the attack should be performed from the ISR, I
>>> think.
>>
>> Unless the secondary mapping is also available to other cores, through
>> the shared mm_struct ?
>>
>
> I don't think this matters much. The other cores will only be able to
> use that mapping when they're doing a rare write.
I'm still mulling over this.
There might be other reasons for replicating the mm_struct.
If I understand correctly how the text patching works, it happens
sequentially, because of the text_mutex used by arch_jump_label_transform
Which might be fine for this specific case, but I think I shouldn't
introduce a global mutex, when it comes to data.
Most likely, if two or more cores want to perform a write rare
operation, there is no correlation between them, they could proceed in
parallel. And if there really is, then the user of the API should
introduce own locking, for that specific case.
A bit unrelated question, related to text patching: I see that each
patching operation is validated, but wouldn't it be more robust to first
validate all of them, and only after they are all found to be
compliant, to proceed with the actual modifications?
And about the actual implementation of the write rare for the statically
allocated variables, is it expected that I use Nadav's function?
Or that I refactor the code?
The name, referring to text, would definitely not be ok for data.
And I would have to also generalize it, to deal with larger amounts of data.
I would find it easier, as first cut, to replicate its behavior and
refactor only later, once it has stabilized and possibly Nadav's patches
have been acked.
--
igor
> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hi,
>
> On 13/11/2018 20:36, Andy Lutomirski wrote:
>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> I forgot one sentence :-(
>>>
>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>>
>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>>>
>>>> Why would these be less sensitive?
>>>>
>>>> But I see a big difference between my initial implementation and this one.
>>>>
>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>> the core that is performing the write would have exposed the writable
>>>> mapping to a potential attack run from another core.
>>>>
>>>> If the mapping is private to the core performing the write, even if it
>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>> from another core.
>>>>
>>>> To access that mapping, the attack should be performed from the ISR, I
>>>> think.
>>>
>>> Unless the secondary mapping is also available to other cores, through
>>> the shared mm_struct ?
>> I don't think this matters much. The other cores will only be able to
>> use that mapping when they're doing a rare write.
>
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
>
> If I understand correctly how the text patching works, it happens
> sequentially, because of the text_mutex used by arch_jump_label_transform
>
> Which might be fine for this specific case, but I think I shouldn't
> introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare
> operation, there is no correlation between them, they could proceed in
> parallel. And if there really is, then the user of the API should
> introduce own locking, for that specific case.

I think that if you create per-CPU temporary mms as you proposed, you do
not need a global lock. You would need to protect against module
unloading, and maybe refactor the code. Specifically, I’m not sure
whether protection against IRQs is something that you need. I’m also
not familiar with this patch-set so I’m not sure what atomicity
guarantees you need.

> A bit unrelated question, related to text patching: I see that each
> patching operation is validated, but wouldn't it be more robust to
> first validate all of them, and only after they are all found to be
> compliant, to proceed with the actual modifications?
>
> And about the actual implementation of the write rare for the
> statically allocated variables, is it expected that I use Nadav's
> function?

It’s not “my” function. ;-)

IMHO the code is in relatively good and stable state. The last couple of
versions were due to me being afraid to add BUG_ONs as Peter asked me to.

The code is rather simple, but there are a couple of pitfalls that
hopefully I avoided correctly.

Regards,
Nadav
On 21/11/2018 19:36, Nadav Amit wrote:
>> On Nov 21, 2018, at 8:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
[...]
>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens
>> sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't
>> introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare
>> operation, there is no correlation between them, they could proceed in
>> parallel. And if there really is, then the user of the API should
>> introduce own locking, for that specific case.
>
> I think that if you create per-CPU temporary mms as you proposed, you do not
> need a global lock. You would need to protect against module unloading,

Yes. It's unlikely to happen, and probably a bug in the module, if it
tries to write while being unloaded, but I can do it.

> and
> maybe refactor the code. Specifically, I’m not sure whether protection
> against IRQs is something that you need.

With the initial way I used to do write rare, which was done by creating
a temporary mapping visible to every core, disabling IRQs was meant to
prevent the "writer" core from being frozen and the mappings then being
scrubbed for the page left in writable state.

Without a shared mapping of the page, the only way to attack it should
be to generate an interrupt on the "writer" core, while the writing is
ongoing, and to perform the attack from the interrupt itself, because it
is on the same core that has the writable mapping.

Maybe it's possible, but it seems to have become quite a corner case.

> I’m also not familiar with this
> patch-set so I’m not sure what atomicity guarantees do you need.

At the very least, I think I need to ensure that pointers are updated
atomically, like with WRITE_ONCE(). And spinlocks.
Maybe atomic types can be left out.

>> A bit unrelated question, related to text patching: I see that each
>> patching operation is validated, but wouldn't it be more robust to
>> first validate all of them, and only after they are all found to be
>> compliant, to proceed with the actual modifications?
>>
>> And about the actual implementation of the write rare for the
>> statically allocated variables, is it expected that I use Nadav's
>> function?
>
> It’s not “my” function. ;-)

:-P

Ok, what I meant is that the signature of the __text_poke() function is
quite specific to what it's meant to do.

I do not rule out that it might be eventually refactored as a special
case of a more generic __write_rare() function, that would operate on
different targets, but I'd rather do the refactoring after I have a
clear understanding of how to alter write-protected data.
The refactoring could be the last patch of the write rare patchset.

> IMHO the code is in relatively good and stable state. The last couple of
> versions were due to me being afraid to add BUG_ONs as Peter asked me to.
>
> The code is rather simple, but there are a couple of pitfalls that hopefully
> I avoided correctly.

Yes, I did not mean to question the quality of the code, but I'd prefer
not to have to carry also this patchset, while it's not yet merged.
I actually hope it gets merged soon :-)

--
igor
> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
> Hi,
>
>> On 13/11/2018 20:36, Andy Lutomirski wrote:
>>> On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>>
>>> I forgot one sentence :-(
>>>
>>>> On 13/11/2018 20:31, Igor Stoppa wrote:
>>>>> On 13/11/2018 19:47, Andy Lutomirski wrote:
>>>>>
>>>>>> For general rare-writish stuff, I don't think we want IRQs running
>>>>>> with them mapped anywhere for write. For AVC and IMA, I'm less sure.
>>>>>
>>>>> Why would these be less sensitive?
>>>>>
>>>>> But I see a big difference between my initial implementation and this one.
>>>>>
>>>>> In my case, by using a shared mapping, visible to all cores, freezing
>>>>> the core that is performing the write would have exposed the writable
>>>>> mapping to a potential attack run from another core.
>>>>>
>>>>> If the mapping is private to the core performing the write, even if it
>>>>> is frozen, it's much harder to figure out what it had mapped and where,
>>>>> from another core.
>>>>>
>>>>> To access that mapping, the attack should be performed from the ISR, I
>>>>> think.
>>>>
>>>> Unless the secondary mapping is also available to other cores, through
>>>> the shared mm_struct ?
>>> I don't think this matters much. The other cores will only be able to
>>> use that mapping when they're doing a rare write.
>
> I'm still mulling over this.
> There might be other reasons for replicating the mm_struct.
>
> If I understand correctly how the text patching works, it happens
> sequentially, because of the text_mutex used by arch_jump_label_transform
>
> Which might be fine for this specific case, but I think I shouldn't
> introduce a global mutex, when it comes to data.
> Most likely, if two or more cores want to perform a write rare
> operation, there is no correlation between them, they could proceed in
> parallel. And if there really is, then the user of the API should
> introduce own locking, for that specific case.

Text patching uses the same VA for different physical addresses, so it
needs a mutex to avoid conflicts. I think that, for rare writes, you
should just map each rare-writable address at a *different* VA. You’ll
still need a mutex (mmap_sem) to synchronize allocation and freeing of
rare-writable ranges, but that shouldn’t have much contention.

> A bit unrelated question, related to text patching: I see that each
> patching operation is validated, but wouldn't it be more robust to
> first validate all of them, and only after they are all found to be
> compliant, to proceed with the actual modifications?
>
> And about the actual implementation of the write rare for the
> statically allocated variables, is it expected that I use Nadav's
> function?
> Or that I refactor the code?

I would either refactor it or create a new function to handle the
write. The main thing that Nadav is adding that I think you’ll want to
use is the infrastructure for temporarily switching mms from a
non-kernel-thread context.

> The name, referring to text, would definitely not be ok for data.
> And I would have to also generalize it, to deal with larger amounts of data.
>
> I would find it easier, as first cut, to replicate its behavior and
> refactor only later, once it has stabilized and possibly Nadav's
> patches have been acked.

Sounds reasonable. You’ll still want Nadav’s code for setting up the mm
in the first place, though.

> --
> igor
On 21/11/2018 20:15, Andy Lutomirski wrote:
>> On Nov 21, 2018, at 9:34 AM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
[...]
>> There might be other reasons for replicating the mm_struct.
>>
>> If I understand correctly how the text patching works, it happens
>> sequentially, because of the text_mutex used by arch_jump_label_transform
>>
>> Which might be fine for this specific case, but I think I shouldn't
>> introduce a global mutex, when it comes to data.
>> Most likely, if two or more cores want to perform a write rare
>> operation, there is no correlation between them, they could proceed in
>> parallel. And if there really is, then the user of the API should
>> introduce own locking, for that specific case.
>
> Text patching uses the same VA for different physical addresses, so it
> needs a mutex to avoid conflicts. I think that, for rare writes, you
> should just map each rare-writable address at a *different* VA. You’ll
> still need a mutex (mmap_sem) to synchronize allocation and freeing of
> rare-writable ranges, but that shouldn’t have much contention.

I have studied the code involved with Nadav's patchset.
I am perplexed about these sentences you wrote.

More to the point (to the best of my understanding):

poking_init()
-------------
1. it gets one random poking address and ensures it has at least 2
consecutive PTEs from the same PMD
2. it then proceeds to map/unmap an address from the first of the 2
consecutive PTEs, so that, later on, there will be no need to allocate
pages, which might fail, if poking from atomic context.
3. at this point, the page tables are populated, for the address that
was obtained at point 1, and this is ok, because the address is fixed

write_rare
----------
4. it can happen on any available core / thread at any time, therefore
each of them needs a different address
5. CPUs support hotplug, but from what I have read, I might get away
with having up to nr_cpus different addresses (determined at init) and I
would follow the same technique used by Nadav, of forcing the mapping of
1 or 2 (1 could be enough, I have to loop anyway, at some point) pages
at each address, to ensure the population of the page tables

so far, so good, but ...

6. the addresses used by each CPU are fixed
7. I do not understand the reference you make to "allocation and
freeing of rare-writable ranges", because if I treat the range as such,
then there is a risk that I need to populate more entries in the page
table, which would have problems with the atomic context, unless
write_rare from atomic is ruled out. If write_rare from atomic can be
avoided, then I can also have one-use randomized addresses at each
write-rare operation, instead of fixed ones, like in point 6.

and, apologies for being dense: the following is still not clear to me:

8. you wrote:
> You’ll still need a mutex (mmap_sem) to synchronize allocation
> and freeing of rare-writable ranges, but that shouldn’t have much
> contention.

What causes the contention? It's the fact that the various cores are
using the same mm, if I understood correctly. However, if there was one
mm for each core, wouldn't that make it unnecessary to have any mutex?

I feel there must be some obvious reason why multiple mms are not a
good idea, yet I cannot grasp it :-(

switch_mm_irqs_off() seems to have lots of references to
"this_cpu_something"; if there is any optimization from having the same
next across multiple cores, I'm missing it

[...]

> I would either refactor it or create a new function to handle the
> write. The main thing that Nadav is adding that I think you’ll want to
> use is the infrastructure for temporarily switching mms from a
> non-kernel-thread context.

yes

[...]

> You’ll still want Nadav’s code for setting up the mm in the first
> place, though.

yes

--
thanks, igor
On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> I have studied the code involved with Nadav's patchset.
> I am perplexed about these sentences you wrote.
>
> More to the point (to the best of my understanding):
>
> poking_init()
> -------------
> 1. it gets one random poking address and ensures to have at least 2
> consecutive PTEs from the same PMD
> 2. it then proceeds to map/unmap an address from the first of the 2
> consecutive PTEs, so that, later on, there will be no need to
> allocate pages, which might fail, if poking from atomic context.
> 3. at this point, the page tables are populated, for the address that
> was obtained at point 1, and this is ok, because the address is fixed
>
> write_rare
> ----------
> 4. it can happen on any available core / thread at any time, therefore
> each of them needs a different address
No? Each CPU has its own CR3 (eg each CPU might be running a different
user task). If you have _one_ address for each allocation, it may or
may not be mapped on other CPUs at the same time -- you simply don't care.
The writable address can even be a simple formula to calculate from
the read-only address, you don't have to allocate an address in the
writable mapping space.
On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
> > I have studied the code involved with Nadav's patchset.
> > I am perplexed about these sentences you wrote.
> >
> > More to the point (to the best of my understanding):
> >
> > poking_init()
> > -------------
> > 1. it gets one random poking address and ensures to have at least 2
> > consecutive PTEs from the same PMD
> > 2. it then proceeds to map/unmap an address from the first of the 2
> > consecutive PTEs, so that, later on, there will be no need to
> > allocate pages, which might fail, if poking from atomic context.
> > 3. at this point, the page tables are populated, for the address that
> > was obtained at point 1, and this is ok, because the address is fixed
> >
> > write_rare
> > ----------
> > 4. it can happen on any available core / thread at any time, therefore
> > each of them needs a different address
>
> No? Each CPU has its own CR3 (eg each CPU might be running a different
> user task). If you have _one_ address for each allocation, it may or
> may not be mapped on other CPUs at the same time -- you simply don't care.
>
> The writable address can even be a simple formula to calculate from
> the read-only address, you don't have to allocate an address in the
> writable mapping space.
>
Agreed. I suggest the formula:

writable_address = readable_address - rare_write_offset

For starters, rare_write_offset can just be a constant. If we want to
get fancy later on, it can be randomized.
If we do it like this, then we don't need to modify any pagetables at
all when we do a rare write. Instead we can set up the mapping at
boot or when we allocate the rare write space, and the actual rare
write code can just switch mms and do the write.
Hello,

apologies for the delayed answer. Please find my reply to both last
mails in the thread, below.

On 22/11/2018 22:53, Andy Lutomirski wrote:
> On Thu, Nov 22, 2018 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Nov 22, 2018 at 09:27:02PM +0200, Igor Stoppa wrote:
>>> I have studied the code involved with Nadav's patchset.
>>> I am perplexed about these sentences you wrote.
>>>
>>> More to the point (to the best of my understanding):
>>>
>>> poking_init()
>>> -------------
>>> 1. it gets one random poking address and ensures to have at least 2
>>> consecutive PTEs from the same PMD
>>> 2. it then proceeds to map/unmap an address from the first of the 2
>>> consecutive PTEs, so that, later on, there will be no need to
>>> allocate pages, which might fail, if poking from atomic context.
>>> 3. at this point, the page tables are populated, for the address that
>>> was obtained at point 1, and this is ok, because the address is fixed
>>>
>>> write_rare
>>> ----------
>>> 4. it can happen on any available core / thread at any time, therefore
>>> each of them needs a different address
>>
>> No? Each CPU has its own CR3 (eg each CPU might be running a different
>> user task). If you have _one_ address for each allocation, it may or
>> may not be mapped on other CPUs at the same time -- you simply don't care.

Yes, somehow I lost track of that aspect.

>> The writable address can even be a simple formula to calculate from
>> the read-only address, you don't have to allocate an address in the
>> writable mapping space.
>
> Agreed. I suggest the formula:
>
> writable_address = readable_address - rare_write_offset. For
> starters, rare_write_offset can just be a constant. If we want to get
> fancy later on, it can be randomized.

ok, I hope I captured it here [1]

> If we do it like this, then we don't need to modify any pagetables at
> all when we do a rare write. Instead we can set up the mapping at
> boot or when we allocate the rare write space, and the actual rare
> write code can just switch mms and do the write.

I did it.

I have little feeling about the actual amount of data involved, but
there is a (probably very remote) chance that the remap wouldn't work,
at least in the current implementation.

It's a bit different from what I had in mind initially, since I was
thinking to have one single approach for both statically allocated
memory (is there a better way to describe it?) and what is provided
from the allocator that would come next.

As I wrote, I do not particularly like the way I implemented multiple
functionality vs remapping, but I couldn't figure out any better way
to do it, so eventually I kept this one, hoping to get some advice on
how to improve it.

I have not provided an example yet, but IMA has some flags that are
probably very suitable, since they depend on policy reloading, which
can happen multiple times, but could be used to disable it.

[1] https://www.openwall.com/lists/kernel-hardening/2018/12/04/3

--
igor
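Putting the two pieces together, the resulting rare-write primitive can be summarized as kernel-style pseudocode. This is a non-runnable sketch: `use_temporary_mm`/`unuse_temporary_mm` approximate the helpers from Nadav's patchset, and `wr_poking_mm` and `rare_write_offset` are placeholder names, not code from either patchset.

```c
/*
 * Pseudocode sketch of a wr_memcpy() built on the fixed-offset alias
 * plus Nadav's temporary-mm switching. Not actual kernel code; helper
 * names approximate the patchset under discussion.
 */
static void *wr_memcpy(void *dst, const void *src, size_t n)
{
	temporary_mm_state_t state;

	local_irq_disable();
	/* Switch to the mm that holds the writable aliases. */
	state = use_temporary_mm(wr_poking_mm);
	/* No page-table changes here: just store through the alias. */
	memcpy(dst - rare_write_offset, src, n);
	/* Restore the previous mm; the alias is unreachable again. */
	unuse_temporary_mm(state);
	local_irq_enable();
	return dst;
}
```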