* [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages
@ 2018-12-12  0:03 Rick Edgecombe
  2018-12-12  0:03 ` [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Rick Edgecombe @ 2018-12-12  0:03 UTC
  To: akpm, luto, will.deacon, linux-mm, linux-kernel,
	kernel-hardening, naveen.n.rao, anil.s.keshavamurthy, davem,
	mhiramat, rostedt, mingo, ast, daniel, jeyu, namit, netdev,
	ard.biesheuvel, jannh
  Cc: kristen, dave.hansen, deneen.t.dock, Rick Edgecombe

Sometimes when memory is freed via the module subsystem, a TLB entry with
executable permissions can remain pointing to the freed page. If the page is
re-used to back an address that will receive data from userspace, user data
can end up mapped executable in the kernel. The root cause is that vfree
flushes the TLB lazily, but frees the underlying pages immediately.
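
A minimal sketch of the problematic sequence (illustrative only, not the
literal call chain, which runs through module_memfree()):

	void *p = vmalloc_exec(PAGE_SIZE);	/* executable vmalloc mapping */

	/* code is copied in and run, so executable TLB entries get cached */

	vfree(p);	/* pages are freed immediately, but the TLB flush of
			 * the vmalloc mapping is deferred */

	/* the same physical pages can now back e.g. a buffer filled from
	 * userspace, while a stale executable TLB entry on some CPU still
	 * points at them */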

This v2 teaches vfree itself to handle freeing memory with special
permissions. Doing it there means the free can happen with no W^X window, the
logic for this operation is centralized, and on x86 it takes only one TLB
flush.

I'm not sure whether the algorithm Andy Lutomirski suggested (doing the whole
teardown with one TLB flush) will work on other architectures, so in this
version it lives in an x86 arch breakout (arch_vunmap). The default
arch_vunmap implementation does what Nadav is proposing users of module_alloc
do on teardown, so behavior should be unchanged, just centralized. The main
difference is that BPF teardown will now get an extra TLB flush on
architectures that define set_memory_*, from the set_memory_nx call that now
happens in addition to set_memory_rw. On x86, thanks to the more efficient
arch version, it stays at one flush.
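
In sketch form, the x86 ordering (using the helpers added in patch 4) is:

	set_area_direct_np(area);	/* direct map -> not present, no flush */
	vm_unmap_aliases();		/* the single TLB flush */
	set_area_direct_prw(area);	/* direct map -> present + RW; no flush
					 * needed, since NP kept the entries
					 * from being cached */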

The logic enabling this behavior is plugged into the cross-arch pieces of
kernel/module.c and BPF. So it is enabled on all architectures for regular
.ko modules and BPF, while the other module_alloc users are left unchanged
for now.

I did find one small downside with this approach: there is occasionally one
extra direct map page split on module teardown, since one of the module
subsections is RW. The x86 arch_vunmap sets the direct map of the pages not
present, since it doesn't know that the whole allocation is non-executable,
so this sometimes splits an extra large page because that paging structure
gets its first special permission. On the plus side, many TLB flushes are
reduced to one (on x86 here, and likely other architectures in the future).
The other module_alloc users (BPF, etc.) will not have RW subsections, so
their splits will not increase. So I think this is not a big downside for a
few modules, compared to reducing TLB flushes, removing stale executable TLB
entries, and simplifying the code.

Todo:
 - Merge with Nadav Amit's patchset
 - Test on x86 32 bit with highmem
 - Plug into ftrace and kprobes implementations in Nadav's next version of his
   patchset

Changes since v1:
 - New efficient algorithm on x86 for tearing down executable RO memory and
   flag for this (Andy Lutomirski)
 - Have no W^X violating window on tear down (Nadav Amit)


Rick Edgecombe (4):
  vmalloc: New flags for safe vfree on special perms
  modules: Add new special vfree flags
  bpf: switch to new vmalloc vfree flags
  x86/vmalloc: Add TLB efficient x86 arch_vunmap

 arch/x86/include/asm/set_memory.h |  2 +
 arch/x86/mm/Makefile              |  3 +-
 arch/x86/mm/pageattr.c            | 11 +++--
 arch/x86/mm/vmalloc.c             | 71 ++++++++++++++++++++++++++++++
 include/linux/filter.h            | 26 +++++------
 include/linux/vmalloc.h           |  2 +
 kernel/bpf/core.c                 |  1 -
 kernel/module.c                   | 43 +++++-------------
 mm/vmalloc.c                      | 73 ++++++++++++++++++++++++++++---
 9 files changed, 173 insertions(+), 59 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c

-- 
2.17.1


* [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12  0:03 [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages Rick Edgecombe
@ 2018-12-12  0:03 ` Rick Edgecombe
  2018-12-12  2:20   ` Andy Lutomirski
  2018-12-12  0:03 ` [PATCH v2 2/4] modules: Add new special vfree flags Rick Edgecombe
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Rick Edgecombe @ 2018-12-12  0:03 UTC
  To: akpm, luto, will.deacon, linux-mm, linux-kernel,
	kernel-hardening, naveen.n.rao, anil.s.keshavamurthy, davem,
	mhiramat, rostedt, mingo, ast, daniel, jeyu, namit, netdev,
	ard.biesheuvel, jannh
  Cc: kristen, dave.hansen, deneen.t.dock, Rick Edgecombe

This adds two new flags, VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, that
enable vfree to immediately clear executable TLB entries for freed pages, and
to handle freeing memory with special permissions.

In order to support vfree being called on memory that might be RO, the vfree
deferred list node is moved into a kmalloc-allocated struct, instead of being
placed inside the allocation being freed, as it is today.

arch_vunmap is a new __weak function that implements the actual unmapping and
the resetting of the direct map permissions. It can be overridden by more
efficient architecture-specific implementations.

The default implementation uses architecture-agnostic methods equivalent to
what most callers do before vfree today, so that logic is now just
centralized here.
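
As a rough sketch, instead of each caller open-coding something like this
before its vfree (npages here just stands for the allocation's page count):

	set_memory_nx((unsigned long)addr, npages);	/* if it was executable */
	set_memory_rw((unsigned long)addr, npages);	/* if it was read-only */
	vfree(addr);

the set_memory_* calls now happen inside vfree via arch_vunmap, keyed off the
new VM_* flags, and ordered NX before RW so there is no W^X window.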

This implementation derives from two sketches from Dave Hansen and Andy
Lutomirski.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/vmalloc.h |  2 ++
 mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..872bcde17aca 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -21,6 +21,8 @@ struct notifier_block;		/* in notifier.h */
 #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
 #define VM_NO_GUARD		0x00000040      /* don't add guard page */
 #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
+#define VM_IMMEDIATE_UNMAP	0x00000200	/* flush before releasing pages */
+#define VM_HAS_SPECIAL_PERMS	0x00000400	/* may be freed with special perms */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97d4b25d0373..02b284d2245a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
 #include <linux/interrupt.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/set_memory.h>
 #include <linux/debugobjects.h>
 #include <linux/kallsyms.h>
 #include <linux/list.h>
@@ -38,6 +39,11 @@
 
 #include "internal.h"
 
+struct vfree_work {
+	struct llist_node node;
+	void *addr;
+};
+
 struct vfree_deferred {
 	struct llist_head list;
 	struct work_struct wq;
@@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
 {
 	struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
 	struct llist_node *t, *llnode;
+	struct vfree_work *cur;
 
-	llist_for_each_safe(llnode, t, llist_del_all(&p->list))
-		__vunmap((void *)llnode, 1);
+	llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
+		cur = container_of(llnode, struct vfree_work, node);
+		__vunmap(cur->addr, 1);
+		kfree(cur);
+	}
 }
 
 /*** Page table manipulation functions ***/
@@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
 	return NULL;
 }
 
+/*
+ * This function handles unmapping and resetting the direct map as efficiently
+ * as it can with cross arch functions. The three categories of architectures
+ * are:
+ *   1. Architectures with no set_memory implementations and no direct map
+ *      permissions.
+ *   2. Architectures with set_memory implementations but no direct map
+ *      permissions
+ *   3. Architectures with set_memory implementations and direct map permissions
+ */
+void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
+{
+	unsigned long addr = (unsigned long)area->addr;
+	int immediate = area->flags & VM_IMMEDIATE_UNMAP;
+	int special = area->flags & VM_HAS_SPECIAL_PERMS;
+
+	/*
+	 * In case of 2 and 3, use this general way of resetting the permissions
+	 * on the directmap. Do NX before RW, in case of X, so there is no W^X
+	 * violation window.
+	 *
+	 * For case 1 these will be noops.
+	 */
+	if (immediate)
+		set_memory_nx(addr, area->nr_pages);
+	if (deallocate_pages && special)
+		set_memory_rw(addr, area->nr_pages);
+
+	/* Always actually remove the area */
+	remove_vm_area(area->addr);
+
+	/*
+	 * Need to flush the TLB before freeing pages in the case of this flag.
+	 * As long as that's happening, unmap aliases.
+	 *
+	 * For 2 and 3, this will not be needed because of the set_memory_nx
+	 * above, because the stale TLBs will be NX.
+	 */
+	if (immediate && !IS_ENABLED(CONFIG_ARCH_HAS_SET_MEMORY))
+		vm_unmap_aliases();
+}
+
 static void __vunmap(const void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
@@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
 	debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
 	debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
 
-	remove_vm_area(addr);
+	arch_vunmap(area, deallocate_pages);
+
 	if (deallocate_pages) {
 		int i;
 
@@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void *addr)
 	 * nother cpu's list.  schedule_work() should be fine with this too.
 	 */
 	struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
+	struct vfree_work *w = kmalloc(sizeof(struct vfree_work), GFP_ATOMIC);
+
+	/* If no memory for the deferred list node, give up */
+	if (!w)
+		return;
 
-	if (llist_add((struct llist_node *)addr, &p->list))
+	w->addr = (void *)addr;
+
+	if (llist_add(&w->node, &p->list))
 		schedule_work(&p->wq);
 }
 
@@ -1925,8 +1985,9 @@ EXPORT_SYMBOL(vzalloc_node);
 
 void *vmalloc_exec(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
-			      NUMA_NO_NODE, __builtin_return_address(0));
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+			NUMA_NO_NODE, __builtin_return_address(0));
 }
 
 #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
-- 
2.17.1


* [PATCH v2 2/4] modules: Add new special vfree flags
  2018-12-12  0:03 [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages Rick Edgecombe
  2018-12-12  0:03 ` [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
@ 2018-12-12  0:03 ` Rick Edgecombe
  2018-12-12 23:40   ` Nadav Amit
  2018-12-12  0:03 ` [PATCH v2 3/4] bpf: switch to new vmalloc " Rick Edgecombe
  2018-12-12  0:03 ` [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap Rick Edgecombe
  3 siblings, 1 reply; 26+ messages in thread
From: Rick Edgecombe @ 2018-12-12  0:03 UTC
  To: akpm, luto, will.deacon, linux-mm, linux-kernel,
	kernel-hardening, naveen.n.rao, anil.s.keshavamurthy, davem,
	mhiramat, rostedt, mingo, ast, daniel, jeyu, namit, netdev,
	ard.biesheuvel, jannh
  Cc: kristen, dave.hansen, deneen.t.dock, Rick Edgecombe

Set the new vmalloc flags for handling the freeing of special-permissioned
memory, and remove the places in module.c where that handling was done by
hand.

This enables the behavior for all architectures.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 kernel/module.c | 43 ++++++++++++-------------------------------
 1 file changed, 12 insertions(+), 31 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..910f92b402f8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1941,11 +1941,23 @@ void module_disable_ro(const struct module *mod)
 	frob_rodata(&mod->init_layout, set_memory_rw);
 }
 
+static void module_set_vm_flags(const struct module_layout *layout)
+{
+	struct vm_struct *vm = find_vm_area(layout->base);
+
+	if (vm) {
+		vm->flags |= VM_HAS_SPECIAL_PERMS;
+		vm->flags |= VM_IMMEDIATE_UNMAP;
+	}
+}
+
 void module_enable_ro(const struct module *mod, bool after_init)
 {
 	if (!rodata_enabled)
 		return;
 
+	module_set_vm_flags(&mod->core_layout);
+	module_set_vm_flags(&mod->init_layout);
 	frob_text(&mod->core_layout, set_memory_ro);
 	frob_rodata(&mod->core_layout, set_memory_ro);
 	frob_text(&mod->init_layout, set_memory_ro);
@@ -1964,15 +1976,6 @@ static void module_enable_nx(const struct module *mod)
 	frob_writable_data(&mod->init_layout, set_memory_nx);
 }
 
-static void module_disable_nx(const struct module *mod)
-{
-	frob_rodata(&mod->core_layout, set_memory_x);
-	frob_ro_after_init(&mod->core_layout, set_memory_x);
-	frob_writable_data(&mod->core_layout, set_memory_x);
-	frob_rodata(&mod->init_layout, set_memory_x);
-	frob_writable_data(&mod->init_layout, set_memory_x);
-}
-
 /* Iterate through all modules and set each module's text as RW */
 void set_all_modules_text_rw(void)
 {
@@ -2016,23 +2019,8 @@ void set_all_modules_text_ro(void)
 	}
 	mutex_unlock(&module_mutex);
 }
-
-static void disable_ro_nx(const struct module_layout *layout)
-{
-	if (rodata_enabled) {
-		frob_text(layout, set_memory_rw);
-		frob_rodata(layout, set_memory_rw);
-		frob_ro_after_init(layout, set_memory_rw);
-	}
-	frob_rodata(layout, set_memory_x);
-	frob_ro_after_init(layout, set_memory_x);
-	frob_writable_data(layout, set_memory_x);
-}
-
 #else
-static void disable_ro_nx(const struct module_layout *layout) { }
 static void module_enable_nx(const struct module *mod) { }
-static void module_disable_nx(const struct module *mod) { }
 #endif
 
 #ifdef CONFIG_LIVEPATCH
@@ -2163,7 +2151,6 @@ static void free_module(struct module *mod)
 	mutex_unlock(&module_mutex);
 
 	/* This may be empty, but that's OK */
-	disable_ro_nx(&mod->init_layout);
 	module_arch_freeing_init(mod);
 	module_memfree(mod->init_layout.base);
 	kfree(mod->args);
@@ -2173,7 +2160,6 @@ static void free_module(struct module *mod)
 	lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
 
 	/* Finally, free the core (containing the module structure) */
-	disable_ro_nx(&mod->core_layout);
 	module_memfree(mod->core_layout.base);
 }
 
@@ -3497,7 +3483,6 @@ static noinline int do_init_module(struct module *mod)
 #endif
 	module_enable_ro(mod, true);
 	mod_tree_remove_init(mod);
-	disable_ro_nx(&mod->init_layout);
 	module_arch_freeing_init(mod);
 	mod->init_layout.base = NULL;
 	mod->init_layout.size = 0;
@@ -3812,10 +3797,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	module_bug_cleanup(mod);
 	mutex_unlock(&module_mutex);
 
-	/* we can't deallocate the module until we clear memory protection */
-	module_disable_ro(mod);
-	module_disable_nx(mod);
-
  ddebug_cleanup:
 	ftrace_release_mod(mod);
 	dynamic_debug_remove(mod, info->debug);
-- 
2.17.1


* [PATCH v2 3/4] bpf: switch to new vmalloc vfree flags
  2018-12-12  0:03 [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages Rick Edgecombe
  2018-12-12  0:03 ` [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
  2018-12-12  0:03 ` [PATCH v2 2/4] modules: Add new special vfree flags Rick Edgecombe
@ 2018-12-12  0:03 ` Rick Edgecombe
  2018-12-12  0:03 ` [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap Rick Edgecombe
  3 siblings, 0 replies; 26+ messages in thread
From: Rick Edgecombe @ 2018-12-12  0:03 UTC
  To: akpm, luto, will.deacon, linux-mm, linux-kernel,
	kernel-hardening, naveen.n.rao, anil.s.keshavamurthy, davem,
	mhiramat, rostedt, mingo, ast, daniel, jeyu, namit, netdev,
	ard.biesheuvel, jannh
  Cc: kristen, dave.hansen, deneen.t.dock, Rick Edgecombe

This switches BPF to the new vmalloc flags to control freeing of memory with
special permissions.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/filter.h | 26 ++++++++++++--------------
 kernel/bpf/core.c      |  1 -
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 795ff0b869bb..2aeb93d3337d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,6 +20,7 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 #include <linux/if_vlan.h>
+#include <linux/vmalloc.h>
 
 #include <net/sch_generic.h>
 
@@ -487,7 +488,6 @@ struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
 				jit_requested:1,/* archs need to JIT the prog */
-				undo_set_mem:1,	/* Passed set_memory_ro() checkpoint */
 				gpl_compatible:1, /* Is filter GPL compatible? */
 				cb_access:1,	/* Is control block accessed? */
 				dst_needed:1,	/* Do we need dst entry? */
@@ -699,24 +699,23 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
 
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 {
-	fp->undo_set_mem = 1;
-	set_memory_ro((unsigned long)fp, fp->pages);
-}
+	struct vm_struct *vm = find_vm_area(fp);
 
-static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
-{
-	if (fp->undo_set_mem)
-		set_memory_rw((unsigned long)fp, fp->pages);
+	if (vm)
+		vm->flags |= VM_HAS_SPECIAL_PERMS;
+	set_memory_ro((unsigned long)fp, fp->pages);
 }
 
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
-	set_memory_ro((unsigned long)hdr, hdr->pages);
-}
+	struct vm_struct *vm = find_vm_area(hdr);
 
-static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
-{
-	set_memory_rw((unsigned long)hdr, hdr->pages);
+	if (vm) {
+		vm->flags |= VM_HAS_SPECIAL_PERMS;
+		vm->flags |= VM_IMMEDIATE_UNMAP;
+	}
+
+	set_memory_ro((unsigned long)hdr, hdr->pages);
 }
 
 static inline struct bpf_binary_header *
@@ -746,7 +745,6 @@ void __bpf_prog_free(struct bpf_prog *fp);
 
 static inline void bpf_prog_unlock_free(struct bpf_prog *fp)
 {
-	bpf_prog_unlock_ro(fp);
 	__bpf_prog_free(fp);
 }
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545d0ec8..bd3efd7ce526 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -663,7 +663,6 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
 	if (fp->jited) {
 		struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
 
-		bpf_jit_binary_unlock_ro(hdr);
 		bpf_jit_binary_free(hdr);
 
 		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
-- 
2.17.1


* [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap
  2018-12-12  0:03 [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages Rick Edgecombe
                   ` (2 preceding siblings ...)
  2018-12-12  0:03 ` [PATCH v2 3/4] bpf: switch to new vmalloc " Rick Edgecombe
@ 2018-12-12  0:03 ` Rick Edgecombe
  2018-12-12  2:24   ` Andy Lutomirski
  2018-12-12  6:30   ` Nadav Amit
  3 siblings, 2 replies; 26+ messages in thread
From: Rick Edgecombe @ 2018-12-12  0:03 UTC
  To: akpm, luto, will.deacon, linux-mm, linux-kernel,
	kernel-hardening, naveen.n.rao, anil.s.keshavamurthy, davem,
	mhiramat, rostedt, mingo, ast, daniel, jeyu, namit, netdev,
	ard.biesheuvel, jannh
  Cc: kristen, dave.hansen, deneen.t.dock, Rick Edgecombe

This adds a more efficient x86 architecture-specific implementation of
arch_vunmap that can free any type of special-permission memory with only one
TLB flush.

In order to enable this, __set_pages_p and __set_pages_np are made non-static
and renamed set_pages_p_noflush and set_pages_np_noflush, to better
distinguish their (non-flushing) behavior from the rest of the set_pages_*
functions.

The method for doing this with only 1 TLB flush was suggested by Andy
Lutomirski.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/set_memory.h |  2 +
 arch/x86/mm/Makefile              |  3 +-
 arch/x86/mm/pageattr.c            | 11 +++--
 arch/x86/mm/vmalloc.c             | 71 +++++++++++++++++++++++++++++++
 4 files changed, 80 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 07a25753e85c..70ee81e8914b 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -84,6 +84,8 @@ int set_pages_x(struct page *page, int numpages);
 int set_pages_nx(struct page *page, int numpages);
 int set_pages_ro(struct page *page, int numpages);
 int set_pages_rw(struct page *page, int numpages);
+int set_pages_np_noflush(struct page *page, int numpages);
+int set_pages_p_noflush(struct page *page, int numpages);
 
 extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..189681f863a6 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -13,7 +13,8 @@ CFLAGS_REMOVE_mem_encrypt_identity.o	= -pg
 endif
 
 obj-y	:=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o
+	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o \
+	    vmalloc.o
 
 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index db7a10082238..db0a4dfb5a7f 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -2248,9 +2248,7 @@ int set_pages_rw(struct page *page, int numpages)
 	return set_memory_rw(addr, numpages);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-
-static int __set_pages_p(struct page *page, int numpages)
+int set_pages_p_noflush(struct page *page, int numpages)
 {
 	unsigned long tempaddr = (unsigned long) page_address(page);
 	struct cpa_data cpa = { .vaddr = &tempaddr,
@@ -2269,7 +2267,7 @@ static int __set_pages_p(struct page *page, int numpages)
 	return __change_page_attr_set_clr(&cpa, 0);
 }
 
-static int __set_pages_np(struct page *page, int numpages)
+int set_pages_np_noflush(struct page *page, int numpages)
 {
 	unsigned long tempaddr = (unsigned long) page_address(page);
 	struct cpa_data cpa = { .vaddr = &tempaddr,
@@ -2288,6 +2286,7 @@ static int __set_pages_np(struct page *page, int numpages)
 	return __change_page_attr_set_clr(&cpa, 0);
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (PageHighMem(page))
@@ -2303,9 +2302,9 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 	 * and hence no memory allocations during large page split.
 	 */
 	if (enable)
-		__set_pages_p(page, numpages);
+		set_pages_p_noflush(page, numpages);
 	else
-		__set_pages_np(page, numpages);
+		set_pages_np_noflush(page, numpages);
 
 	/*
 	 * We should perform an IPI and flush all tlbs,
diff --git a/arch/x86/mm/vmalloc.c b/arch/x86/mm/vmalloc.c
new file mode 100644
index 000000000000..be9ea42c3dfe
--- /dev/null
+++ b/arch/x86/mm/vmalloc.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * vmalloc.c: x86 arch version of vmalloc.c
+ *
+ * (C) Copyright 2018 Intel Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/mm.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
+static void set_area_direct_np(struct vm_struct *area)
+{
+	int i;
+
+	for (i = 0; i < area->nr_pages; i++)
+		set_pages_np_noflush(area->pages[i], 1);
+}
+
+static void set_area_direct_prw(struct vm_struct *area)
+{
+	int i;
+
+	for (i = 0; i < area->nr_pages; i++)
+		set_pages_p_noflush(area->pages[i], 1);
+}
+
+void arch_vunmap(struct vm_struct *area, int deallocate_pages)
+{
+	int immediate = area->flags & VM_IMMEDIATE_UNMAP;
+	int special = area->flags & VM_HAS_SPECIAL_PERMS;
+
+	/* Unmap from vmalloc area */
+	remove_vm_area(area->addr);
+
+	/* If no need to reset directmap perms, just check if need to flush */
+	if (!(deallocate_pages || special)) {
+		if (immediate)
+			vm_unmap_aliases();
+		return;
+	}
+
+	/* From here we need to make sure to reset the direct map perms */
+
+	/*
+	 * If the area being freed does not have any extra capabilities, we can
+	 * just reset the directmap to RW before freeing.
+	 */
+	if (!immediate) {
+		set_area_direct_prw(area);
+		vm_unmap_aliases();
+		return;
+	}
+
+	/*
+	 * If the vm being freed has security sensitive capabilities such as
+	 * executable we need to make sure there is no W window on the directmap
+	 * before removing the X in the TLB. So we set not present first so we
+	 * can flush without any other CPU picking up the mapping. Then we reset
+	 * RW+P without a flush, since NP prevented it from being cached by
+	 * other cpus.
+	 */
+	set_area_direct_np(area);
+	vm_unmap_aliases();
+	set_area_direct_prw(area);
+}
-- 
2.17.1


* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12  0:03 ` [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
@ 2018-12-12  2:20   ` Andy Lutomirski
  2018-12-12 19:50     ` Edgecombe, Rick P
  2018-12-21 16:39     ` Ard Biesheuvel
  0 siblings, 2 replies; 26+ messages in thread
From: Andy Lutomirski @ 2018-12-12  2:20 UTC
  To: Rick Edgecombe
  Cc: Andrew Morton, Andrew Lutomirski, Will Deacon, Linux-MM, LKML,
	Kernel Hardening, Naveen N . Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Steven Rostedt, Ingo Molnar,
	Alexei Starovoitov, Daniel Borkmann, Jessica Yu, Nadav Amit,
	Network Development, Ard Biesheuvel, Jann Horn

On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> enabling vfree operations to immediately clear executable TLB entries to freed
> pages, and handle freeing memory with special permissions.
>
> In order to support vfree being called on memory that might be RO, the vfree
> deferred list node is moved to a kmalloc allocated struct, from where it is
> today, reusing the allocation being freed.
>
> arch_vunmap is a new __weak function that implements the actual unmapping and
> resetting of the direct map permissions. It can be overridden by more efficient
> architecture specific implementations.
>
> For the default implementation, it uses architecture agnostic methods which are
> equivalent to what most usages do before calling vfree. So now it is just
> centralized here.
>
> This implementation derives from two sketches from Dave Hansen and Andy
> Lutomirski.
>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  include/linux/vmalloc.h |  2 ++
>  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
>  2 files changed, 69 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 398e9c95cd61..872bcde17aca 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h */
>  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not fully initialized */
>  #define VM_NO_GUARD            0x00000040      /* don't add guard page */
>  #define VM_KASAN               0x00000080      /* has allocated kasan shadow memory */
> +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before releasing pages */
> +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with special perms */
>  /* bits [20..32] reserved for arch specific ioremap internals */
>
>  /*
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 97d4b25d0373..02b284d2245a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -18,6 +18,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
> +#include <linux/set_memory.h>
>  #include <linux/debugobjects.h>
>  #include <linux/kallsyms.h>
>  #include <linux/list.h>
> @@ -38,6 +39,11 @@
>
>  #include "internal.h"
>
> +struct vfree_work {
> +       struct llist_node node;
> +       void *addr;
> +};
> +
>  struct vfree_deferred {
>         struct llist_head list;
>         struct work_struct wq;
> @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
>  {
>         struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
>         struct llist_node *t, *llnode;
> +       struct vfree_work *cur;
>
> -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> -               __vunmap((void *)llnode, 1);
> +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> +               cur = container_of(llnode, struct vfree_work, node);
> +               __vunmap(cur->addr, 1);
> +               kfree(cur);
> +       }
>  }
>
>  /*** Page table manipulation functions ***/
> @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
>         return NULL;
>  }
>
> +/*
> + * This function handles unmapping and resetting the direct map as efficiently
> + * as it can with cross arch functions. The three categories of architectures
> + * are:
> + *   1. Architectures with no set_memory implementations and no direct map
> + *      permissions.
> + *   2. Architectures with set_memory implementations but no direct map
> + *      permissions
> + *   3. Architectures with set_memory implementations and direct map permissions
> + */
> +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)

My general preference is to avoid __weak functions -- they don't
optimize well.  Instead, I prefer either:

#ifndef arch_vunmap
void arch_vunmap(...);
#endif

or

#ifdef CONFIG_HAVE_ARCH_VUNMAP
...
#endif


> +{
> +       unsigned long addr = (unsigned long)area->addr;
> +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> +
> +       /*
> +        * In case of 2 and 3, use this general way of resetting the permissions
> +        * on the directmap. Do NX before RW, in case of X, so there is no W^X
> +        * violation window.
> +        *
> +        * For case 1 these will be noops.
> +        */
> +       if (immediate)
> +               set_memory_nx(addr, area->nr_pages);
> +       if (deallocate_pages && special)
> +               set_memory_rw(addr, area->nr_pages);

Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
want that alias gone before any deallocation happens".
VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
me, please".  deallocate means "this was vfree -- please free the
pages".  I'm not convinced that all the various combinations make
sense.  Do we really need both flags?

(VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
not immediate.)

If we do keep both flags, maybe some restructuring would make sense, like
this, perhaps:

if (special) {
	/* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
	WARN_ON_ONCE(!deallocate_pages);

	if (immediate) {
		/*
		 * It's possible that the vmap alias is X and we're about
		 * to make the direct map RW.  To avoid a window where
		 * executable memory is writable, first mark the vmap alias
		 * NX.  This is silly, since we're about to *unmap* it, but
		 * this is the best we can do if all we have to work with
		 * is the set_memory_abc() APIs.  Architectures should
		 * override this whole function to get better behavior.
		 */
		set_memory_nx(...);
	}

	set_memory_rw(addr, area->nr_pages);
}


> +
> +       /* Always actually remove the area */
> +       remove_vm_area(area->addr);
> +
> +       /*
> +        * Need to flush the TLB before freeing pages in the case of this flag.
> +        * As long as that's happening, unmap aliases.
> +        *
> +        * For 2 and 3, this will not be needed because of the set_memory_nx
> +        * above, because the stale TLBs will be NX.

I'm not sure I agree with this comment.  If the caller asked for an
immediate unmap, we should give an immediate unmap.  But I'm still not
sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.

> +        */
> +       if (immediate && !IS_ENABLED(CONFIG_ARCH_HAS_SET_MEMORY))
> +               vm_unmap_aliases();
> +}
> +
>  static void __vunmap(const void *addr, int deallocate_pages)
>  {
>         struct vm_struct *area;
> @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
>         debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
>         debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
>
> -       remove_vm_area(addr);
> +       arch_vunmap(area, deallocate_pages);
> +
>         if (deallocate_pages) {
>                 int i;
>
> @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void *addr)
>          * nother cpu's list.  schedule_work() should be fine with this too.
>          */
>         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work), GFP_ATOMIC);
> +
> +       /* If no memory for the deferred list node, give up */
> +       if (!w)
> +               return;

That's nasty.  I see what you're trying to do here, but I think you're
solving a problem that doesn't need solving quite so urgently.  How
about dropping this part and replacing it with a comment like "NB:
this writes a word to a potentially executable address.  It would be
nice if we could avoid doing this."  And maybe a future patch could
more robustly avoid it without risking memory leaks.
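
I.e., roughly just the current code with the comment added:

	/*
	 * NB: this writes a word to a potentially executable address.  It
	 * would be nice if we could avoid doing this.
	 */
	if (llist_add((struct llist_node *)addr, &p->list))
		schedule_work(&p->wq);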


* Re: [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap
  2018-12-12  0:03 ` [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap Rick Edgecombe
@ 2018-12-12  2:24   ` Andy Lutomirski
  2018-12-12 19:51     ` Edgecombe, Rick P
  2018-12-12  6:30   ` Nadav Amit
  1 sibling, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2018-12-12  2:24 UTC
  To: Rick Edgecombe
  Cc: Andrew Morton, Andrew Lutomirski, Will Deacon, Linux-MM, LKML,
	Kernel Hardening, Naveen N . Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Steven Rostedt, Ingo Molnar,
	Alexei Starovoitov, Daniel Borkmann, Jessica Yu, Nadav Amit,
	Network Development, Ard Biesheuvel, Jann Horn

On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> This adds a more efficient x86 architecture specific implementation of
> arch_vunmap, that can free any type of special permission memory with only 1 TLB
> flush.
>
> In order to enable this, __set_pages_p and __set_pages_np are made non-static and
> renamed set_pages_p_noflush and set_pages_np_noflush to better communicate
> their different (non-flushing) behavior from the rest of the set_pages_*
> functions.
>
> The method for doing this with only 1 TLB flush was suggested by Andy
> Lutomirski.
>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/include/asm/set_memory.h |  2 +
>  arch/x86/mm/Makefile              |  3 +-
>  arch/x86/mm/pageattr.c            | 11 +++--
>  arch/x86/mm/vmalloc.c             | 71 +++++++++++++++++++++++++++++++
>  4 files changed, 80 insertions(+), 7 deletions(-)
>  create mode 100644 arch/x86/mm/vmalloc.c
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 07a25753e85c..70ee81e8914b 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -84,6 +84,8 @@ int set_pages_x(struct page *page, int numpages);
>  int set_pages_nx(struct page *page, int numpages);
>  int set_pages_ro(struct page *page, int numpages);
>  int set_pages_rw(struct page *page, int numpages);
> +int set_pages_np_noflush(struct page *page, int numpages);
> +int set_pages_p_noflush(struct page *page, int numpages);
>
>  extern int kernel_set_to_readonly;
>  void set_kernel_text_rw(void);
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 4b101dd6e52f..189681f863a6 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -13,7 +13,8 @@ CFLAGS_REMOVE_mem_encrypt_identity.o  = -pg
>  endif
>
>  obj-y  :=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
> -           pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o
> +           pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o \
> +           vmalloc.o
>
>  # Make sure __phys_addr has no stackprotector
>  nostackp := $(call cc-option, -fno-stack-protector)
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index db7a10082238..db0a4dfb5a7f 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -2248,9 +2248,7 @@ int set_pages_rw(struct page *page, int numpages)
>         return set_memory_rw(addr, numpages);
>  }
>
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> -
> -static int __set_pages_p(struct page *page, int numpages)
> +int set_pages_p_noflush(struct page *page, int numpages)

Maybe set_pages_rwp_noflush()?

> diff --git a/arch/x86/mm/vmalloc.c b/arch/x86/mm/vmalloc.c
> new file mode 100644
> index 000000000000..be9ea42c3dfe
> --- /dev/null
> +++ b/arch/x86/mm/vmalloc.c
> @@ -0,0 +1,71 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * vmalloc.c: x86 arch version of vmalloc.c
> + *
> + * (C) Copyright 2018 Intel Corporation
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.

This paragraph may be redundant with the SPDX line.

> + */
> +
> +#include <linux/mm.h>
> +#include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +
> +static void set_area_direct_np(struct vm_struct *area)
> +{
> +       int i;
> +
> +       for (i = 0; i < area->nr_pages; i++)
> +               set_pages_np_noflush(area->pages[i], 1);
> +}
> +
> +static void set_area_direct_prw(struct vm_struct *area)
> +{
> +       int i;
> +
> +       for (i = 0; i < area->nr_pages; i++)
> +               set_pages_p_noflush(area->pages[i], 1);
> +}
> +
> +void arch_vunmap(struct vm_struct *area, int deallocate_pages)
> +{
> +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> +
> +       /* Unmap from vmalloc area */
> +       remove_vm_area(area->addr);
> +
> +       /* If no need to reset directmap perms, just check if need to flush */
> +       if (!(deallocate_pages || special)) {
> +               if (immediate)
> +                       vm_unmap_aliases();
> +               return;
> +       }
> +
> +       /* From here we need to make sure to reset the direct map perms */
> +
> +       /*
> +        * If the area being freed does not have any extra capabilities, we can
> +        * just reset the directmap to RW before freeing.
> +        */
> +       if (!immediate) {
> +               set_area_direct_prw(area);
> +               vm_unmap_aliases();
> +               return;
> +       }
> +
> +       /*
> +        * If the vm being freed has security sensitive capabilities such as
> +        * executable we need to make sure there is no W window on the directmap
> +        * before removing the X in the TLB. So we set not present first so we
> +        * can flush without any other CPU picking up the mapping. Then we reset
> +        * RW+P without a flush, since NP prevented it from being cached by
> +        * other cpus.
> +        */
> +       set_area_direct_np(area);
> +       vm_unmap_aliases();
> +       set_area_direct_prw(area);

Here you're using "immediate" as a proxy for "was executable".  And
it's barely faster to omit immediate -- it's the same number of
flushes, and all you save is one pass over the direct map.

Do we really need to support all these combinations?  Even if we do
support them, I think that "immediate" needs a better name.


* Re: [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap
  2018-12-12  0:03 ` [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap Rick Edgecombe
  2018-12-12  2:24   ` Andy Lutomirski
@ 2018-12-12  6:30   ` Nadav Amit
  2018-12-12 21:05     ` Edgecombe, Rick P
  1 sibling, 1 reply; 26+ messages in thread
From: Nadav Amit @ 2018-12-12  6:30 UTC
  To: Rick Edgecombe
  Cc: Andrew Morton, Andy Lutomirski, Will Deacon, linux-mm,
	linux-kernel, kernel-hardening, naveen.n.rao,
	anil.s.keshavamurthy, davem, mhiramat, rostedt, mingo, ast,
	daniel, jeyu, netdev@vger.kernel.org

> On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> 
> This adds a more efficient x86 architecture specific implementation of
> arch_vunmap, that can free any type of special permission memory with only 1 TLB
> flush.
> 
> In order to enable this, __set_pages_p and __set_pages_np are made non-static and
> renamed set_pages_p_noflush and set_pages_np_noflush to better communicate
> their different (non-flushing) behavior from the rest of the set_pages_*
> functions.
> 
> The method for doing this with only 1 TLB flush was suggested by Andy
> Lutomirski.
> 

[snip]

> +	/*
> +	 * If the vm being freed has security sensitive capabilities such as
> +	 * executable we need to make sure there is no W window on the directmap
> +	 * before removing the X in the TLB. So we set not present first so we
> +	 * can flush without any other CPU picking up the mapping. Then we reset
> +	 * RW+P without a flush, since NP prevented it from being cached by
> +	 * other cpus.
> +	 */
> +	set_area_direct_np(area);
> +	vm_unmap_aliases();

Does vm_unmap_aliases() flush the direct map range from the TLB as well? I
can only find the flush of the vmalloc range.


* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12  2:20   ` Andy Lutomirski
@ 2018-12-12 19:50     ` Edgecombe, Rick P
  2018-12-12 19:57       ` Andy Lutomirski
  2018-12-21 16:39     ` Ard Biesheuvel
  1 sibling, 1 reply; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-12 19:50 UTC
  To: luto
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, namit, kernel-hardening, Keshavamurthy, Anil S

On Tue, 2018-12-11 at 18:20 -0800, Andy Lutomirski wrote:
> On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> > enabling vfree operations to immediately clear executable TLB entries to
> > freed
> > pages, and handle freeing memory with special permissions.
> > 
> > In order to support vfree being called on memory that might be RO, the vfree
> > deferred list node is moved to a kmalloc allocated struct, from where it is
> > today, reusing the allocation being freed.
> > 
> > arch_vunmap is a new __weak function that implements the actual unmapping
> > and
> > resetting of the direct map permissions. It can be overridden by more
> > efficient
> > architecture specific implementations.
> > 
> > For the default implementation, it uses architecture agnostic methods which
> > are
> > equivalent to what most usages do before calling vfree. So now it is just
> > centralized here.
> > 
> > This implementation derives from two sketches from Dave Hansen and Andy
> > Lutomirski.
> > 
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > Suggested-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> >  include/linux/vmalloc.h |  2 ++
> >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 69 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 398e9c95cd61..872bcde17aca 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h */
> >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not fully
> > initialized */
> >  #define VM_NO_GUARD            0x00000040      /* don't add guard page */
> >  #define VM_KASAN               0x00000080      /* has allocated kasan
> > shadow memory */
> > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before releasing
> > pages */
> > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with special
> > perms */
> >  /* bits [20..32] reserved for arch specific ioremap internals */
> > 
> >  /*
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 97d4b25d0373..02b284d2245a 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/interrupt.h>
> >  #include <linux/proc_fs.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/set_memory.h>
> >  #include <linux/debugobjects.h>
> >  #include <linux/kallsyms.h>
> >  #include <linux/list.h>
> > @@ -38,6 +39,11 @@
> > 
> >  #include "internal.h"
> > 
> > +struct vfree_work {
> > +       struct llist_node node;
> > +       void *addr;
> > +};
> > +
> >  struct vfree_deferred {
> >         struct llist_head list;
> >         struct work_struct wq;
> > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> >  {
> >         struct vfree_deferred *p = container_of(w, struct vfree_deferred,
> > wq);
> >         struct llist_node *t, *llnode;
> > +       struct vfree_work *cur;
> > 
> > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > -               __vunmap((void *)llnode, 1);
> > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > +               cur = container_of(llnode, struct vfree_work, node);
> > +               __vunmap(cur->addr, 1);
> > +               kfree(cur);
> > +       }
> >  }
> > 
> >  /*** Page table manipulation functions ***/
> > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
> >         return NULL;
> >  }
> > 
> > +/*
> > + * This function handles unmapping and resetting the direct map as
> > efficiently
> > + * as it can with cross arch functions. The three categories of
> > architectures
> > + * are:
> > + *   1. Architectures with no set_memory implementations and no direct map
> > + *      permissions.
> > + *   2. Architectures with set_memory implementations but no direct map
> > + *      permissions
> > + *   3. Architectures with set_memory implementations and direct map
> > permissions
> > + */
> > +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
> 
> My general preference is to avoid __weak functions -- they don't
> optimize well.  Instead, I prefer either:
> 
> #ifndef arch_vunmap
> void arch_vunmap(...);
> #endif
> 
> or
> 
> #ifdef CONFIG_HAVE_ARCH_VUNMAP
> ...
> #endif
Ok.
> 
> > +{
> > +       unsigned long addr = (unsigned long)area->addr;
> > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > +
> > +       /*
> > +        * In case of 2 and 3, use this general way of resetting the
> > permissions
> > +        * on the directmap. Do NX before RW, in case of X, so there is no
> > W^X
> > +        * violation window.
> > +        *
> > +        * For case 1 these will be noops.
> > +        */
> > +       if (immediate)
> > +               set_memory_nx(addr, area->nr_pages);
> > +       if (deallocate_pages && special)
> > +               set_memory_rw(addr, area->nr_pages);
> 
> Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> want that alias gone before any deallocation happens".
> VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> me, please".  deallocate means "this was vfree -- please free the
> pages".  I'm not convinced that all the various combinations make
> sense.  Do we really need both flags?
VM_HAS_SPECIAL_PERMS is supposed to mean, like you said, "reset the direct
map". VM_IMMEDIATE_UNMAP means the vmalloc allocation has extra capabilities,
so we don't want to leave a TLB entry with those capabilities pointing to the
freed page.

I was trying to pick names that could apply more generally to potential
future special memory capabilities. Today VM_HAS_SPECIAL_PERMS just means
"reset the direct map to writable" and VM_IMMEDIATE_UNMAP just means "the
vmalloc mapping is executable".

A present-day reason for keeping both flags is that it is more efficient in
the arch-agnostic implementation when freeing memory that is just RO and not
executable: it saves a TLB flush.
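
For the just-RO case, the generic arch_vunmap boils down to:

	set_memory_rw(addr, area->nr_pages);	/* does its own flush */
	remove_vm_area(area->addr);

whereas with VM_IMMEDIATE_UNMAP also set there is an additional flush from
the set_memory_nx() call that has to come first.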

> (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> not immediate.)
True, maybe VM_MUST_FLUSH or something else?

> If we do keep both flags, maybe some restructuring would make sense,
> like this, perhaps.  Sorry about horrible whitespace damage.
> 
> if (special) {
>   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
>   WARN_ON_ONCE(!deallocate_pages);
> 
>   if (immediate) {
>     /* It's possible that the vmap alias is X and we're about to make
> the direct map RW.  To avoid a window where executable memory is
> writable, first mark the vmap alias NX.  This is silly, since we're
> about to *unmap* it, but this is the best we can do if all we have to
> work with is the set_memory_abc() APIs.  Architectures should override
> this whole function to get better behavior. */
>     set_memory_nx(...);
>   }
> 
>   set_memory_rw(addr, area->nr_pages);
> }
Ok.

> 
> > +
> > +       /* Always actually remove the area */
> > +       remove_vm_area(area->addr);
> > +
> > +       /*
> > +        * Need to flush the TLB before freeing pages in the case of this
> > flag.
> > +        * As long as that's happening, unmap aliases.
> > +        *
> > +        * For 2 and 3, this will not be needed because of the set_memory_nx
> > +        * above, because the stale TLBs will be NX.
> 
> I'm not sure I agree with this comment.  If the caller asked for an
> immediate unmap, we should give an immediate unmap.  But I'm still not
> sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
Yea, I was just trying to save a TLB flush, since for today's callers that
have set_memory there isn't a security downside I know of to just leaving the
stale TLB entries NX. Maybe it's not worth the tradeoff in confusion? Or I
can clarify that in the comment.

> > +        */
> > +       if (immediate && !IS_ENABLED(CONFIG_ARCH_HAS_SET_MEMORY))
> > +               vm_unmap_aliases();
> > +}
> > +
> >  static void __vunmap(const void *addr, int deallocate_pages)
> >  {
> >         struct vm_struct *area;
> > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int
> > deallocate_pages)
> >         debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
> >         debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
> > 
> > -       remove_vm_area(addr);
> > +       arch_vunmap(area, deallocate_pages);
> > +
> >         if (deallocate_pages) {
> >                 int i;
> > 
> > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void *addr)
> >          * nother cpu's list.  schedule_work() should be fine with this too.
> >          */
> >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work),
> > GFP_ATOMIC);
> > +
> > +       /* If no memory for the deferred list node, give up */
> > +       if (!w)
> > +               return;
> 
> That's nasty.  I see what you're trying to do here, but I think you're
> solving a problem that doesn't need solving quite so urgently.  How
> about dropping this part and replacing it with a comment like "NB:
> this writes a word to a potentially executable address.  It would be
> nice if we could avoid doing this."  And maybe a future patch could
> more robustly avoid it without risking memory leaks.
Yea, sorry, I should have called this out; I wasn't sure how likely that was
to happen. I did find some other places in the kernel with the same
ignore-the-failure logic.

I'll have to think, though; I am not sure what the alternative is. Since the
memory can be RO in the module_memfree case, the old method of re-using the
allocation will no longer work. The list node could be stuffed on the
vm_struct, but then all of the spin_lock(&vmap_area_lock) calls would need to
be changed to work with interrupts so that the struct could be looked up. Not
sure of the implications of that. Or maybe have some slow backup that resets
the permissions and re-uses the allocation if kmalloc fails, like the sketch
below?
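
Something like this, maybe (untested sketch; set_memory_rw probably can't be
called from the contexts __vfree_deferred runs in, and free_work would also
have to learn not to kfree the node in the fallback case, so this would need
more thought):

	struct vfree_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

	if (!w) {
		/* Slow backup: make the allocation writable again and
		 * stuff the node inside it, like the old code did. */
		set_memory_rw((unsigned long)addr, 1);
		w = (struct vfree_work *)addr;
	}

	w->addr = (void *)addr;
	if (llist_add(&w->node, &p->list))
		schedule_work(&p->wq);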

I guess it could also go back to the old v1 implementation that doesn't handle
RO and the directmap, and leave the W^X violation window during teardown. Then
solve that problem when modules are loaded via something like Nadav's stuff.



* Re: [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap
  2018-12-12  2:24   ` Andy Lutomirski
@ 2018-12-12 19:51     ` Edgecombe, Rick P
  0 siblings, 0 replies; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-12 19:51 UTC
  To: luto
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, namit, kernel-hardening, Keshavamurthy, Anil S

On Tue, 2018-12-11 at 18:24 -0800, Andy Lutomirski wrote:
> On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > This adds a more efficient x86 architecture specific implementation of
> > arch_vunmap, that can free any type of special permission memory with only 1
> > TLB
> > flush.
> > 
> > In order to enable this, __set_pages_p and __set_pages_np are made non-static
> > and
> > renamed set_pages_p_noflush and set_pages_np_noflush to better communicate
> > their different (non-flushing) behavior from the rest of the set_pages_*
> > functions.
> > 
> > The method for doing this with only 1 TLB flush was suggested by Andy
> > Lutomirski.
> > 
> > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> >  arch/x86/include/asm/set_memory.h |  2 +
> >  arch/x86/mm/Makefile              |  3 +-
> >  arch/x86/mm/pageattr.c            | 11 +++--
> >  arch/x86/mm/vmalloc.c             | 71 +++++++++++++++++++++++++++++++
> >  4 files changed, 80 insertions(+), 7 deletions(-)
> >  create mode 100644 arch/x86/mm/vmalloc.c
> > 
> > diff --git a/arch/x86/include/asm/set_memory.h
> > b/arch/x86/include/asm/set_memory.h
> > index 07a25753e85c..70ee81e8914b 100644
> > --- a/arch/x86/include/asm/set_memory.h
> > +++ b/arch/x86/include/asm/set_memory.h
> > @@ -84,6 +84,8 @@ int set_pages_x(struct page *page, int numpages);
> >  int set_pages_nx(struct page *page, int numpages);
> >  int set_pages_ro(struct page *page, int numpages);
> >  int set_pages_rw(struct page *page, int numpages);
> > +int set_pages_np_noflush(struct page *page, int numpages);
> > +int set_pages_p_noflush(struct page *page, int numpages);
> > 
> >  extern int kernel_set_to_readonly;
> >  void set_kernel_text_rw(void);
> > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> > index 4b101dd6e52f..189681f863a6 100644
> > --- a/arch/x86/mm/Makefile
> > +++ b/arch/x86/mm/Makefile
> > @@ -13,7 +13,8 @@ CFLAGS_REMOVE_mem_encrypt_identity.o  = -pg
> >  endif
> > 
> >  obj-y  :=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o
> > mmap.o \
> > -           pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o
> > +           pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o \
> > +           vmalloc.o
> > 
> >  # Make sure __phys_addr has no stackprotector
> >  nostackp := $(call cc-option, -fno-stack-protector)
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index db7a10082238..db0a4dfb5a7f 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -2248,9 +2248,7 @@ int set_pages_rw(struct page *page, int numpages)
> >         return set_memory_rw(addr, numpages);
> >  }
> > 
> > -#ifdef CONFIG_DEBUG_PAGEALLOC
> > -
> > -static int __set_pages_p(struct page *page, int numpages)
> > +int set_pages_p_noflush(struct page *page, int numpages)
> 
> Maybe set_pages_rwp_noflush()?
Sure.
> > diff --git a/arch/x86/mm/vmalloc.c b/arch/x86/mm/vmalloc.c
> > new file mode 100644
> > index 000000000000..be9ea42c3dfe
> > --- /dev/null
> > +++ b/arch/x86/mm/vmalloc.c
> > @@ -0,0 +1,71 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * vmalloc.c: x86 arch version of vmalloc.c
> > + *
> > + * (C) Copyright 2018 Intel Corporation
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; version 2
> > + * of the License.
> 
> This paragraph may be redundant with the SPDX line.
Ok.
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/set_memory.h>
> > +#include <linux/vmalloc.h>
> > +
> > +static void set_area_direct_np(struct vm_struct *area)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < area->nr_pages; i++)
> > +               set_pages_np_noflush(area->pages[i], 1);
> > +}
> > +
> > +static void set_area_direct_prw(struct vm_struct *area)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < area->nr_pages; i++)
> > +               set_pages_p_noflush(area->pages[i], 1);
> > +}
> > +
> > +void arch_vunmap(struct vm_struct *area, int deallocate_pages)
> > +{
> > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > +
> > +       /* Unmap from vmalloc area */
> > +       remove_vm_area(area->addr);
> > +
> > +       /* If no need to reset directmap perms, just check if need to flush
> > */
> > +       if (!(deallocate_pages || special)) {
> > +               if (immediate)
> > +                       vm_unmap_aliases();
> > +               return;
> > +       }
> > +
> > +       /* From here we need to make sure to reset the direct map perms */
> > +
> > +       /*
> > +        * If the area being freed does not have any extra capabilities, we
> > can
> > +        * just reset the directmap to RW before freeing.
> > +        */
> > +       if (!immediate) {
> > +               set_area_direct_prw(area);
> > +               vm_unmap_aliases();
> > +               return;
> > +       }
> > +
> > +       /*
> > +        * If the vm being freed has security sensitive capabilities such as
> > +        * executable we need to make sure there is no W window on the
> > directmap
> > +        * before removing the X in the TLB. So we set not present first so
> > we
> > +        * can flush without any other CPU picking up the mapping. Then we
> > reset
> > +        * RW+P without a flush, since NP prevented it from being cached by
> > +        * other cpus.
> > +        */
> > +       set_area_direct_np(area);
> > +       vm_unmap_aliases();
> > +       set_area_direct_prw(area);
> 
> Here you're using "immediate" as a proxy for "was executable".  And
> it's barely faster to omit immediate -- it's the same number of
> flushes, and all you save is one pass over the direct map.
Not sure I understand; I think it's still generic to "special capabilities". Do
you mean fix the comment?

> Do we really need to support all these combinations?  Even if we do
> support them, I think that "immediate" needs a better name.
It saves TLB flushes in the generic implementation for the just-RO case, but
yeah, here it doesn't save much.
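
To spell out the flush accounting in the generic version (as I understand it):

/*
 * Generic arch_vunmap flush counts (sketch):
 *   just-RO free (special only):  set_memory_rw()                   -> 1 flush
 *   executable free (immediate):  set_memory_nx() + set_memory_rw() -> 2 flushes
 * Folding the two flags together would force the extra NX flush onto just-RO
 * frees; the x86 version here is 1 flush either way.
 */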

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12 19:50     ` Edgecombe, Rick P
@ 2018-12-12 19:57       ` Andy Lutomirski
  2018-12-12 22:01         ` Edgecombe, Rick P
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2018-12-12 19:57 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andrew Lutomirski, LKML, Daniel Borkmann, Jessica Yu,
	Steven Rostedt, Alexei Starovoitov, Ard Biesheuvel, Linux-MM,
	Jann Horn, Dock, Deneen T, Kristen Carlson Accardi,
	Andrew Morton, Will Deacon, Ingo Molnar, Nadav Amit,
	Kernel Hardening, Anil S Keshavamurthy, Masami Hiramatsu,
	Naveen N . Rao

On Wed, Dec 12, 2018 at 11:50 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-11 at 18:20 -0800, Andy Lutomirski wrote:
> > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> > > enabling vfree operations to immediately clear executable TLB entries to
> > > freed
> > > pages, and handle freeing memory with special permissions.
> > >
> > > In order to support vfree being called on memory that might be RO, the vfree
> > > deferred list node is moved to a kmalloc allocated struct, from where it is
> > > today, reusing the allocation being freed.
> > >
> > > arch_vunmap is a new __weak function that implements the actual unmapping
> > > and
> > > resetting of the direct map permissions. It can be overridden by more
> > > efficient
> > > architecture specific implementations.
> > >
> > > For the default implementation, it uses architecture agnostic methods which
> > > are
> > > equivalent to what most usages do before calling vfree. So now it is just
> > > centralized here.
> > >
> > > This implementation derives from two sketches from Dave Hansen and Andy
> > > Lutomirski.
> > >
> > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > ---
> > >  include/linux/vmalloc.h |  2 ++
> > >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
> > >  2 files changed, 69 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > index 398e9c95cd61..872bcde17aca 100644
> > > --- a/include/linux/vmalloc.h
> > > +++ b/include/linux/vmalloc.h
> > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h */
> > >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not fully
> > > initialized */
> > >  #define VM_NO_GUARD            0x00000040      /* don't add guard page */
> > >  #define VM_KASAN               0x00000080      /* has allocated kasan
> > > shadow memory */
> > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before releasing
> > > pages */
> > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with special
> > > perms */
> > >  /* bits [20..32] reserved for arch specific ioremap internals */
> > >
> > >  /*
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 97d4b25d0373..02b284d2245a 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/interrupt.h>
> > >  #include <linux/proc_fs.h>
> > >  #include <linux/seq_file.h>
> > > +#include <linux/set_memory.h>
> > >  #include <linux/debugobjects.h>
> > >  #include <linux/kallsyms.h>
> > >  #include <linux/list.h>
> > > @@ -38,6 +39,11 @@
> > >
> > >  #include "internal.h"
> > >
> > > +struct vfree_work {
> > > +       struct llist_node node;
> > > +       void *addr;
> > > +};
> > > +
> > >  struct vfree_deferred {
> > >         struct llist_head list;
> > >         struct work_struct wq;
> > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> > >  {
> > >         struct vfree_deferred *p = container_of(w, struct vfree_deferred,
> > > wq);
> > >         struct llist_node *t, *llnode;
> > > +       struct vfree_work *cur;
> > >
> > > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > > -               __vunmap((void *)llnode, 1);
> > > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > > +               cur = container_of(llnode, struct vfree_work, node);
> > > +               __vunmap(cur->addr, 1);
> > > +               kfree(cur);
> > > +       }
> > >  }
> > >
> > >  /*** Page table manipulation functions ***/
> > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
> > >         return NULL;
> > >  }
> > >
> > > +/*
> > > + * This function handles unmapping and resetting the direct map as
> > > efficiently
> > > + * as it can with cross arch functions. The three categories of
> > > architectures
> > > + * are:
> > > + *   1. Architectures with no set_memory implementations and no direct map
> > > + *      permissions.
> > > + *   2. Architectures with set_memory implementations but no direct map
> > > + *      permissions
> > > + *   3. Architectures with set_memory implementations and direct map
> > > permissions
> > > + */
> > > +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
> >
> > My general preference is to avoid __weak functions -- they don't
> > optimize well.  Instead, I prefer either:
> >
> > #ifndef arch_vunmap
> > void arch_vunmap(...);
> > #endif
> >
> > or
> >
> > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > ...
> > #endif
> Ok.
> >
> > > +{
> > > +       unsigned long addr = (unsigned long)area->addr;
> > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > +
> > > +       /*
> > > +        * In case of 2 and 3, use this general way of resetting the
> > > permissions
> > > +        * on the directmap. Do NX before RW, in case of X, so there is no
> > > W^X
> > > +        * violation window.
> > > +        *
> > > +        * For case 1 these will be noops.
> > > +        */
> > > +       if (immediate)
> > > +               set_memory_nx(addr, area->nr_pages);
> > > +       if (deallocate_pages && special)
> > > +               set_memory_rw(addr, area->nr_pages);
> >
> > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> > want that alias gone before any deallocation happens".
> > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> > me, please".  deallocate means "this was vfree -- please free the
> > pages".  I'm not convinced that all the various combinations make
> > sense.  Do we really need both flags?
> VM_HAS_SPECIAL_PERMS is supposed to mean, like you said, "reset the direct map".
> Where VM_IMMEDIATE_UNMAP means, the vmalloc allocation has extra capabilties
> where we don't want to leave an enhanced capability TLB entry to the freed page.
>
> I was trying to pick names that could apply more generally for potential future
> special memory capabilities. Today VM_HAS_SPECIAL_PERMS does just mean reset
> write to the directmap and VM_IMMEDIATE_UNMAP means vmalloc mapping is
> executable.
>
> A present day reason for keeping both flags is, it is more efficient in the
> arch-agnostic implementation when freeing memory that is just RO and not
> executable. It saves a TLB flush.
>
> > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> > not immediate.)
> True, maybe VM_MUST_FLUSH or something else?
>
> > If we do keep both flags, maybe some restructuring would make sense,
> > like this, perhaps.  Sorry about horrible whitespace damage.
> >
> > if (special) {
> >   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
> >   WARN_ON_ONCE(!deallocate_pages);
> >
> >   if (immediate) {
> >     /* It's possible that the vmap alias is X and we're about to make
> > the direct map RW.  To avoid a window where executable memory is
> > writable, first mark the vmap alias NX.  This is silly, since we're
> > about to *unmap* it, but this is the best we can do if all we have to
> > work with is the set_memory_abc() APIs.  Architectures should override
> > this whole function to get better behavior. */
> >     set_memory_nx(...);
> >   }
> >
> >   set_memory_rw(addr, area->nr_pages);
> > }
> Ok.
>
> >
> > > +
> > > +       /* Always actually remove the area */
> > > +       remove_vm_area(area->addr);
> > > +
> > > +       /*
> > > +        * Need to flush the TLB before freeing pages in the case of this
> > > flag.
> > > +        * As long as that's happening, unmap aliases.
> > > +        *
> > > +        * For 2 and 3, this will not be needed because of the set_memory_nx
> > > +        * above, because the stale TLBs will be NX.
> >
> > I'm not sure I agree with this comment.  If the caller asked for an
> > immediate unmap, we should give an immediate unmap.  But I'm still not
> > sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
> Yea. I was just trying to save a TLB flush, since for today's callers that have
> set_memory there isn't a security downside I know of to just leaving it NX.
> Maybe its not worth the tradeoff of confusion? Or I can clarify that in the
> comment.

Don't both of the users in your series set both flags, though?  My
real objection to having them be separate is that, in the absence of
users, it's less clear exactly what they should do and the code
doesn't get exercised.

If you document that VM_IMMEDIATE_UNMAP means "I want the TLB entries
gone", then I can re-review the code in light of that.  But then I'm
unconvinced by your generic implementation, since set_memory_nx()
seems like an odd way to go about it.

>
> > > +        */
> > > +       if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> > > +               vm_unmap_aliases();
> > > +}
> > > +
> > >  static void __vunmap(const void *addr, int deallocate_pages)
> > >  {
> > >         struct vm_struct *area;
> > > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int
> > > deallocate_pages)
> > >         debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
> > >         debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
> > >
> > > -       remove_vm_area(addr);
> > > +       arch_vunmap(area, deallocate_pages);
> > > +
> > >         if (deallocate_pages) {
> > >                 int i;
> > >
> > > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void *addr)
> > >          * nother cpu's list.  schedule_work() should be fine with this too.
> > >          */
> > >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work),
> > > GFP_ATOMIC);
> > > +
> > > +       /* If no memory for the deferred list node, give up */
> > > +       if (!w)
> > > +               return;
> >
> > That's nasty.  I see what you're trying to do here, but I think you're
> > solving a problem that doesn't need solving quite so urgently.  How
> > about dropping this part and replacing it with a comment like "NB:
> > this writes a word to a potentially executable address.  It would be
> > nice if we could avoid doing this."  And maybe a future patch could
> > more robustly avoid it without risking memory leaks.
> Yea, sorry I should have called this out, because I wasn't sure on how likely
> that was to happen. I did find some other places in the kernel with the same
> ignoring logic.
>
> I'll have to think though, I am not sure what the alternative is. Since the
> memory can be RO in the module_memfree case, the old method of re-using the
> allocation will no longer work. The list node could be stuffed on the vm_struct,
> but then the all of the spin_lock(&vmap_area_lock)'s need to be changed to work
> with interrupts so that the struct could be looked up. Not sure of the
> implications of that. Or maybe have some slow backup that resets the permissions
> and re-uses the allocation if kmalloc fails?
>
> I guess it could also go back to the old v1 implementation that doesn't handle
> RO and the directmap, and leave the W^X violation window during teardown. Then
> solve that problem when modules are loaded via something like Nadav's stuff.
>

Hmm.  Switching to spin_lock_irqsave() doesn't seem so bad to me.
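
Roughly, each locked section would just become (sketch):

	unsigned long flags;

	spin_lock_irqsave(&vmap_area_lock, flags);
	/* ... existing vmap_area lookup/insert/removal ... */
	spin_unlock_irqrestore(&vmap_area_lock, flags);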

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap
  2018-12-12  6:30   ` Nadav Amit
@ 2018-12-12 21:05     ` Edgecombe, Rick P
  2018-12-12 21:16       ` Nadav Amit
  0 siblings, 1 reply; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-12 21:05 UTC (permalink / raw)
  To: namit
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, luto, kernel-hardening, Keshavamurthy, Anil S

On Wed, 2018-12-12 at 06:30 +0000, Nadav Amit wrote:
> > On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > wrote:
> > 
> > This adds a more efficient x86 architecture specific implementation of
> > arch_vunmap, that can free any type of special permission memory with only 1
> > TLB
> > flush.
> > 
> > In order to enable this, _set_pages_p and _set_pages_np are made non-static
> > and
> > renamed set_pages_p_noflush and set_pages_np_noflush to better communicate
> > their different (non-flushing) behavior from the rest of the set_pages_*
> > functions.
> > 
> > The method for doing this with only 1 TLB flush was suggested by Andy
> > Lutomirski.
> > 
> 
> [snip]
> 
> > +	/*
> > +	 * If the vm being freed has security sensitive capabilities such as
> > +	 * executable we need to make sure there is no W window on the directmap
> > +	 * before removing the X in the TLB. So we set not present first so we
> > +	 * can flush without any other CPU picking up the mapping. Then we reset
> > +	 * RW+P without a flush, since NP prevented it from being cached by
> > +	 * other cpus.
> > +	 */
> > +	set_area_direct_np(area);
> > +	vm_unmap_aliases();
> 
> Does vm_unmap_aliases() flush in the TLB the direct mapping range as well? I
> can only find the flush of the vmalloc range.
Hmmm. It should usually (I tested), but now I wonder if there are cases where it
doesn't and it could depend on architecture as well. I'll have to trace through
this to verify, thanks.

Rick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap
  2018-12-12 21:05     ` Edgecombe, Rick P
@ 2018-12-12 21:16       ` Nadav Amit
  0 siblings, 0 replies; 26+ messages in thread
From: Nadav Amit @ 2018-12-12 21:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, luto, kernel-hardening, Keshavamurthy, Anil S, mhiramat,
	naveen.n.rao@linux.vnet.ibm.com

> On Dec 12, 2018, at 1:05 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2018-12-12 at 06:30 +0000, Nadav Amit wrote:
>>> On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> wrote:
>>> 
>>> This adds a more efficient x86 architecture specific implementation of
>>> arch_vunmap, that can free any type of special permission memory with only 1
>>> TLB
>>> flush.
>>> 
>>> In order to enable this, _set_pages_p and _set_pages_np are made non-static
>>> and
>>> renamed set_pages_p_noflush and set_pages_np_noflush to better communicate
>>> their different (non-flushing) behavior from the rest of the set_pages_*
>>> functions.
>>> 
>>> The method for doing this with only 1 TLB flush was suggested by Andy
>>> Lutomirski.
>> 
>> [snip]
>> 
>>> +	/*
>>> +	 * If the vm being freed has security sensitive capabilities such as
>>> +	 * executable we need to make sure there is no W window on the directmap
>>> +	 * before removing the X in the TLB. So we set not present first so we
>>> +	 * can flush without any other CPU picking up the mapping. Then we reset
>>> +	 * RW+P without a flush, since NP prevented it from being cached by
>>> +	 * other cpus.
>>> +	 */
>>> +	set_area_direct_np(area);
>>> +	vm_unmap_aliases();
>> 
>> Does vm_unmap_aliases() flush in the TLB the direct mapping range as well? I
>> can only find the flush of the vmalloc range.
> Hmmm. It should usually (I tested), but now I wonder if there are cases where it
> doesn't and it could depend on architecture as well. I'll have to trace through
> this to verify, thanks.

I think that it mostly does, since you try to flush more than 33 PTEs (the
threshold to flush the whole TLB instead of individual entries). But you
shouldn’t count on it. Even this threshold is configurable.
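
For reference, this is the relevant logic in x86's flush_tlb_kernel_range()
(simplified sketch; tlb_single_page_flush_ceiling defaults to 33 and is
tunable via debugfs, and "info" stands in for the flush_tlb_info describing
the range):

	if (end == TLB_FLUSH_ALL ||
	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT)
		on_each_cpu(do_flush_tlb_all, NULL, 1);	/* hits the direct map too */
	else
		on_each_cpu(do_kernel_range_flush, &info, 1); /* flushed range only */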


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12 19:57       ` Andy Lutomirski
@ 2018-12-12 22:01         ` Edgecombe, Rick P
  2018-12-15 18:52           ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-12 22:01 UTC (permalink / raw)
  To: luto
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, namit, kernel-hardening, Keshavamurthy, Anil S

On Wed, 2018-12-12 at 11:57 -0800, Andy Lutomirski wrote:
> On Wed, Dec 12, 2018 at 11:50 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Tue, 2018-12-11 at 18:20 -0800, Andy Lutomirski wrote:
> > > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > > <rick.p.edgecombe@intel.com> wrote:
> > > > 
> > > > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> > > > enabling vfree operations to immediately clear executable TLB entries to
> > > > freed
> > > > pages, and handle freeing memory with special permissions.
> > > > 
> > > > In order to support vfree being called on memory that might be RO, the
> > > > vfree
> > > > deferred list node is moved to a kmalloc allocated struct, from where it
> > > > is
> > > > today, reusing the allocation being freed.
> > > > 
> > > > arch_vunmap is a new __weak function that implements the actual
> > > > unmapping
> > > > and
> > > > resetting of the direct map permissions. It can be overridden by more
> > > > efficient
> > > > architecture specific implementations.
> > > > 
> > > > For the default implementation, it uses architecture agnostic methods
> > > > which
> > > > are
> > > > equivalent to what most usages do before calling vfree. So now it is
> > > > just
> > > > centralized here.
> > > > 
> > > > This implementation derives from two sketches from Dave Hansen and Andy
> > > > Lutomirski.
> > > > 
> > > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > ---
> > > >  include/linux/vmalloc.h |  2 ++
> > > >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
> > > >  2 files changed, 69 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > > index 398e9c95cd61..872bcde17aca 100644
> > > > --- a/include/linux/vmalloc.h
> > > > +++ b/include/linux/vmalloc.h
> > > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h
> > > > */
> > > >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not
> > > > fully
> > > > initialized */
> > > >  #define VM_NO_GUARD            0x00000040      /* don't add guard page
> > > > */
> > > >  #define VM_KASAN               0x00000080      /* has allocated kasan
> > > > shadow memory */
> > > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before
> > > > releasing
> > > > pages */
> > > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with
> > > > special
> > > > perms */
> > > >  /* bits [20..32] reserved for arch specific ioremap internals */
> > > > 
> > > >  /*
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 97d4b25d0373..02b284d2245a 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -18,6 +18,7 @@
> > > >  #include <linux/interrupt.h>
> > > >  #include <linux/proc_fs.h>
> > > >  #include <linux/seq_file.h>
> > > > +#include <linux/set_memory.h>
> > > >  #include <linux/debugobjects.h>
> > > >  #include <linux/kallsyms.h>
> > > >  #include <linux/list.h>
> > > > @@ -38,6 +39,11 @@
> > > > 
> > > >  #include "internal.h"
> > > > 
> > > > +struct vfree_work {
> > > > +       struct llist_node node;
> > > > +       void *addr;
> > > > +};
> > > > +
> > > >  struct vfree_deferred {
> > > >         struct llist_head list;
> > > >         struct work_struct wq;
> > > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> > > >  {
> > > >         struct vfree_deferred *p = container_of(w, struct
> > > > vfree_deferred,
> > > > wq);
> > > >         struct llist_node *t, *llnode;
> > > > +       struct vfree_work *cur;
> > > > 
> > > > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > > > -               __vunmap((void *)llnode, 1);
> > > > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > > > +               cur = container_of(llnode, struct vfree_work, node);
> > > > +               __vunmap(cur->addr, 1);
> > > > +               kfree(cur);
> > > > +       }
> > > >  }
> > > > 
> > > >  /*** Page table manipulation functions ***/
> > > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void
> > > > *addr)
> > > >         return NULL;
> > > >  }
> > > > 
> > > > +/*
> > > > + * This function handles unmapping and resetting the direct map as
> > > > efficiently
> > > > + * as it can with cross arch functions. The three categories of
> > > > architectures
> > > > + * are:
> > > > + *   1. Architectures with no set_memory implementations and no direct
> > > > map
> > > > + *      permissions.
> > > > + *   2. Architectures with set_memory implementations but no direct map
> > > > + *      permissions
> > > > + *   3. Architectures with set_memory implementations and direct map
> > > > permissions
> > > > + */
> > > > +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
> > > 
> > > My general preference is to avoid __weak functions -- they don't
> > > optimize well.  Instead, I prefer either:
> > > 
> > > #ifndef arch_vunmap
> > > void arch_vunmap(...);
> > > #endif
> > > 
> > > or
> > > 
> > > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > > ...
> > > #endif
> > 
> > Ok.
> > > 
> > > > +{
> > > > +       unsigned long addr = (unsigned long)area->addr;
> > > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > > +
> > > > +       /*
> > > > +        * In case of 2 and 3, use this general way of resetting the
> > > > permissions
> > > > +        * on the directmap. Do NX before RW, in case of X, so there is
> > > > no
> > > > W^X
> > > > +        * violation window.
> > > > +        *
> > > > +        * For case 1 these will be noops.
> > > > +        */
> > > > +       if (immediate)
> > > > +               set_memory_nx(addr, area->nr_pages);
> > > > +       if (deallocate_pages && special)
> > > > +               set_memory_rw(addr, area->nr_pages);
> > > 
> > > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> > > want that alias gone before any deallocation happens".
> > > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> > > me, please".  deallocate means "this was vfree -- please free the
> > > pages".  I'm not convinced that all the various combinations make
> > > sense.  Do we really need both flags?
> > 
> > VM_HAS_SPECIAL_PERMS is supposed to mean, like you said, "reset the direct
> > map".
> > Where VM_IMMEDIATE_UNMAP means, the vmalloc allocation has extra capabilties
> > where we don't want to leave an enhanced capability TLB entry to the freed
> > page.
> > 
> > I was trying to pick names that could apply more generally for potential
> > future
> > special memory capabilities. Today VM_HAS_SPECIAL_PERMS does just mean reset
> > write to the directmap and VM_IMMEDIATE_UNMAP means vmalloc mapping is
> > executable.
> > 
> > A present day reason for keeping both flags is, it is more efficient in the
> > arch-agnostic implementation when freeing memory that is just RO and not
> > executable. It saves a TLB flush.
> > 
> > > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> > > not immediate.)
> > 
> > True, maybe VM_MUST_FLUSH or something else?
> > 
> > > If we do keep both flags, maybe some restructuring would make sense,
> > > like this, perhaps.  Sorry about horrible whitespace damage.
> > > 
> > > if (special) {
> > >   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
> > >   WARN_ON_ONCE(!deallocate_pages);
> > > 
> > >   if (immediate) {
> > >     /* It's possible that the vmap alias is X and we're about to make
> > > the direct map RW.  To avoid a window where executable memory is
> > > writable, first mark the vmap alias NX.  This is silly, since we're
> > > about to *unmap* it, but this is the best we can do if all we have to
> > > work with is the set_memory_abc() APIs.  Architectures should override
> > > this whole function to get better behavior. */
> > >     set_memory_nx(...);
> > >   }
> > > 
> > >   set_memory_rw(addr, area->nr_pages);
> > > }
> > 
> > Ok.
> > 
> > > 
> > > > +
> > > > +       /* Always actually remove the area */
> > > > +       remove_vm_area(area->addr);
> > > > +
> > > > +       /*
> > > > +        * Need to flush the TLB before freeing pages in the case of
> > > > this
> > > > flag.
> > > > +        * As long as that's happening, unmap aliases.
> > > > +        *
> > > > +        * For 2 and 3, this will not be needed because of the
> > > > set_memory_nx
> > > > +        * above, because the stale TLBs will be NX.
> > > 
> > > I'm not sure I agree with this comment.  If the caller asked for an
> > > immediate unmap, we should give an immediate unmap.  But I'm still not
> > > sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
> > 
> > Yea. I was just trying to save a TLB flush, since for today's callers that
> > have
> > set_memory there isn't a security downside I know of to just leaving it NX.
> > Maybe its not worth the tradeoff of confusion? Or I can clarify that in the
> > comment.
> 
> Don't both of the users in your series set both flags, though?  My
> real objection to having them be separate is that, in the absence of
> users, it's less clear exactly what they should do and the code
> doesn't get exercised.
The only "just RO" user today is one of the BPF allocations. I don't have a
strong objection to combining them, just explaining the thinking. I guess if we
could always add another flag later if it becomes more needed.

> If you document that VM_IMMEDIATE_UNMAP means "I want the TLB entries
> gone", then I can re-review the code in light of that.  But then I'm
> unconvinced by your generic implementation, since set_memory_nx()
> seems like an odd way to go about it.
Masami Hiramatsu pointed out that if we don't do set_memory_nx before
set_memory_rw, there will be a small window of W^X violation. So that was the
concern for the executable case, regardless of the semantics. I think the
concern applies to any "special capability" permissions. Alternatively, if we
remove_vm_area before resetting the direct map perms to RW, maybe that would
accomplish the same thing, if that's possible in a cross-arch way. Maybe this
is too much designing for a hypothetical future... I was just trying to avoid
having to change the interface, so we could just update the generic
implementation if new permissions or usages come up.

The set_memory_ stuff is really only needed for arm64, which seems to be the
only other architecture with directmap permissions. So if it eventually had its
own arch_vunmap, then all of the set_memory_ parts could be dropped, and the
default would go back to the simple unmap-then-flush logic it was originally.

Or we take up to three flushes in the generic version and meet both the name
expectations and the functionality needed today. I guess I'll just try that.
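
i.e., roughly this shape for the generic fallback (untested sketch, always
flushing when the flag asks for it):

void arch_vunmap(struct vm_struct *area, int deallocate_pages)
{
	unsigned long addr = (unsigned long)area->addr;
	int immediate = area->flags & VM_IMMEDIATE_UNMAP;
	int special = area->flags & VM_HAS_SPECIAL_PERMS;

	if (immediate)
		set_memory_nx(addr, area->nr_pages);	/* flush 1 */
	if (deallocate_pages && special)
		set_memory_rw(addr, area->nr_pages);	/* flush 2 */

	remove_vm_area(area->addr);

	if (immediate)
		vm_unmap_aliases();			/* flush 3 */
}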
> > 
> > > > +        */
> > > > +       if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> > > > +               vm_unmap_aliases();
> > > > +}
> > > > +
> > > >  static void __vunmap(const void *addr, int deallocate_pages)
> > > >  {
> > > >         struct vm_struct *area;
> > > > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int
> > > > deallocate_pages)
> > > >         debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
> > > >         debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
> > > > 
> > > > -       remove_vm_area(addr);
> > > > +       arch_vunmap(area, deallocate_pages);
> > > > +
> > > >         if (deallocate_pages) {
> > > >                 int i;
> > > > 
> > > > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void
> > > > *addr)
> > > >          * nother cpu's list.  schedule_work() should be fine with this
> > > > too.
> > > >          */
> > > >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > > > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work),
> > > > GFP_ATOMIC);
> > > > +
> > > > +       /* If no memory for the deferred list node, give up */
> > > > +       if (!w)
> > > > +               return;
> > > 
> > > That's nasty.  I see what you're trying to do here, but I think you're
> > > solving a problem that doesn't need solving quite so urgently.  How
> > > about dropping this part and replacing it with a comment like "NB:
> > > this writes a word to a potentially executable address.  It would be
> > > nice if we could avoid doing this."  And maybe a future patch could
> > > more robustly avoid it without risking memory leaks.
> > 
> > Yea, sorry I should have called this out, because I wasn't sure on how
> > likely
> > that was to happen. I did find some other places in the kernel with the same
> > ignoring logic.
> > 
> > I'll have to think though, I am not sure what the alternative is. Since the
> > memory can be RO in the module_memfree case, the old method of re-using the
> > allocation will no longer work. The list node could be stuffed on the
> > vm_struct,
> > but then the all of the spin_lock(&vmap_area_lock)'s need to be changed to
> > work
> > with interrupts so that the struct could be looked up. Not sure of the
> > implications of that. Or maybe have some slow backup that resets the
> > permissions
> > and re-uses the allocation if kmalloc fails?
> > 
> > I guess it could also go back to the old v1 implementation that doesn't
> > handle
> > RO and the directmap, and leave the W^X violation window during teardown.
> > Then
> > solve that problem when modules are loaded via something like Nadav's stuff.
> > 
> 
> Hmm.  Switching to spin_lock_irqsave() doesn't seem so bad to me.
Ok.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/4] modules: Add new special vfree flags
  2018-12-12  0:03 ` [PATCH v2 2/4] modules: Add new special vfree flags Rick Edgecombe
@ 2018-12-12 23:40   ` Nadav Amit
  2018-12-13 19:02     ` Edgecombe, Rick P
  0 siblings, 1 reply; 26+ messages in thread
From: Nadav Amit @ 2018-12-12 23:40 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andrew Morton, Andy Lutomirski, Will Deacon, Linux-MM, LKML,
	Kernel Hardening, Naveen N . Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Steven Rostedt, Ingo Molnar,
	Alexei Starovoitov, Daniel Borkmann, Jessica Yu,
	Network Development, Ard Biesheuvel, Jann Horn,
	Kristen Carlson Accardi

> On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> 
> Add new flags for handling freeing of special permissioned memory in vmalloc,
> and remove places where the handling was done in module.c.
> 
> This will enable this flag for all architectures.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> kernel/module.c | 43 ++++++++++++-------------------------------
> 1 file changed, 12 insertions(+), 31 deletions(-)
> 

I count on you for merging your patch-set with mine, since clearly they
conflict.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/4] modules: Add new special vfree flags
  2018-12-12 23:40   ` Nadav Amit
@ 2018-12-13 19:02     ` Edgecombe, Rick P
  2018-12-13 19:27       ` Nadav Amit
  0 siblings, 1 reply; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-13 19:02 UTC (permalink / raw)
  To: namit
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, luto, kernel-hardening, Keshavamurthy, Anil S

On Wed, 2018-12-12 at 23:40 +0000, Nadav Amit wrote:
> > On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > wrote:
> > 
> > Add new flags for handling freeing of special permissioned memory in
> > vmalloc,
> > and remove places where the handling was done in module.c.
> > 
> > This will enable this flag for all architectures.
> > 
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > kernel/module.c | 43 ++++++++++++-------------------------------
> > 1 file changed, 12 insertions(+), 31 deletions(-)
> > 
> 
> I count on you for merging your patch-set with mine, since clearly they
> conflict.
> 
Yes, I can rebase on top of yours if you omit the changes around module_memfree 
for your next version. It should fit together pretty cleanly for BPF and modules
I think. Not sure what you are planning for kprobes and ftrace.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/4] modules: Add new special vfree flags
  2018-12-13 19:02     ` Edgecombe, Rick P
@ 2018-12-13 19:27       ` Nadav Amit
  2018-12-13 21:48         ` Edgecombe, Rick P
  0 siblings, 1 reply; 26+ messages in thread
From: Nadav Amit @ 2018-12-13 19:27 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, luto, kernel-hardening, Keshavamurthy, Anil S

> On Dec 13, 2018, at 11:02 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2018-12-12 at 23:40 +0000, Nadav Amit wrote:
>>> On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> wrote:
>>> 
>>> Add new flags for handling freeing of special permissioned memory in
>>> vmalloc,
>>> and remove places where the handling was done in module.c.
>>> 
>>> This will enable this flag for all architectures.
>>> 
>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> ---
>>> kernel/module.c | 43 ++++++++++++-------------------------------
>>> 1 file changed, 12 insertions(+), 31 deletions(-)
>> 
>> I count on you for merging your patch-set with mine, since clearly they
>> conflict.
> Yes, I can rebase on top of yours if you omit the changes around module_memfree 
> for your next version. It should fit together pretty cleanly for BPF and modules
> I think. Not sure what you are planning for kprobes and ftrace.

Are you asking after looking at the latest version of my patch-set?

Kprobes is done and ack'd. ftrace needs to be broken into two separate
changes (setting x after writing, and using text_poke interfaces), unless
Steven ack’s them. The changes introduce some overhead (3x), but I think it
is a reasonable slowdown for a debug feature.

Can you have a look at the series I’ve sent and let me know which patches
to drop? It would be best (for me) if the two series are fully merged.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/4] modules: Add new special vfree flags
  2018-12-13 19:27       ` Nadav Amit
@ 2018-12-13 21:48         ` Edgecombe, Rick P
  0 siblings, 0 replies; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-13 21:48 UTC (permalink / raw)
  To: namit
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, luto, kernel-hardening, Keshavamurthy, Anil S

On Thu, 2018-12-13 at 19:27 +0000, Nadav Amit wrote:
> > On Dec 13, 2018, at 11:02 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> > 
> > On Wed, 2018-12-12 at 23:40 +0000, Nadav Amit wrote:
> > > > On Dec 11, 2018, at 4:03 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > wrote:
> > > > 
> > > > Add new flags for handling freeing of special permissioned memory in
> > > > vmalloc,
> > > > and remove places where the handling was done in module.c.
> > > > 
> > > > This will enable this flag for all architectures.
> > > > 
> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > ---
> > > > kernel/module.c | 43 ++++++++++++-------------------------------
> > > > 1 file changed, 12 insertions(+), 31 deletions(-)
> > > 
> > > I count on you for merging your patch-set with mine, since clearly they
> > > conflict.
> > 
> > Yes, I can rebase on top of yours if you omit the changes around
> > module_memfree 
> > for your next version. It should fit together pretty cleanly for BPF and
> > modules
> > I think. Not sure what you are planning for kprobes and ftrace.
> 
> Are you asking after looking at the latest version of my patch-set?
> 
> Kprobes is done and ack'd. ftrace needs to be broken into two separate
> changes (setting x after writing, and using text_poke interfaces), unless
> Steven ack’s them. The changes introduce some overhead (3x), but I think it
> is a reasonable slowdown for a debug feature.
> 
> Can you have a look at the series I’ve sent and let me know which patches
> to drop? It would be best (for me) if the two series are fully merged.

Looking at v7, the only extra thing would be that the tweaks to the existing
lines in kprobes' free_insn_page could be omitted, since those lines would just
be removed in my later changes.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12 22:01         ` Edgecombe, Rick P
@ 2018-12-15 18:52           ` Andy Lutomirski
  2018-12-18  0:23             ` Edgecombe, Rick P
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2018-12-15 18:52 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: luto, linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, namit, kernel-hardening,

On Wed, Dec 12, 2018 at 2:01 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2018-12-12 at 11:57 -0800, Andy Lutomirski wrote:
> > On Wed, Dec 12, 2018 at 11:50 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2018-12-11 at 18:20 -0800, Andy Lutomirski wrote:
> > > > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > > > <rick.p.edgecombe@intel.com> wrote:
> > > > >
> > > > > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> > > > > enabling vfree operations to immediately clear executable TLB entries to
> > > > > freed
> > > > > pages, and handle freeing memory with special permissions.
> > > > >
> > > > > In order to support vfree being called on memory that might be RO, the
> > > > > vfree
> > > > > deferred list node is moved to a kmalloc allocated struct, from where it
> > > > > is
> > > > > today, reusing the allocation being freed.
> > > > >
> > > > > arch_vunmap is a new __weak function that implements the actual
> > > > > unmapping
> > > > > and
> > > > > resetting of the direct map permissions. It can be overridden by more
> > > > > efficient
> > > > > architecture specific implementations.
> > > > >
> > > > > For the default implementation, it uses architecture agnostic methods
> > > > > which
> > > > > are
> > > > > equivalent to what most usages do before calling vfree. So now it is
> > > > > just
> > > > > centralized here.
> > > > >
> > > > > This implementation derives from two sketches from Dave Hansen and Andy
> > > > > Lutomirski.
> > > > >
> > > > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > ---
> > > > >  include/linux/vmalloc.h |  2 ++
> > > > >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
> > > > >  2 files changed, 69 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > > > index 398e9c95cd61..872bcde17aca 100644
> > > > > --- a/include/linux/vmalloc.h
> > > > > +++ b/include/linux/vmalloc.h
> > > > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h
> > > > > */
> > > > >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not
> > > > > fully
> > > > > initialized */
> > > > >  #define VM_NO_GUARD            0x00000040      /* don't add guard page
> > > > > */
> > > > >  #define VM_KASAN               0x00000080      /* has allocated kasan
> > > > > shadow memory */
> > > > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before
> > > > > releasing
> > > > > pages */
> > > > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with
> > > > > special
> > > > > perms */
> > > > >  /* bits [20..32] reserved for arch specific ioremap internals */
> > > > >
> > > > >  /*
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 97d4b25d0373..02b284d2245a 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -18,6 +18,7 @@
> > > > >  #include <linux/interrupt.h>
> > > > >  #include <linux/proc_fs.h>
> > > > >  #include <linux/seq_file.h>
> > > > > +#include <linux/set_memory.h>
> > > > >  #include <linux/debugobjects.h>
> > > > >  #include <linux/kallsyms.h>
> > > > >  #include <linux/list.h>
> > > > > @@ -38,6 +39,11 @@
> > > > >
> > > > >  #include "internal.h"
> > > > >
> > > > > +struct vfree_work {
> > > > > +       struct llist_node node;
> > > > > +       void *addr;
> > > > > +};
> > > > > +
> > > > >  struct vfree_deferred {
> > > > >         struct llist_head list;
> > > > >         struct work_struct wq;
> > > > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> > > > >  {
> > > > >         struct vfree_deferred *p = container_of(w, struct
> > > > > vfree_deferred,
> > > > > wq);
> > > > >         struct llist_node *t, *llnode;
> > > > > +       struct vfree_work *cur;
> > > > >
> > > > > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > > > > -               __vunmap((void *)llnode, 1);
> > > > > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > > > > +               cur = container_of(llnode, struct vfree_work, node);
> > > > > +               __vunmap(cur->addr, 1);
> > > > > +               kfree(cur);
> > > > > +       }
> > > > >  }
> > > > >
> > > > >  /*** Page table manipulation functions ***/
> > > > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void
> > > > > *addr)
> > > > >         return NULL;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * This function handles unmapping and resetting the direct map as
> > > > > efficiently
> > > > > + * as it can with cross arch functions. The three categories of
> > > > > architectures
> > > > > + * are:
> > > > > + *   1. Architectures with no set_memory implementations and no direct
> > > > > map
> > > > > + *      permissions.
> > > > > + *   2. Architectures with set_memory implementations but no direct map
> > > > > + *      permissions
> > > > > + *   3. Architectures with set_memory implementations and direct map
> > > > > permissions
> > > > > + */
> > > > > +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
> > > >
> > > > My general preference is to avoid __weak functions -- they don't
> > > > optimize well.  Instead, I prefer either:
> > > >
> > > > #ifndef arch_vunmap
> > > > void arch_vunmap(...);
> > > > #endif
> > > >
> > > > or
> > > >
> > > > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > > > ...
> > > > #endif
> > >
> > > Ok.
> > > >
> > > > > +{
> > > > > +       unsigned long addr = (unsigned long)area->addr;
> > > > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > > > +
> > > > > +       /*
> > > > > +        * In case of 2 and 3, use this general way of resetting the
> > > > > permissions
> > > > > +        * on the directmap. Do NX before RW, in case of X, so there is
> > > > > no
> > > > > W^X
> > > > > +        * violation window.
> > > > > +        *
> > > > > +        * For case 1 these will be noops.
> > > > > +        */
> > > > > +       if (immediate)
> > > > > +               set_memory_nx(addr, area->nr_pages);
> > > > > +       if (deallocate_pages && special)
> > > > > +               set_memory_rw(addr, area->nr_pages);
> > > >
> > > > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> > > > want that alias gone before any deallocation happens".
> > > > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> > > > me, please".  deallocate means "this was vfree -- please free the
> > > > pages".  I'm not convinced that all the various combinations make
> > > > sense.  Do we really need both flags?
> > >
> > > VM_HAS_SPECIAL_PERMS is supposed to mean, like you said, "reset the direct
> > > map".
> > > Where VM_IMMEDIATE_UNMAP means, the vmalloc allocation has extra capabilties
> > > where we don't want to leave an enhanced capability TLB entry to the freed
> > > page.
> > >
> > > I was trying to pick names that could apply more generally for potential
> > > future
> > > special memory capabilities. Today VM_HAS_SPECIAL_PERMS does just mean reset
> > > write to the directmap and VM_IMMEDIATE_UNMAP means vmalloc mapping is
> > > executable.
> > >
> > > A present day reason for keeping both flags is, it is more efficient in the
> > > arch-agnostic implementation when freeing memory that is just RO and not
> > > executable. It saves a TLB flush.
> > >
> > > > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> > > > not immediate.)
> > >
> > > True, maybe VM_MUST_FLUSH or something else?
> > >
> > > > If we do keep both flags, maybe some restructuring would make sense,
> > > > like this, perhaps.  Sorry about horrible whitespace damage.
> > > >
> > > > if (special) {
> > > >   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
> > > >   WARN_ON_ONCE(!deallocate_pages);
> > > >
> > > >   if (immediate) {
> > > >     /* It's possible that the vmap alias is X and we're about to make
> > > > the direct map RW.  To avoid a window where executable memory is
> > > > writable, first mark the vmap alias NX.  This is silly, since we're
> > > > about to *unmap* it, but this is the best we can do if all we have to
> > > > work with is the set_memory_abc() APIs.  Architectures should override
> > > > this whole function to get better behavior. */
> > > >     set_memory_nx(...);
> > > >   }
> > > >
> > > >   set_memory_rw(addr, area->nr_pages);
> > > > }
> > >
> > > Ok.
> > >
> > > >
> > > > > +
> > > > > +       /* Always actually remove the area */
> > > > > +       remove_vm_area(area->addr);
> > > > > +
> > > > > +       /*
> > > > > +        * Need to flush the TLB before freeing pages in the case of
> > > > > this
> > > > > flag.
> > > > > +        * As long as that's happening, unmap aliases.
> > > > > +        *
> > > > > +        * For 2 and 3, this will not be needed because of the
> > > > > set_memory_nx
> > > > > +        * above, because the stale TLBs will be NX.
> > > >
> > > > I'm not sure I agree with this comment.  If the caller asked for an
> > > > immediate unmap, we should give an immediate unmap.  But I'm still not
> > > > sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
> > >
> > > Yea. I was just trying to save a TLB flush, since for today's callers that
> > > have
> > > set_memory there isn't a security downside I know of to just leaving it NX.
> > > Maybe its not worth the tradeoff of confusion? Or I can clarify that in the
> > > comment.
> >
> > Don't both of the users in your series set both flags, though?  My
> > real objection to having them be separate is that, in the absence of
> > users, it's less clear exactly what they should do and the code
> > doesn't get exercised.
> The only "just RO" user today is one of the BPF allocations. I don't have a
> strong objection to combining them, just explaining the thinking. I guess if we
> could always add another flag later if it becomes more needed.
>
> > If you document that VM_IMMEDIATE_UNMAP means "I want the TLB entries
> > gone", then I can re-review the code in light of that.  But then I'm
> > unconvinced by your generic implementation, since set_memory_nx()
> > seems like an odd way to go about it.
> Masami Hiramatsu pointed out that if we don't do set_memory_nx before
> set_memory_rw, there will be a small window of W^X violation. So that was the
> concern for the executable case, regardless of the semantics. I think the
> concern applies to any "special capability" permissions. Alternatively, if we
> remove_vm_area before resetting the direct map perms to RW, maybe that would
> accomplish the same thing, if that's possible in a cross-arch way. Maybe this
> is too much designing for a hypothetical future... I was just trying to avoid
> having to change the interface, so we could just update the generic
> implementation if new permissions or usages come up.
>
> The set_memory_ stuff is really only needed for arm64, which seems to be the
> only other architecture with directmap permissions. So if it eventually had its
> own arch_vunmap, then all of the set_memory_ parts could be dropped, and the
> default would go back to the simple unmap-then-flush logic it was originally.

I think that's probably the best solution.  If there are only two
arches that have anything fancy here, let's just fix both of them up
for real.

>
> Or we take up to three flushes in the generic version and meet both the name
> expectations and the functionality needed today. I guess I'll just try that.
> > >
> > > > > +        */
> > > > > +       if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> > > > > +               vm_unmap_aliases();
> > > > > +}
> > > > > +
> > > > >  static void __vunmap(const void *addr, int deallocate_pages)
> > > > >  {
> > > > >         struct vm_struct *area;
> > > > > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int
> > > > > deallocate_pages)
> > > > >         debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
> > > > >         debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
> > > > >
> > > > > -       remove_vm_area(addr);
> > > > > +       arch_vunmap(area, deallocate_pages);
> > > > > +
> > > > >         if (deallocate_pages) {
> > > > >                 int i;
> > > > >
> > > > > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void
> > > > > *addr)
> > > > >          * nother cpu's list.  schedule_work() should be fine with this
> > > > > too.
> > > > >          */
> > > > >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > > > > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work),
> > > > > GFP_ATOMIC);
> > > > > +
> > > > > +       /* If no memory for the deferred list node, give up */
> > > > > +       if (!w)
> > > > > +               return;
> > > >
> > > > That's nasty.  I see what you're trying to do here, but I think you're
> > > > solving a problem that doesn't need solving quite so urgently.  How
> > > > about dropping this part and replacing it with a comment like "NB:
> > > > this writes a word to a potentially executable address.  It would be
> > > > nice if we could avoid doing this."  And maybe a future patch could
> > > > more robustly avoid it without risking memory leaks.
> > >
> > > Yea, sorry I should have called this out, because I wasn't sure on how
> > > likely
> > > that was to happen. I did find some other places in the kernel with the same
> > > ignoring logic.
> > >
> > > I'll have to think though, I am not sure what the alternative is. Since the
> > > memory can be RO in the module_memfree case, the old method of re-using the
> > > allocation will no longer work. The list node could be stuffed on the
> > > vm_struct,
> > > but then the all of the spin_lock(&vmap_area_lock)'s need to be changed to
> > > work
> > > with interrupts so that the struct could be looked up. Not sure of the
> > > implications of that. Or maybe have some slow backup that resets the
> > > permissions
> > > and re-uses the allocation if kmalloc fails?
> > >
> > > I guess it could also go back to the old v1 implementation that doesn't
> > > handle
> > > RO and the directmap, and leave the W^X violation window during teardown.
> > > Then
> > > solve that problem when modules are loaded via something like Nadav's stuff.
> > >
> >
> > Hmm.  Switching to spin_lock_irqsave() doesn't seem so bad to me.
> Ok.

Actually, I think I have a better solution.  Just declare the
problematic case to be illegal: say that you may not free memory with
the new flags set while IRQs are off.  Enforce this with a VM_WARN_ON
in the code that reads the vfree_deferred list.
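
For illustration, a minimal sketch of what that enforcement could look like
(the exact placement in the vfree path is an assumption, not from the posted
patches; the flag names are from this series):

	/*
	 * Hypothetical check: memory carrying the new flags may not be
	 * freed with IRQs off, so the deferred (interrupt/atomic) path
	 * never has to handle special-permission allocations.
	 */
	VM_WARN_ON_ONCE(irqs_disabled() &&
			(area->flags & (VM_IMMEDIATE_UNMAP | VM_HAS_SPECIAL_PERMS)));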

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-15 18:52           ` Andy Lutomirski
@ 2018-12-18  0:23             ` Edgecombe, Rick P
  2018-12-18  1:02               ` Andy Lutomirski
  0 siblings, 1 reply; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-18  0:23 UTC (permalink / raw)
  To: luto
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, namit, kernel-hardening, Keshavamurthy, Anil S

On Sat, 2018-12-15 at 10:52 -0800, Andy Lutomirski wrote:
> On Wed, Dec 12, 2018 at 2:01 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Wed, 2018-12-12 at 11:57 -0800, Andy Lutomirski wrote:
> > > On Wed, Dec 12, 2018 at 11:50 AM Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> > > > 
> > > > On Tue, 2018-12-11 at 18:20 -0800, Andy Lutomirski wrote:
> > > > > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > > > > <rick.p.edgecombe@intel.com> wrote:
> > > > > > 
> > > > > > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS,
> > > > > > for
> > > > > > enabling vfree operations to immediately clear executable TLB
> > > > > > entries to
> > > > > > freed
> > > > > > pages, and handle freeing memory with special permissions.
> > > > > > 
> > > > > > In order to support vfree being called on memory that might be RO,
> > > > > > the
> > > > > > vfree
> > > > > > deferred list node is moved to a kmalloc allocated struct, from
> > > > > > where it
> > > > > > is
> > > > > > today, reusing the allocation being freed.
> > > > > > 
> > > > > > arch_vunmap is a new __weak function that implements the actual
> > > > > > unmapping
> > > > > > and
> > > > > > resetting of the direct map permissions. It can be overridden by
> > > > > > more
> > > > > > efficient
> > > > > > architecture specific implementations.
> > > > > > 
> > > > > > For the default implementation, it uses architecture agnostic
> > > > > > methods
> > > > > > which
> > > > > > are
> > > > > > equivalent to what most usages do before calling vfree. So now it is
> > > > > > just
> > > > > > centralized here.
> > > > > > 
> > > > > > This implementation derives from two sketches from Dave Hansen and
> > > > > > Andy
> > > > > > Lutomirski.
> > > > > > 
> > > > > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > > > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > > > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > ---
> > > > > >  include/linux/vmalloc.h |  2 ++
> > > > > >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++-
> > > > > > ---
> > > > > >  2 files changed, 69 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > > > > index 398e9c95cd61..872bcde17aca 100644
> > > > > > --- a/include/linux/vmalloc.h
> > > > > > +++ b/include/linux/vmalloc.h
> > > > > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in
> > > > > > notifier.h
> > > > > > */
> > > > > >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not
> > > > > > fully
> > > > > > initialized */
> > > > > >  #define VM_NO_GUARD            0x00000040      /* don't add guard
> > > > > > page
> > > > > > */
> > > > > >  #define VM_KASAN               0x00000080      /* has allocated
> > > > > > kasan
> > > > > > shadow memory */
> > > > > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before
> > > > > > releasing
> > > > > > pages */
> > > > > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with
> > > > > > special
> > > > > > perms */
> > > > > >  /* bits [20..32] reserved for arch specific ioremap internals */
> > > > > > 
> > > > > >  /*
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 97d4b25d0373..02b284d2245a 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -18,6 +18,7 @@
> > > > > >  #include <linux/interrupt.h>
> > > > > >  #include <linux/proc_fs.h>
> > > > > >  #include <linux/seq_file.h>
> > > > > > +#include <linux/set_memory.h>
> > > > > >  #include <linux/debugobjects.h>
> > > > > >  #include <linux/kallsyms.h>
> > > > > >  #include <linux/list.h>
> > > > > > @@ -38,6 +39,11 @@
> > > > > > 
> > > > > >  #include "internal.h"
> > > > > > 
> > > > > > +struct vfree_work {
> > > > > > +       struct llist_node node;
> > > > > > +       void *addr;
> > > > > > +};
> > > > > > +
> > > > > >  struct vfree_deferred {
> > > > > >         struct llist_head list;
> > > > > >         struct work_struct wq;
> > > > > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> > > > > >  {
> > > > > >         struct vfree_deferred *p = container_of(w, struct
> > > > > > vfree_deferred,
> > > > > > wq);
> > > > > >         struct llist_node *t, *llnode;
> > > > > > +       struct vfree_work *cur;
> > > > > > 
> > > > > > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > > > > > -               __vunmap((void *)llnode, 1);
> > > > > > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > > > > > +               cur = container_of(llnode, struct vfree_work, node);
> > > > > > +               __vunmap(cur->addr, 1);
> > > > > > +               kfree(cur);
> > > > > > +       }
> > > > > >  }
> > > > > > 
> > > > > >  /*** Page table manipulation functions ***/
> > > > > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void
> > > > > > *addr)
> > > > > >         return NULL;
> > > > > >  }
> > > > > > 
> > > > > > +/*
> > > > > > + * This function handles unmapping and resetting the direct map as
> > > > > > efficiently
> > > > > > + * as it can with cross arch functions. The three categories of
> > > > > > architectures
> > > > > > + * are:
> > > > > > + *   1. Architectures with no set_memory implementations and no
> > > > > > direct
> > > > > > map
> > > > > > + *      permissions.
> > > > > > + *   2. Architectures with set_memory implementations but no direct
> > > > > > map
> > > > > > + *      permissions
> > > > > > + *   3. Architectures with set_memory implementations and direct
> > > > > > map
> > > > > > permissions
> > > > > > + */
> > > > > > +void __weak arch_vunmap(struct vm_struct *area, int
> > > > > > deallocate_pages)
> > > > > 
> > > > > My general preference is to avoid __weak functions -- they don't
> > > > > optimize well.  Instead, I prefer either:
> > > > > 
> > > > > #ifndef arch_vunmap
> > > > > void arch_vunmap(...);
> > > > > #endif
> > > > > 
> > > > > or
> > > > > 
> > > > > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > > > > ...
> > > > > #endif
> > > > 
> > > > Ok.
> > > > > 
> > > > > > +{
> > > > > > +       unsigned long addr = (unsigned long)area->addr;
> > > > > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > > > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * In case of 2 and 3, use this general way of resetting the
> > > > > > permissions
> > > > > > +        * on the directmap. Do NX before RW, in case of X, so there
> > > > > > is
> > > > > > no
> > > > > > W^X
> > > > > > +        * violation window.
> > > > > > +        *
> > > > > > +        * For case 1 these will be noops.
> > > > > > +        */
> > > > > > +       if (immediate)
> > > > > > +               set_memory_nx(addr, area->nr_pages);
> > > > > > +       if (deallocate_pages && special)
> > > > > > +               set_memory_rw(addr, area->nr_pages);
> > > > > 
> > > > > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> > > > > want that alias gone before any deallocation happens".
> > > > > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> > > > > me, please".  deallocate means "this was vfree -- please free the
> > > > > pages".  I'm not convinced that all the various combinations make
> > > > > sense.  Do we really need both flags?
> > > > 
> > > > VM_HAS_SPECIAL_PERMS is supposed to mean, like you said, "reset the
> > > > direct
> > > > map".
> > > > Where VM_IMMEDIATE_UNMAP means, the vmalloc allocation has extra
> > > > capabilties
> > > > where we don't want to leave an enhanced capability TLB entry to the
> > > > freed
> > > > page.
> > > > 
> > > > I was trying to pick names that could apply more generally for potential
> > > > future
> > > > special memory capabilities. Today VM_HAS_SPECIAL_PERMS does just mean
> > > > reset
> > > > write to the directmap and VM_IMMEDIATE_UNMAP means vmalloc mapping is
> > > > executable.
> > > > 
> > > > A present day reason for keeping both flags is, it is more efficient in
> > > > the
> > > > arch-agnostic implementation when freeing memory that is just RO and not
> > > > executable. It saves a TLB flush.
> > > > 
> > > > > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> > > > > not immediate.)
> > > > 
> > > > True, maybe VM_MUST_FLUSH or something else?
> > > > 
> > > > > If we do keep both flags, maybe some restructuring would make sense,
> > > > > like this, perhaps.  Sorry about horrible whitespace damage.
> > > > > 
> > > > > if (special) {
> > > > >   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages.
> > > > > */
> > > > >   WARN_ON_ONCE(!deallocate_pages);
> > > > > 
> > > > >   if (immediate) {
> > > > >     /* It's possible that the vmap alias is X and we're about to make
> > > > > the direct map RW.  To avoid a window where executable memory is
> > > > > writable, first mark the vmap alias NX.  This is silly, since we're
> > > > > about to *unmap* it, but this is the best we can do if all we have to
> > > > > work with is the set_memory_abc() APIs.  Architectures should override
> > > > > this whole function to get better behavior. */
> > > > >     set_memory_nx(...);
> > > > >   }
> > > > > 
> > > > >   set_memory_rw(addr, area->nr_pages);
> > > > > }
> > > > 
> > > > Ok.
> > > > 
> > > > > 
> > > > > > +
> > > > > > +       /* Always actually remove the area */
> > > > > > +       remove_vm_area(area->addr);
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Need to flush the TLB before freeing pages in the case of
> > > > > > this
> > > > > > flag.
> > > > > > +        * As long as that's happening, unmap aliases.
> > > > > > +        *
> > > > > > +        * For 2 and 3, this will not be needed because of the
> > > > > > set_memory_nx
> > > > > > +        * above, because the stale TLBs will be NX.
> > > > > 
> > > > > I'm not sure I agree with this comment.  If the caller asked for an
> > > > > immediate unmap, we should give an immediate unmap.  But I'm still not
> > > > > sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
> > > > 
> > > > Yea. I was just trying to save a TLB flush, since for today's callers
> > > > that
> > > > have
> > > > set_memory there isn't a security downside I know of to just leaving it
> > > > NX.
> > > > Maybe its not worth the tradeoff of confusion? Or I can clarify that in
> > > > the
> > > > comment.
> > > 
> > > Don't both of the users in your series set both flags, though?  My
> > > real objection to having them be separate is that, in the absence of
> > > users, it's less clear exactly what they should do and the code
> > > doesn't get exercised.
> > 
> > The only "just RO" user today is one of the BPF allocations. I don't have a
> > strong objection to combining them, just explaining the thinking. I guess if
> > we
> > could always add another flag later if it becomes more needed.
> > 
> > > If you document that VM_IMMEDIATE_UNMAP means "I want the TLB entries
> > > gone", then I can re-review the code in light of that.  But then I'm
> > > unconvinced by your generic implementation, since set_memory_nx()
> > > seems like an odd way to go about it.
> > 
> > Masami Hiramatsu pointed out if we don't do set_memory_nx before
> > set_memory_rw,
> > then there will be a small window of W^X violation. So that was the concern
> > for
> > the executable case, regardless of the semantics. I think the concern
> > applies
> > for any "special capability" permissions. Alternatively, if we
> > remove_vm_area
> > before we reset the direct map perms RW, maybe that would accomplish the
> > same
> > thing, if that's possible in a cross arch way. Maybe this is too much
> > designing
> > for hypothetical future... just was trying to avoid having to change the
> > interface, and could just update the generic implementation if new
> > permissions
> > or usages come up.
> > 
> > The set_memory_ stuff is really only needed for arm64 which seems to be the
> > only
> > other one with directmap permissions. So if it could eventually have its own
> > arch_vunmap then all of the set_memory_ parts could be dropped and the
> > default
> > would just be the simple unmap then flush logic that it was originally.
> 
> I think that's probably the best solution.  If there are only two
> arches that have anything fancy here, let's just fix both of them up
> for real.
> 
> > 
> > Or we have up to three flushes for the generic version and meet the name
> > expectations and needed functionality today. I guess I'll just try that.
> > > > 
> > > > > > +        */
> > > > > > +       if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> > > > > > +               vm_unmap_aliases();
> > > > > > +}
> > > > > > +
> > > > > >  static void __vunmap(const void *addr, int deallocate_pages)
> > > > > >  {
> > > > > >         struct vm_struct *area;
> > > > > > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int
> > > > > > deallocate_pages)
> > > > > >         debug_check_no_locks_freed(area->addr,
> > > > > > get_vm_area_size(area));
> > > > > >         debug_check_no_obj_freed(area->addr,
> > > > > > get_vm_area_size(area));
> > > > > > 
> > > > > > -       remove_vm_area(addr);
> > > > > > +       arch_vunmap(area, deallocate_pages);
> > > > > > +
> > > > > >         if (deallocate_pages) {
> > > > > >                 int i;
> > > > > > 
> > > > > > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const
> > > > > > void
> > > > > > *addr)
> > > > > >          * nother cpu's list.  schedule_work() should be fine with
> > > > > > this
> > > > > > too.
> > > > > >          */
> > > > > >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > > > > > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work),
> > > > > > GFP_ATOMIC);
> > > > > > +
> > > > > > +       /* If no memory for the deferred list node, give up */
> > > > > > +       if (!w)
> > > > > > +               return;
> > > > > 
> > > > > That's nasty.  I see what you're trying to do here, but I think you're
> > > > > solving a problem that doesn't need solving quite so urgently.  How
> > > > > about dropping this part and replacing it with a comment like "NB:
> > > > > this writes a word to a potentially executable address.  It would be
> > > > > nice if we could avoid doing this."  And maybe a future patch could
> > > > > more robustly avoid it without risking memory leaks.
> > > > 
> > > > Yea, sorry I should have called this out, because I wasn't sure on how
> > > > likely
> > > > that was to happen. I did find some other places in the kernel with the
> > > > same
> > > > ignoring logic.
> > > > 
> > > > I'll have to think though, I am not sure what the alternative is. Since
> > > > the
> > > > memory can be RO in the module_memfree case, the old method of re-using
> > > > the
> > > > allocation will no longer work. The list node could be stuffed on the
> > > > vm_struct,
> > > > but then the all of the spin_lock(&vmap_area_lock)'s need to be changed
> > > > to
> > > > work
> > > > with interrupts so that the struct could be looked up. Not sure of the
> > > > implications of that. Or maybe have some slow backup that resets the
> > > > permissions
> > > > and re-uses the allocation if kmalloc fails?
> > > > 
> > > > I guess it could also go back to the old v1 implementation that doesn't
> > > > handle
> > > > RO and the directmap, and leave the W^X violation window during
> > > > teardown.
> > > > Then
> > > > solve that problem when modules are loaded via something like Nadav's
> > > > stuff.
> > > > 
> > > 
> > > Hmm.  Switching to spin_lock_irqsave() doesn't seem so bad to me.
> > 
> > Ok.
> 
> Actually, I think I have a better solution.  Just declare the
> problematic case to be illegal: say that you may not free memory with
> the new flags set while IRQs are off.  Enforce this with a VM_WARN_ON
> in the code that reads the vfree_deferred list.

Thanks. Yeah, just making a rule for the one case seems better than disabling
interrupts all over. It turned out the lock is taken in quite a few places,
including the longish lazy purge operation. Reading a little history on the
deferred free list: six years ago vfree did not support interrupts at all, and
different clients had their own work queues. So this just restores the original
situation for the new vm flag.

I think we only need to move the module init section free from the RCU callback
to a work queue; at that point, functionality-wise, everything should work with
the existing deferred free list implementation (since we will then simply not
use it for the new special memory case). We could also add a
WARN_ON(in_interrupt()) in module_memfree to give callers some context for what
will soon be a "BUG: unable to handle kernel paging request". Any objection to
leaving it there?
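
A minimal sketch of that placement, assuming the __weak default otherwise
stays as it is today:

	void __weak module_memfree(void *module_region)
	{
		/*
		 * Special-perms module memory will no longer be freeable
		 * from interrupt context; warn here so callers get a
		 * backtrace before the eventual paging failure.
		 */
		WARN_ON(in_interrupt());
		vfree(module_region);
	}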

Beyond the above, moving the vfree_deferred list node onto the vm_struct, with
the associated cost of a lookup in every interrupt/atomic vfree, would enable a
WARN and the handling of a (declared) illegal case that could deadlock anyway,
right? Is it worth it?

I'm not sure why we wouldn't have deadlocks in normal interrupt vfrees if we
don't use irq spinlocks everywhere... I may be missing your insight.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-18  0:23             ` Edgecombe, Rick P
@ 2018-12-18  1:02               ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2018-12-18  1:02 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: luto, linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
	linux-mm, jannh, Dock, Deneen T, kristen, akpm, will.deacon,
	mingo, namit, kernel-hardening,

On Mon, Dec 17, 2018 at 4:24 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Sat, 2018-12-15 at 10:52 -0800, Andy Lutomirski wrote:
> > On Wed, Dec 12, 2018 at 2:01 PM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Wed, 2018-12-12 at 11:57 -0800, Andy Lutomirski wrote:
> > > > On Wed, Dec 12, 2018 at 11:50 AM Edgecombe, Rick P
> > > > <rick.p.edgecombe@intel.com> wrote:
> > > > >
> > > > > On Tue, 2018-12-11 at 18:20 -0800, Andy Lutomirski wrote:
> > > > > > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > > > > > <rick.p.edgecombe@intel.com> wrote:
> > > > > > >
> > > > > > > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS,
> > > > > > > for
> > > > > > > enabling vfree operations to immediately clear executable TLB
> > > > > > > entries to
> > > > > > > freed
> > > > > > > pages, and handle freeing memory with special permissions.
> > > > > > >
> > > > > > > In order to support vfree being called on memory that might be RO,
> > > > > > > the
> > > > > > > vfree
> > > > > > > deferred list node is moved to a kmalloc allocated struct, from
> > > > > > > where it
> > > > > > > is
> > > > > > > today, reusing the allocation being freed.
> > > > > > >
> > > > > > > arch_vunmap is a new __weak function that implements the actual
> > > > > > > unmapping
> > > > > > > and
> > > > > > > resetting of the direct map permissions. It can be overridden by
> > > > > > > more
> > > > > > > efficient
> > > > > > > architecture specific implementations.
> > > > > > >
> > > > > > > For the default implementation, it uses architecture agnostic
> > > > > > > methods
> > > > > > > which
> > > > > > > are
> > > > > > > equivalent to what most usages do before calling vfree. So now it is
> > > > > > > just
> > > > > > > centralized here.
> > > > > > >
> > > > > > > This implementation derives from two sketches from Dave Hansen and
> > > > > > > Andy
> > > > > > > Lutomirski.
> > > > > > >
> > > > > > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > > > > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > > > > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > > ---
> > > > > > >  include/linux/vmalloc.h |  2 ++
> > > > > > >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++-
> > > > > > > ---
> > > > > > >  2 files changed, 69 insertions(+), 6 deletions(-)
> > > > > > >
> > > > > > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > > > > > index 398e9c95cd61..872bcde17aca 100644
> > > > > > > --- a/include/linux/vmalloc.h
> > > > > > > +++ b/include/linux/vmalloc.h
> > > > > > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in
> > > > > > > notifier.h
> > > > > > > */
> > > > > > >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not
> > > > > > > fully
> > > > > > > initialized */
> > > > > > >  #define VM_NO_GUARD            0x00000040      /* don't add guard
> > > > > > > page
> > > > > > > */
> > > > > > >  #define VM_KASAN               0x00000080      /* has allocated
> > > > > > > kasan
> > > > > > > shadow memory */
> > > > > > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before
> > > > > > > releasing
> > > > > > > pages */
> > > > > > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with
> > > > > > > special
> > > > > > > perms */
> > > > > > >  /* bits [20..32] reserved for arch specific ioremap internals */
> > > > > > >
> > > > > > >  /*
> > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > > index 97d4b25d0373..02b284d2245a 100644
> > > > > > > --- a/mm/vmalloc.c
> > > > > > > +++ b/mm/vmalloc.c
> > > > > > > @@ -18,6 +18,7 @@
> > > > > > >  #include <linux/interrupt.h>
> > > > > > >  #include <linux/proc_fs.h>
> > > > > > >  #include <linux/seq_file.h>
> > > > > > > +#include <linux/set_memory.h>
> > > > > > >  #include <linux/debugobjects.h>
> > > > > > >  #include <linux/kallsyms.h>
> > > > > > >  #include <linux/list.h>
> > > > > > > @@ -38,6 +39,11 @@
> > > > > > >
> > > > > > >  #include "internal.h"
> > > > > > >
> > > > > > > +struct vfree_work {
> > > > > > > +       struct llist_node node;
> > > > > > > +       void *addr;
> > > > > > > +};
> > > > > > > +
> > > > > > >  struct vfree_deferred {
> > > > > > >         struct llist_head list;
> > > > > > >         struct work_struct wq;
> > > > > > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> > > > > > >  {
> > > > > > >         struct vfree_deferred *p = container_of(w, struct
> > > > > > > vfree_deferred,
> > > > > > > wq);
> > > > > > >         struct llist_node *t, *llnode;
> > > > > > > +       struct vfree_work *cur;
> > > > > > >
> > > > > > > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > > > > > > -               __vunmap((void *)llnode, 1);
> > > > > > > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > > > > > > +               cur = container_of(llnode, struct vfree_work, node);
> > > > > > > +               __vunmap(cur->addr, 1);
> > > > > > > +               kfree(cur);
> > > > > > > +       }
> > > > > > >  }
> > > > > > >
> > > > > > >  /*** Page table manipulation functions ***/
> > > > > > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void
> > > > > > > *addr)
> > > > > > >         return NULL;
> > > > > > >  }
> > > > > > >
> > > > > > > +/*
> > > > > > > + * This function handles unmapping and resetting the direct map as
> > > > > > > efficiently
> > > > > > > + * as it can with cross arch functions. The three categories of
> > > > > > > architectures
> > > > > > > + * are:
> > > > > > > + *   1. Architectures with no set_memory implementations and no
> > > > > > > direct
> > > > > > > map
> > > > > > > + *      permissions.
> > > > > > > + *   2. Architectures with set_memory implementations but no direct
> > > > > > > map
> > > > > > > + *      permissions
> > > > > > > + *   3. Architectures with set_memory implementations and direct
> > > > > > > map
> > > > > > > permissions
> > > > > > > + */
> > > > > > > +void __weak arch_vunmap(struct vm_struct *area, int
> > > > > > > deallocate_pages)
> > > > > >
> > > > > > My general preference is to avoid __weak functions -- they don't
> > > > > > optimize well.  Instead, I prefer either:
> > > > > >
> > > > > > #ifndef arch_vunmap
> > > > > > void arch_vunmap(...);
> > > > > > #endif
> > > > > >
> > > > > > or
> > > > > >
> > > > > > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > > > > > ...
> > > > > > #endif
> > > > >
> > > > > Ok.
> > > > > >
> > > > > > > +{
> > > > > > > +       unsigned long addr = (unsigned long)area->addr;
> > > > > > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > > > > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * In case of 2 and 3, use this general way of resetting the
> > > > > > > permissions
> > > > > > > +        * on the directmap. Do NX before RW, in case of X, so there
> > > > > > > is
> > > > > > > no
> > > > > > > W^X
> > > > > > > +        * violation window.
> > > > > > > +        *
> > > > > > > +        * For case 1 these will be noops.
> > > > > > > +        */
> > > > > > > +       if (immediate)
> > > > > > > +               set_memory_nx(addr, area->nr_pages);
> > > > > > > +       if (deallocate_pages && special)
> > > > > > > +               set_memory_rw(addr, area->nr_pages);
> > > > > >
> > > > > > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> > > > > > want that alias gone before any deallocation happens".
> > > > > > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> > > > > > me, please".  deallocate means "this was vfree -- please free the
> > > > > > pages".  I'm not convinced that all the various combinations make
> > > > > > sense.  Do we really need both flags?
> > > > >
> > > > > VM_HAS_SPECIAL_PERMS is supposed to mean, like you said, "reset the
> > > > > direct
> > > > > map".
> > > > > Where VM_IMMEDIATE_UNMAP means, the vmalloc allocation has extra
> > > > > capabilties
> > > > > where we don't want to leave an enhanced capability TLB entry to the
> > > > > freed
> > > > > page.
> > > > >
> > > > > I was trying to pick names that could apply more generally for potential
> > > > > future
> > > > > special memory capabilities. Today VM_HAS_SPECIAL_PERMS does just mean
> > > > > reset
> > > > > write to the directmap and VM_IMMEDIATE_UNMAP means vmalloc mapping is
> > > > > executable.
> > > > >
> > > > > A present day reason for keeping both flags is, it is more efficient in
> > > > > the
> > > > > arch-agnostic implementation when freeing memory that is just RO and not
> > > > > executable. It saves a TLB flush.
> > > > >
> > > > > > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> > > > > > not immediate.)
> > > > >
> > > > > True, maybe VM_MUST_FLUSH or something else?
> > > > >
> > > > > > If we do keep both flags, maybe some restructuring would make sense,
> > > > > > like this, perhaps.  Sorry about horrible whitespace damage.
> > > > > >
> > > > > > if (special) {
> > > > > >   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages.
> > > > > > */
> > > > > >   WARN_ON_ONCE(!deallocate_pages);
> > > > > >
> > > > > >   if (immediate) {
> > > > > >     /* It's possible that the vmap alias is X and we're about to make
> > > > > > the direct map RW.  To avoid a window where executable memory is
> > > > > > writable, first mark the vmap alias NX.  This is silly, since we're
> > > > > > about to *unmap* it, but this is the best we can do if all we have to
> > > > > > work with is the set_memory_abc() APIs.  Architectures should override
> > > > > > this whole function to get better behavior. */
> > > > > >     set_memory_nx(...);
> > > > > >   }
> > > > > >
> > > > > >   set_memory_rw(addr, area->nr_pages);
> > > > > > }
> > > > >
> > > > > Ok.
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +       /* Always actually remove the area */
> > > > > > > +       remove_vm_area(area->addr);
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * Need to flush the TLB before freeing pages in the case of
> > > > > > > this
> > > > > > > flag.
> > > > > > > +        * As long as that's happening, unmap aliases.
> > > > > > > +        *
> > > > > > > +        * For 2 and 3, this will not be needed because of the
> > > > > > > set_memory_nx
> > > > > > > +        * above, because the stale TLBs will be NX.
> > > > > >
> > > > > > I'm not sure I agree with this comment.  If the caller asked for an
> > > > > > immediate unmap, we should give an immediate unmap.  But I'm still not
> > > > > > sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
> > > > >
> > > > > Yea. I was just trying to save a TLB flush, since for today's callers
> > > > > that
> > > > > have
> > > > > set_memory there isn't a security downside I know of to just leaving it
> > > > > NX.
> > > > > Maybe its not worth the tradeoff of confusion? Or I can clarify that in
> > > > > the
> > > > > comment.
> > > >
> > > > Don't both of the users in your series set both flags, though?  My
> > > > real objection to having them be separate is that, in the absence of
> > > > users, it's less clear exactly what they should do and the code
> > > > doesn't get exercised.
> > >
> > > The only "just RO" user today is one of the BPF allocations. I don't have a
> > > strong objection to combining them, just explaining the thinking. I guess if
> > > we
> > > could always add another flag later if it becomes more needed.
> > >
> > > > If you document that VM_IMMEDIATE_UNMAP means "I want the TLB entries
> > > > gone", then I can re-review the code in light of that.  But then I'm
> > > > unconvinced by your generic implementation, since set_memory_nx()
> > > > seems like an odd way to go about it.
> > >
> > > Masami Hiramatsu pointed out if we don't do set_memory_nx before
> > > set_memory_rw,
> > > then there will be a small window of W^X violation. So that was the concern
> > > for
> > > the executable case, regardless of the semantics. I think the concern
> > > applies
> > > for any "special capability" permissions. Alternatively, if we
> > > remove_vm_area
> > > before we reset the direct map perms RW, maybe that would accomplish the
> > > same
> > > thing, if that's possible in a cross arch way. Maybe this is too much
> > > designing
> > > for hypothetical future... just was trying to avoid having to change the
> > > interface, and could just update the generic implementation if new
> > > permissions
> > > or usages come up.
> > >
> > > The set_memory_ stuff is really only needed for arm64 which seems to be the
> > > only
> > > other one with directmap permissions. So if it could eventually have its own
> > > arch_vunmap then all of the set_memory_ parts could be dropped and the
> > > default
> > > would just be the simple unmap then flush logic that it was originally.
> >
> > I think that's probably the best solution.  If there are only two
> > arches that have anything fancy here, let's just fix both of them up
> > for real.
> >
> > >
> > > Or we have up to three flushes for the generic version and meet the name
> > > expectations and needed functionality today. I guess I'll just try that.
> > > > >
> > > > > > > +        */
> > > > > > > +       if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> > > > > > > +               vm_unmap_aliases();
> > > > > > > +}
> > > > > > > +
> > > > > > >  static void __vunmap(const void *addr, int deallocate_pages)
> > > > > > >  {
> > > > > > >         struct vm_struct *area;
> > > > > > > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int
> > > > > > > deallocate_pages)
> > > > > > >         debug_check_no_locks_freed(area->addr,
> > > > > > > get_vm_area_size(area));
> > > > > > >         debug_check_no_obj_freed(area->addr,
> > > > > > > get_vm_area_size(area));
> > > > > > >
> > > > > > > -       remove_vm_area(addr);
> > > > > > > +       arch_vunmap(area, deallocate_pages);
> > > > > > > +
> > > > > > >         if (deallocate_pages) {
> > > > > > >                 int i;
> > > > > > >
> > > > > > > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const
> > > > > > > void
> > > > > > > *addr)
> > > > > > >          * nother cpu's list.  schedule_work() should be fine with
> > > > > > > this
> > > > > > > too.
> > > > > > >          */
> > > > > > >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > > > > > > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work),
> > > > > > > GFP_ATOMIC);
> > > > > > > +
> > > > > > > +       /* If no memory for the deferred list node, give up */
> > > > > > > +       if (!w)
> > > > > > > +               return;
> > > > > >
> > > > > > That's nasty.  I see what you're trying to do here, but I think you're
> > > > > > solving a problem that doesn't need solving quite so urgently.  How
> > > > > > about dropping this part and replacing it with a comment like "NB:
> > > > > > this writes a word to a potentially executable address.  It would be
> > > > > > nice if we could avoid doing this."  And maybe a future patch could
> > > > > > more robustly avoid it without risking memory leaks.
> > > > >
> > > > > Yea, sorry I should have called this out, because I wasn't sure on how
> > > > > likely
> > > > > that was to happen. I did find some other places in the kernel with the
> > > > > same
> > > > > ignoring logic.
> > > > >
> > > > > I'll have to think though, I am not sure what the alternative is. Since
> > > > > the
> > > > > memory can be RO in the module_memfree case, the old method of re-using
> > > > > the
> > > > > allocation will no longer work. The list node could be stuffed on the
> > > > > vm_struct,
> > > > > but then the all of the spin_lock(&vmap_area_lock)'s need to be changed
> > > > > to
> > > > > work
> > > > > with interrupts so that the struct could be looked up. Not sure of the
> > > > > implications of that. Or maybe have some slow backup that resets the
> > > > > permissions
> > > > > and re-uses the allocation if kmalloc fails?
> > > > >
> > > > > I guess it could also go back to the old v1 implementation that doesn't
> > > > > handle
> > > > > RO and the directmap, and leave the W^X violation window during
> > > > > teardown.
> > > > > Then
> > > > > solve that problem when modules are loaded via something like Nadav's
> > > > > stuff.
> > > > >
> > > >
> > > > Hmm.  Switching to spin_lock_irqsave() doesn't seem so bad to me.
> > >
> > > Ok.
> >
> > Actually, I think I have a better solution.  Just declare the
> > problematic case to be illegal: say that you may not free memory with
> > the new flags set while IRQs are off.  Enforce this with a VM_WARN_ON
> > in the code that reads the vfree_deferred list.
>
> Thanks. Yeah, just making a rule for the one case seems better than disabling
> interrupts all over. It turned out the lock is taken in quite a few places,
> including the longish lazy purge operation. Reading a little history on the
> deferred free list: six years ago vfree did not support interrupts at all, and
> different clients had their own work queues. So this just restores the original
> situation for the new vm flag.
>
> I think we only need to move the module init section free from the RCU callback
> to a work queue; at that point, functionality-wise, everything should work with
> the existing deferred free list implementation (since we will then simply not
> use it for the new special memory case).

> We could also add a
> WARN_ON(in_interrupt()) in module_memfree to give callers some context for what
> will soon be a "BUG: unable to handle kernel paging request". Any objection to
> leaving it there?
>

Seems reasonable.

> Beyond the above, moving the vfree_deferred list node onto the vm_struct, with
> the associated cost of a lookup in every interrupt/atomic vfree, would enable a
> WARN and the handling of a (declared) illegal case that could deadlock anyway,
> right? Is it worth it?

I suspect it's not worth it.

>
> I'm not sure why we wouldn't have deadlocks in normal interrupt vfrees if we
> don't use irq spinlocks everywhere... I may be missing your insight.
>

I may be misunderstanding your question, but: I suspect that we can
easily deadlock if vfree() is called with IRQs off but
!in_interrupt().  Perhaps no one does that?  At the very least, I
assume that lockdep would scream loudly if this happened.
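
For example, a hypothetical caller (some_lock and addr are illustrative, not
from the thread):

	spin_lock_irqsave(&some_lock, flags);  /* IRQs off, !in_interrupt() */
	vfree(addr);   /* synchronous path: takes the IRQ-unsafe
	                * vmap_area_lock and may issue TLB-shootdown IPIs;
	                * waiting for IPIs with IRQs disabled can deadlock */
	spin_unlock_irqrestore(&some_lock, flags);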

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-12  2:20   ` Andy Lutomirski
  2018-12-12 19:50     ` Edgecombe, Rick P
@ 2018-12-21 16:39     ` Ard Biesheuvel
  2018-12-21 17:12       ` Andy Lutomirski
  1 sibling, 1 reply; 26+ messages in thread
From: Ard Biesheuvel @ 2018-12-21 16:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, Andrew Morton, Will Deacon, Linux-MM, LKML,
	Kernel Hardening, Naveen N . Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Steven Rostedt, Ingo Molnar,
	Alexei Starovoitov, Daniel Borkmann, Jessica Yu, Nadav Amit,
	Network Development, Jann Horn, Kristen Carlson Accardi

On Wed, 12 Dec 2018 at 03:20, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> >
> > This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> > enabling vfree operations to immediately clear executable TLB entries to freed
> > pages, and handle freeing memory with special permissions.
> >
> > In order to support vfree being called on memory that might be RO, the vfree
> > deferred list node is moved to a kmalloc allocated struct, from where it is
> > today, reusing the allocation being freed.
> >
> > arch_vunmap is a new __weak function that implements the actual unmapping and
> > resetting of the direct map permissions. It can be overridden by more efficient
> > architecture specific implementations.
> >
> > For the default implementation, it uses architecture agnostic methods which are
> > equivalent to what most usages do before calling vfree. So now it is just
> > centralized here.
> >
> > This implementation derives from two sketches from Dave Hansen and Andy
> > Lutomirski.
> >
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > Suggested-by: Will Deacon <will.deacon@arm.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> >  include/linux/vmalloc.h |  2 ++
> >  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 69 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 398e9c95cd61..872bcde17aca 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h */
> >  #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not fully initialized */
> >  #define VM_NO_GUARD            0x00000040      /* don't add guard page */
> >  #define VM_KASAN               0x00000080      /* has allocated kasan shadow memory */
> > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before releasing pages */
> > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with special perms */
> >  /* bits [20..32] reserved for arch specific ioremap internals */
> >
> >  /*
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 97d4b25d0373..02b284d2245a 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/interrupt.h>
> >  #include <linux/proc_fs.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/set_memory.h>
> >  #include <linux/debugobjects.h>
> >  #include <linux/kallsyms.h>
> >  #include <linux/list.h>
> > @@ -38,6 +39,11 @@
> >
> >  #include "internal.h"
> >
> > +struct vfree_work {
> > +       struct llist_node node;
> > +       void *addr;
> > +};
> > +
> >  struct vfree_deferred {
> >         struct llist_head list;
> >         struct work_struct wq;
> > @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> >  {
> >         struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
> >         struct llist_node *t, *llnode;
> > +       struct vfree_work *cur;
> >
> > -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > -               __vunmap((void *)llnode, 1);
> > +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> > +               cur = container_of(llnode, struct vfree_work, node);
> > +               __vunmap(cur->addr, 1);
> > +               kfree(cur);
> > +       }
> >  }
> >
> >  /*** Page table manipulation functions ***/
> > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
> >         return NULL;
> >  }
> >
> > +/*
> > + * This function handles unmapping and resetting the direct map as efficiently
> > + * as it can with cross arch functions. The three categories of architectures
> > + * are:
> > + *   1. Architectures with no set_memory implementations and no direct map
> > + *      permissions.
> > + *   2. Architectures with set_memory implementations but no direct map
> > + *      permissions
> > + *   3. Architectures with set_memory implementations and direct map permissions
> > + */
> > +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
>
> My general preference is to avoid __weak functions -- they don't
> optimize well.  Instead, I prefer either:
>
> #ifndef arch_vunmap
> void arch_vunmap(...);
> #endif
>
> or
>
> #ifdef CONFIG_HAVE_ARCH_VUNMAP
> ...
> #endif
>
>
> > +{
> > +       unsigned long addr = (unsigned long)area->addr;
> > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > +
> > +       /*
> > +        * In case of 2 and 3, use this general way of resetting the permissions
> > +        * on the directmap. Do NX before RW, in case of X, so there is no W^X
> > +        * violation window.
> > +        *
> > +        * For case 1 these will be noops.
> > +        */
> > +       if (immediate)
> > +               set_memory_nx(addr, area->nr_pages);
> > +       if (deallocate_pages && special)
> > +               set_memory_rw(addr, area->nr_pages);
>
> Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> want that alias gone before any deallocation happens".
> VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> me, please".  deallocate means "this was vfree -- please free the
> pages".  I'm not convinced that all the various combinations make
> sense.  Do we really need both flags?
>
> (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> not immediate.)
>
> If we do keep both flags, maybe some restructuring would make sense,
> like this, perhaps.  Sorry about horrible whitespace damage.
>
> if (special) {
>   /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
>   WARN_ON_ONCE(!deallocate_pages);
>
>   if (immediate) {
>     /* It's possible that the vmap alias is X and we're about to make
> the direct map RW.  To avoid a window where executable memory is
> writable, first mark the vmap alias NX.  This is silly, since we're
> about to *unmap* it, but this is the best we can do if all we have to
> work with is the set_memory_abc() APIs.  Architectures should override
> this whole function to get better behavior. */

So can't we fix this first? Assuming that architectures that bother to
implement them will not have executable mappings in the linear region,
all we'd need is set_linear_range_ro/rw() routines that default to
doing nothing, and encapsulate the existing code for x86 and arm64.
That way, we can do things in the proper order, i.e., release
the vmalloc mapping (without caring about the permissions), restore
the linear alias attributes, and finally release the pages.
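
A sketch of what such defaults could look like (the helper names are from this
thread; the signatures and config symbol are assumptions):

	#ifndef CONFIG_ARCH_HAS_LINEAR_RANGE_PERMS
	/* No direct map permissions on this arch: nothing to do */
	static inline int set_linear_range_ro(struct page *page, int numpages)
	{
		return 0;
	}
	static inline int set_linear_range_rw(struct page *page, int numpages)
	{
		return 0;
	}
	#endif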


>     set_memory_nx(...);
>   }
>
>   set_memory_rw(addr, area->nr_pages);
> }
>
>
> > +
> > +       /* Always actually remove the area */
> > +       remove_vm_area(area->addr);
> > +
> > +       /*
> > +        * Need to flush the TLB before freeing pages in the case of this flag.
> > +        * As long as that's happening, unmap aliases.
> > +        *
> > +        * For 2 and 3, this will not be needed because of the set_memory_nx
> > +        * above, because the stale TLBs will be NX.
>
> I'm not sure I agree with this comment.  If the caller asked for an
> immediate unmap, we should give an immediate unmap.  But I'm still not
> sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.
>
> > +        */
> > +       if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> > +               vm_unmap_aliases();
> > +}
> > +
> >  static void __vunmap(const void *addr, int deallocate_pages)
> >  {
> >         struct vm_struct *area;
> > @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
> >         debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
> >         debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
> >
> > -       remove_vm_area(addr);
> > +       arch_vunmap(area, deallocate_pages);
> > +
> >         if (deallocate_pages) {
> >                 int i;
> >
> > @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void *addr)
> >          * nother cpu's list.  schedule_work() should be fine with this too.
> >          */
> >         struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> > +       struct vfree_work *w = kmalloc(sizeof(struct vfree_work), GFP_ATOMIC);
> > +
> > +       /* If no memory for the deferred list node, give up */
> > +       if (!w)
> > +               return;
>
> That's nasty.  I see what you're trying to do here, but I think you're
> solving a problem that doesn't need solving quite so urgently.  How
> about dropping this part and replacing it with a comment like "NB:
> this writes a word to a potentially executable address.  It would be
> nice if we could avoid doing this."  And maybe a future patch could
> more robustly avoid it without risking memory leaks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-21 16:39     ` Ard Biesheuvel
@ 2018-12-21 17:12       ` Andy Lutomirski
  2018-12-21 17:25         ` Ard Biesheuvel
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2018-12-21 17:12 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Andy Lutomirski, Rick Edgecombe, Andrew Morton, Will Deacon,
	Linux-MM, LKML, Kernel Hardening, Naveen N . Rao,
	Anil S Keshavamurthy, David S. Miller, Masami Hiramatsu,
	Steven Rostedt, Ingo Molnar, Alexei Starovoitov, Daniel Borkmann,
	Jessica Yu, Nadav Amit, Network Development, Jann Horn

> On Dec 21, 2018, at 9:39 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> On Wed, 12 Dec 2018 at 03:20, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
>> <rick.p.edgecombe@intel.com> wrote:
>>>
>>> This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
>>> enabling vfree operations to immediately clear executable TLB entries to freed
>>> pages, and handle freeing memory with special permissions.
>>>
>>> In order to support vfree being called on memory that might be RO, the vfree
>>> deferred list node is moved to a kmalloc allocated struct, from where it is
>>> today, reusing the allocation being freed.
>>>
>>> arch_vunmap is a new __weak function that implements the actual unmapping and
>>> resetting of the direct map permissions. It can be overridden by more efficient
>>> architecture specific implementations.
>>>
>>> For the default implementation, it uses architecture agnostic methods which are
>>> equivalent to what most usages do before calling vfree. So now it is just
>>> centralized here.
>>>
>>> This implementation derives from two sketches from Dave Hansen and Andy
>>> Lutomirski.
>>>
>>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>>> Suggested-by: Andy Lutomirski <luto@kernel.org>
>>> Suggested-by: Will Deacon <will.deacon@arm.com>
>>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> ---
>>> include/linux/vmalloc.h |  2 ++
>>> mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
>>> 2 files changed, 69 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>>> index 398e9c95cd61..872bcde17aca 100644
>>> --- a/include/linux/vmalloc.h
>>> +++ b/include/linux/vmalloc.h
>>> @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h */
>>> #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not fully initialized */
>>> #define VM_NO_GUARD            0x00000040      /* don't add guard page */
>>> #define VM_KASAN               0x00000080      /* has allocated kasan shadow memory */
>>> +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before releasing pages */
>>> +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with special perms */
>>> /* bits [20..32] reserved for arch specific ioremap internals */
>>>
>>> /*
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index 97d4b25d0373..02b284d2245a 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -18,6 +18,7 @@
>>> #include <linux/interrupt.h>
>>> #include <linux/proc_fs.h>
>>> #include <linux/seq_file.h>
>>> +#include <linux/set_memory.h>
>>> #include <linux/debugobjects.h>
>>> #include <linux/kallsyms.h>
>>> #include <linux/list.h>
>>> @@ -38,6 +39,11 @@
>>>
>>> #include "internal.h"
>>>
>>> +struct vfree_work {
>>> +       struct llist_node node;
>>> +       void *addr;
>>> +};
>>> +
>>> struct vfree_deferred {
>>>        struct llist_head list;
>>>        struct work_struct wq;
>>> @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
>>> {
>>>        struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
>>>        struct llist_node *t, *llnode;
>>> +       struct vfree_work *cur;
>>>
>>> -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
>>> -               __vunmap((void *)llnode, 1);
>>> +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
>>> +               cur = container_of(llnode, struct vfree_work, node);
>>> +               __vunmap(cur->addr, 1);
>>> +               kfree(cur);
>>> +       }
>>> }
>>>
>>> /*** Page table manipulation functions ***/
>>> @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
>>>        return NULL;
>>> }
>>>
>>> +/*
>>> + * This function handles unmapping and resetting the direct map as efficiently
>>> + * as it can with cross arch functions. The three categories of architectures
>>> + * are:
>>> + *   1. Architectures with no set_memory implementations and no direct map
>>> + *      permissions.
>>> + *   2. Architectures with set_memory implementations but no direct map
>>> + *      permissions
>>> + *   3. Architectures with set_memory implementations and direct map permissions
>>> + */
>>> +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
>>
>> My general preference is to avoid __weak functions -- they don't
>> optimize well.  Instead, I prefer either:
>>
>> #ifndef arch_vunmap
>> void arch_vunmap(...);
>> #endif
>>
>> or
>>
>> #ifdef CONFIG_HAVE_ARCH_VUNMAP
>> ...
>> #endif
>>
>>
>>> +{
>>> +       unsigned long addr = (unsigned long)area->addr;
>>> +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
>>> +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
>>> +
>>> +       /*
>>> +        * In case of 2 and 3, use this general way of resetting the permissions
>>> +        * on the directmap. Do NX before RW, in case of X, so there is no W^X
>>> +        * violation window.
>>> +        *
>>> +        * For case 1 these will be noops.
>>> +        */
>>> +       if (immediate)
>>> +               set_memory_nx(addr, area->nr_pages);
>>> +       if (deallocate_pages && special)
>>> +               set_memory_rw(addr, area->nr_pages);
>>
>> Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
>> want that alias gone before any deallocation happens".
>> VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
>> me, please".  deallocate means "this was vfree -- please free the
>> pages".  I'm not convinced that all the various combinations make
>> sense.  Do we really need both flags?
>>
>> (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
>> not immediate.)
>>
>> If we do keep both flags, maybe some restructuring would make sense,
>> like this, perhaps.  Sorry about horrible whitespace damage.
>>
>> if (special) {
>>  /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
>>  WARN_ON_ONCE(!deallocate_pages);
>>
>>  if (immediate) {
>>    /* It's possible that the vmap alias is X and we're about to make
>> the direct map RW.  To avoid a window where executable memory is
>> writable, first mark the vmap alias NX.  This is silly, since we're
>> about to *unmap* it, but this is the best we can do if all we have to
>> work with is the set_memory_abc() APIs.  Architectures should override
>> this whole function to get better behavior. */
>
> So can't we fix this first? Assuming that architectures that bother to
> implement them will not have executable mappings in the linear region,
> all we'd need is set_linear_range_ro/rw() routines that default to
> doing nothing, and encapsulate the existing code for x86 and arm64.
> That way, we can do things in the proper order, i.e., release
> the vmalloc mapping (without caring about the permissions), restore
> the linear alias attributes, and finally release the pages.

Seems reasonable, except that I think it should be
set_linear_range_not_present() and set_linear_range_rw(), for three
reasons:

1. It’s not at all clear to me that we need to keep the linear mapping
around for modules.

2. At least on x86, the obvious algorithm to do the free operation
with a single flush requires it.  Someone should probably confirm that
arm’s TLB works the same way, i.e. that no flush is needed when
changing from not-present (or whatever ARM calls it) to RW.

3. Anyone playing with XPFO wants this facility anyway.  In fact, with
this change, Rick’s series will more or less implement XPFO for
vmalloc memory :)

Does that seem reasonable to you?
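
A sketch of the single-flush teardown this enables (set_linear_range_*() are
the hypothetical helpers from this thread; a real version would also have to
cover the linear alias in the flush, and the not-present -> RW transition is
assumed to need no flush, per point 2):

	static void vunmap_special_perms(struct vm_struct *area)
	{
		unsigned long start = (unsigned long)area->addr;
		unsigned long end = start + get_vm_area_size(area);

		/* 1. Mark the linear alias not-present; no flush yet */
		set_linear_range_not_present(area->pages, area->nr_pages);
		/* 2. Clear the vmalloc page tables; no flush yet */
		remove_vm_area(area->addr);
		/* 3. One flush covers the stale vmap entries */
		flush_tlb_kernel_range(start, end);
		/* 4. not-present -> RW: assumed to need no flush */
		set_linear_range_rw(area->pages, area->nr_pages);
		/* 5. The pages are now safe to free */
	}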

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-21 17:12       ` Andy Lutomirski
@ 2018-12-21 17:25         ` Ard Biesheuvel
  2018-12-21 19:57           ` Edgecombe, Rick P
  0 siblings, 1 reply; 26+ messages in thread
From: Ard Biesheuvel @ 2018-12-21 17:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rick Edgecombe, Andrew Morton, Will Deacon, Linux-MM, LKML,
	Kernel Hardening, Naveen N . Rao, Anil S Keshavamurthy,
	David S. Miller, Masami Hiramatsu, Steven Rostedt, Ingo Molnar,
	Alexei Starovoitov, Daniel Borkmann, Jessica Yu, Nadav Amit,
	Network Development, Jann Horn, Kristen Carlson Accardi

On Fri, 21 Dec 2018 at 18:12, Andy Lutomirski <luto@kernel.org> wrote:
>
> > On Dec 21, 2018, at 9:39 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> >> On Wed, 12 Dec 2018 at 03:20, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >> On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> >> <rick.p.edgecombe@intel.com> wrote:
> >>>
> >>> This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> >>> enabling vfree operations to immediately clear executable TLB entries to freed
> >>> pages, and handle freeing memory with special permissions.
> >>>
> >>> In order to support vfree being called on memory that might be RO, the vfree
> >>> deferred list node is moved to a kmalloc allocated struct, from where it is
> >>> today, reusing the allocation being freed.
> >>>
> >>> arch_vunmap is a new __weak function that implements the actual unmapping and
> >>> resetting of the direct map permissions. It can be overridden by more efficient
> >>> architecture specific implementations.
> >>>
> >>> The default implementation uses architecture-agnostic methods equivalent to
> >>> what most callers already do before calling vfree, so that logic is now just
> >>> centralized here.
> >>>
> >>> This implementation derives from two sketches from Dave Hansen and Andy
> >>> Lutomirski.
> >>>
> >>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> >>> Suggested-by: Andy Lutomirski <luto@kernel.org>
> >>> Suggested-by: Will Deacon <will.deacon@arm.com>
> >>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> >>> ---
> >>> include/linux/vmalloc.h |  2 ++
> >>> mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
> >>> 2 files changed, 69 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> >>> index 398e9c95cd61..872bcde17aca 100644
> >>> --- a/include/linux/vmalloc.h
> >>> +++ b/include/linux/vmalloc.h
> >>> @@ -21,6 +21,8 @@ struct notifier_block;                /* in notifier.h */
> >>> #define VM_UNINITIALIZED       0x00000020      /* vm_struct is not fully initialized */
> >>> #define VM_NO_GUARD            0x00000040      /* don't add guard page */
> >>> #define VM_KASAN               0x00000080      /* has allocated kasan shadow memory */
> >>> +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush before releasing pages */
> >>> +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be freed with special perms */
> >>> /* bits [20..32] reserved for arch specific ioremap internals */
> >>>
> >>> /*
> >>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >>> index 97d4b25d0373..02b284d2245a 100644
> >>> --- a/mm/vmalloc.c
> >>> +++ b/mm/vmalloc.c
> >>> @@ -18,6 +18,7 @@
> >>> #include <linux/interrupt.h>
> >>> #include <linux/proc_fs.h>
> >>> #include <linux/seq_file.h>
> >>> +#include <linux/set_memory.h>
> >>> #include <linux/debugobjects.h>
> >>> #include <linux/kallsyms.h>
> >>> #include <linux/list.h>
> >>> @@ -38,6 +39,11 @@
> >>>
> >>> #include "internal.h"
> >>>
> >>> +struct vfree_work {
> >>> +       struct llist_node node;
> >>> +       void *addr;
> >>> +};
> >>> +
> >>> struct vfree_deferred {
> >>>        struct llist_head list;
> >>>        struct work_struct wq;
> >>> @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
> >>> {
> >>>        struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
> >>>        struct llist_node *t, *llnode;
> >>> +       struct vfree_work *cur;
> >>>
> >>> -       llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> >>> -               __vunmap((void *)llnode, 1);
> >>> +       llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> >>> +               cur = container_of(llnode, struct vfree_work, node);
> >>> +               __vunmap(cur->addr, 1);
> >>> +               kfree(cur);
> >>> +       }
> >>> }
> >>>
> >>> /*** Page table manipulation functions ***/
> >>> @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
> >>>        return NULL;
> >>> }
> >>>
> >>> +/*
> >>> + * This function handles unmapping and resetting the direct map as efficiently
> >>> + * as it can with cross arch functions. The three categories of architectures
> >>> + * are:
> >>> + *   1. Architectures with no set_memory implementations and no direct map
> >>> + *      permissions.
> >>> + *   2. Architectures with set_memory implementations but no direct map
> >>> + *      permissions
> >>> + *   3. Architectures with set_memory implementations and direct map permissions
> >>> + */
> >>> +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)
> >>
> >> My general preference is to avoid __weak functions -- they don't
> >> optimize well.  Instead, I prefer either:
> >>
> >> #ifndef arch_vunmap
> >> void arch_vunmap(...);
> >> #endif
> >>
> >> or
> >>
> >> #ifdef CONFIG_HAVE_ARCH_VUNMAP
> >> ...
> >> #endif
> >>
> >>
> >>> +{
> >>> +       unsigned long addr = (unsigned long)area->addr;
> >>> +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> >>> +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> >>> +
> >>> +       /*
> >>> +        * In case of 2 and 3, use this general way of resetting the permissions
> >>> +        * on the directmap. Do NX before RW, in case of X, so there is no W^X
> >>> +        * violation window.
> >>> +        *
> >>> +        * For case 1 these will be noops.
> >>> +        */
> >>> +       if (immediate)
> >>> +               set_memory_nx(addr, area->nr_pages);
> >>> +       if (deallocate_pages && special)
> >>> +               set_memory_rw(addr, area->nr_pages);
> >>
> >> Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means "I
> >> want that alias gone before any deallocation happens".
> >> VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
> >> me, please".  deallocate means "this was vfree -- please free the
> >> pages".  I'm not convinced that all the various combinations make
> >> sense.  Do we really need both flags?
> >>
> >> (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
> >> not immediate.)
> >>
> >> If we do keep both flags, maybe some restructuring would make sense,
> >> like this, perhaps.  Sorry about horrible whitespace damage.
> >>
> >> if (special) {
> >>  /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages. */
> >>  WARN_ON_ONCE(!deallocate_pages);
> >>
> >>  if (immediate) {
> >>    /* It's possible that the vmap alias is X and we're about to make
> >> the direct map RW.  To avoid a window where executable memory is
> >> writable, first mark the vmap alias NX.  This is silly, since we're
> >> about to *unmap* it, but this is the best we can do if all we have to
> >> work with is the set_memory_abc() APIs.  Architectures should override
> >> this whole function to get better behavior. */
> >
> > So can't we fix this first? Assuming that architectures that bother to
> > implement them will not have executable mappings in the linear region,
> > all we'd need is set_linear_range_ro/rw() routines that default to
> > doing nothing, and encapsulate the existing code for x86 and arm64.
> > That way, we can do things in the proper order, i.e., release
> > the vmalloc mapping (without caring about the permissions), restore
> > the linear alias attributes, and finally release the pages.
>
> Seems reasonable, except that I think it should be
> set_linear_range_not_present() and set_linear_range_rw(), for three
> reasons:
>
> 1. It’s not at all clear to me that we need to keep the linear mapping
> around for modules.
>

I'm pretty sure hibernate on arm64 will have to be fixed, since it
expects to be able to read all valid pages via the linear map. But we
can fix that.

> 2. At least on x86, the obvious algorithm to do the free operation
> with a single flush requires it.  Someone should probably confirm that
> arm’s TLB works the same way, i.e. that no flush is needed when
> changing from not-present (or whatever ARM calls it) to RW.
>

Good point. ARM is similar in this regard, although we'll probably
clear the access flag rather than unmap the page entirely (which is
treated the same way in terms of required TLB management).

> 3. Anyone playing with XPFO wants this facility anyway.  In fact, with
> this change, Rick’s series will more or less implement XPFO for
> vmalloc memory :)
>
> Does that seem reasonable to you?

Absolutely.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-21 17:25         ` Ard Biesheuvel
@ 2018-12-21 19:57           ` Edgecombe, Rick P
  2018-12-22 11:12             ` Ard Biesheuvel
  0 siblings, 1 reply; 26+ messages in thread
From: Edgecombe, Rick P @ 2018-12-21 19:57 UTC (permalink / raw)
  To: ard.biesheuvel, luto
  Cc: linux-kernel, daniel, jeyu, rostedt, ast, linux-mm, jannh, Dock,
	Deneen T, kristen, akpm, will.deacon, mingo, namit,
	kernel-hardening, Keshavamurthy, Anil S, mhiramat, naveen.n.rao,
	davem,

On Fri, 2018-12-21 at 18:25 +0100, Ard Biesheuvel wrote:
> On Fri, 21 Dec 2018 at 18:12, Andy Lutomirski <luto@kernel.org>
> wrote:
> > > On Dec 21, 2018, at 9:39 AM, Ard Biesheuvel <
> > > ard.biesheuvel@linaro.org> wrote:
> > > 
> > > > On Wed, 12 Dec 2018 at 03:20, Andy Lutomirski <luto@kernel.org>
> > > > wrote:
> > > > 
> > > > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > > > <rick.p.edgecombe@intel.com> wrote:
> > > > > This adds two new flags VM_IMMEDIATE_UNMAP and
> > > > > VM_HAS_SPECIAL_PERMS, for
> > > > > enabling vfree operations to immediately clear executable TLB
> > > > > entries to freed
> > > > > pages, and handle freeing memory with special permissions.
> > > > > 
> > > > > In order to support vfree being called on memory that might
> > > > > be RO, the vfree deferred list node is moved from where it
> > > > > lives today (reusing the allocation being freed) into a
> > > > > separately kmalloc'ed struct.
> > > > > 
> > > > > arch_vunmap is a new __weak function that implements the
> > > > > actual unmapping and
> > > > > resetting of the direct map permissions. It can be overridden
> > > > > by more efficient
> > > > > architecture specific implementations.
> > > > > 
> > > > > The default implementation uses architecture-agnostic
> > > > > methods equivalent to what most callers already do before
> > > > > calling vfree, so that logic is now just centralized here.
> > > > > 
> > > > > This implementation derives from two sketches from Dave
> > > > > Hansen and Andy
> > > > > Lutomirski.
> > > > > 
> > > > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > ---
> > > > > include/linux/vmalloc.h |  2 ++
> > > > > mm/vmalloc.c            | 73
> > > > > +++++++++++++++++++++++++++++++++++++----
> > > > > 2 files changed, 69 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/vmalloc.h
> > > > > b/include/linux/vmalloc.h
> > > > > index 398e9c95cd61..872bcde17aca 100644
> > > > > --- a/include/linux/vmalloc.h
> > > > > +++ b/include/linux/vmalloc.h
> > > > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in
> > > > > notifier.h */
> > > > > #define VM_UNINITIALIZED       0x00000020      /* vm_struct
> > > > > is not fully initialized */
> > > > > #define VM_NO_GUARD            0x00000040      /* don't add
> > > > > guard page */
> > > > > #define VM_KASAN               0x00000080      /* has
> > > > > allocated kasan shadow memory */
> > > > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush
> > > > > before releasing pages */
> > > > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be
> > > > > freed with special perms */
> > > > > /* bits [20..32] reserved for arch specific ioremap internals
> > > > > */
> > > > > 
> > > > > /*
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 97d4b25d0373..02b284d2245a 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -18,6 +18,7 @@
> > > > > #include <linux/interrupt.h>
> > > > > #include <linux/proc_fs.h>
> > > > > #include <linux/seq_file.h>
> > > > > +#include <linux/set_memory.h>
> > > > > #include <linux/debugobjects.h>
> > > > > #include <linux/kallsyms.h>
> > > > > #include <linux/list.h>
> > > > > @@ -38,6 +39,11 @@
> > > > > 
> > > > > #include "internal.h"
> > > > > 
> > > > > +struct vfree_work {
> > > > > +       struct llist_node node;
> > > > > +       void *addr;
> > > > > +};
> > > > > +
> > > > > struct vfree_deferred {
> > > > >        struct llist_head list;
> > > > >        struct work_struct wq;
> > > > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct
> > > > > *w)
> > > > > {
> > > > >        struct vfree_deferred *p = container_of(w, struct
> > > > > vfree_deferred, wq);
> > > > >        struct llist_node *t, *llnode;
> > > > > +       struct vfree_work *cur;
> > > > > 
> > > > > -       llist_for_each_safe(llnode, t, llist_del_all(&p-
> > > > > >list))
> > > > > -               __vunmap((void *)llnode, 1);
> > > > > +       llist_for_each_safe(llnode, t, llist_del_all(&p-
> > > > > >list)) {
> > > > > +               cur = container_of(llnode, struct vfree_work,
> > > > > node);
> > > > > +               __vunmap(cur->addr, 1);
> > > > > +               kfree(cur);
> > > > > +       }
> > > > > }
> > > > > 
> > > > > /*** Page table manipulation functions ***/
> > > > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const
> > > > > void *addr)
> > > > >        return NULL;
> > > > > }
> > > > > 
> > > > > +/*
> > > > > + * This function handles unmapping and resetting the direct
> > > > > map as efficiently
> > > > > + * as it can with cross arch functions. The three categories
> > > > > of architectures
> > > > > + * are:
> > > > > + *   1. Architectures with no set_memory implementations and
> > > > > no direct map
> > > > > + *      permissions.
> > > > > + *   2. Architectures with set_memory implementations but no
> > > > > direct map
> > > > > + *      permissions
> > > > > + *   3. Architectures with set_memory implementations and
> > > > > direct map permissions
> > > > > + */
> > > > > +void __weak arch_vunmap(struct vm_struct *area, int
> > > > > deallocate_pages)
> > > > 
> > > > My general preference is to avoid __weak functions -- they
> > > > don't
> > > > optimize well.  Instead, I prefer either:
> > > > 
> > > > #ifndef arch_vunmap
> > > > void arch_vunmap(...);
> > > > #endif
> > > > 
> > > > or
> > > > 
> > > > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > > > ...
> > > > #endif
> > > > 
> > > > 
> > > > > +{
> > > > > +       unsigned long addr = (unsigned long)area->addr;
> > > > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > > > +
> > > > > +       /*
> > > > > +        * In case of 2 and 3, use this general way of
> > > > > resetting the permissions
> > > > > +        * on the directmap. Do NX before RW, in case of X,
> > > > > so there is no W^X
> > > > > +        * violation window.
> > > > > +        *
> > > > > +        * For case 1 these will be noops.
> > > > > +        */
> > > > > +       if (immediate)
> > > > > +               set_memory_nx(addr, area->nr_pages);
> > > > > +       if (deallocate_pages && special)
> > > > > +               set_memory_rw(addr, area->nr_pages);
> > > > 
> > > > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means
> > > > "I
> > > > want that alias gone before any deallocation happens".
> > > > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix
> > > > it for
> > > > me, please".  deallocate means "this was vfree -- please free
> > > > the
> > > > pages".  I'm not convinced that all the various combinations
> > > > make
> > > > sense.  Do we really need both flags?
> > > > 
> > > > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if
> > > > in_interrupt(), it's
> > > > not immediate.)
> > > > 
> > > > If we do keep both flags, maybe some restructuring would make
> > > > sense,
> > > > like this, perhaps.  Sorry about horrible whitespace damage.
> > > > 
> > > > if (special) {
> > > >  /* VM_HAS_SPECIAL_PERMS makes little sense without
> > > > deallocate_pages. */
> > > >  WARN_ON_ONCE(!deallocate_pages);
> > > > 
> > > >  if (immediate) {
> > > >    /* It's possible that the vmap alias is X and we're about to
> > > > make
> > > > the direct map RW.  To avoid a window where executable memory
> > > > is
> > > > writable, first mark the vmap alias NX.  This is silly, since
> > > > we're
> > > > about to *unmap* it, but this is the best we can do if all we
> > > > have to
> > > > work with is the set_memory_abc() APIs.  Architectures should
> > > > override
> > > > this whole function to get better behavior. */
> > > 
> > > So can't we fix this first? Assuming that architectures that
> > > bother to
> > > implement them will not have executable mappings in the linear
> > > region,
> > > all we'd need is set_linear_range_ro/rw() routines that default
> > > to
> > > doing nothing, and encapsulate the existing code for x86 and
> > > arm64.
> > > That way, we can do things in the proper order, i.e.,
> > > release
> > > the vmalloc mapping (without caring about the permissions),
> > > restore
> > > the linear alias attributes, and finally release the pages.
> > 
> > Seems reasonable, except that I think it should be
> > set_linear_range_not_present() and set_linear_range_rw(), for three
> > reasons:
> > 
> > 1. It’s not at all clear to me that we need to keep the linear
> > mapping
> > around for modules.
> > 
> 
> I'm pretty sure hibernate on arm64 will have to be fixed, since it
> expects to be able to read all valid pages via the linear map. But we
> can fix that.
Hmm, now I wonder what else might be trying to access the entire direct
map for some reason. Since the window where pages are not present is so
small, issues could lurk undetected for some time. I guess that should
show up with XPFO too, though.

> > 2. At least on x86, the obvious algorithm to do the free operation
> > with a single flush requires it.  Someone should probably confirm
> > that
> > arm’s TLB works the same way, i.e. that no flush is needed when
> > changing from not-present (or whatever ARM calls it) to RW.
> > 
> 
> Good point. ARM is similar in this regard, although we'll probably
> clear the access flag rather than unmap the page entirely (which is
> treated the same way in terms of required TLB management).
How about set_alias_nv (for "not valid") and set_alias_default for the
names? That covers the general behavior of the mapping not being
cacheable in the TLB.
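
As a rough sketch, the proposed names could take the arch-overridable
#ifndef shape suggested earlier in the thread. Everything below is
hypothetical and not from any posted patch:

#include <linux/mm_types.h>

#ifndef set_alias_nv
/*
 * Make the direct map alias(es) of the page not valid, i.e. not
 * cacheable in the TLB. Default: no direct map permissions to
 * manage, so there is nothing to do.
 */
static inline void set_alias_nv(struct page *page, int numpages) { }

/* Restore the direct map alias(es) to the default (valid, RW) state. */
static inline void set_alias_default(struct page *page, int numpages) { }
#endif

Architectures like x86 and arm64 would then provide real
implementations and #define the names to signal that.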

Also, FYI for anyone who is following this - Nadav and I have discussed
merging this with the text poke patchset because of the overlap. With
the US holidays, I may not get this done and tested until the first
week of January. I'll go back and make the efficient direct map
permissions part arch-generic now, too.

> > 3. Anyone playing with XPFO wants this facility anyway.  In fact,
> > with
> > this change, Rick’s series will more or less implement XPFO for
> > vmalloc memory :)
> > 
> > Does that seem reasonable to you?
> 
> Absolutely.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
  2018-12-21 19:57           ` Edgecombe, Rick P
@ 2018-12-22 11:12             ` Ard Biesheuvel
  0 siblings, 0 replies; 26+ messages in thread
From: Ard Biesheuvel @ 2018-12-22 11:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: luto, linux-kernel, daniel, jeyu, rostedt, ast, linux-mm, jannh,
	Dock, Deneen T, kristen, akpm, will.deacon, mingo, namit,
	kernel-hardening, Keshavamurthy, Anil S, mhiramat, naveen.n.rao

On Fri, 21 Dec 2018 at 20:57, Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2018-12-21 at 18:25 +0100, Ard Biesheuvel wrote:
> > On Fri, 21 Dec 2018 at 18:12, Andy Lutomirski <luto@kernel.org>
> > wrote:
> > > > On Dec 21, 2018, at 9:39 AM, Ard Biesheuvel <
> > > > ard.biesheuvel@linaro.org> wrote:
> > > >
> > > > > On Wed, 12 Dec 2018 at 03:20, Andy Lutomirski <luto@kernel.org>
> > > > > wrote:
> > > > >
> > > > > On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe
> > > > > <rick.p.edgecombe@intel.com> wrote:
> > > > > > This adds two new flags VM_IMMEDIATE_UNMAP and
> > > > > > VM_HAS_SPECIAL_PERMS, for
> > > > > > enabling vfree operations to immediately clear executable TLB
> > > > > > entries to freed
> > > > > > pages, and handle freeing memory with special permissions.
> > > > > >
> > > > > > In order to support vfree being called on memory that might
> > > > > > be RO, the vfree deferred list node is moved from where it
> > > > > > lives today (reusing the allocation being freed) into a
> > > > > > separately kmalloc'ed struct.
> > > > > >
> > > > > > arch_vunmap is a new __weak function that implements the
> > > > > > actual unmapping and
> > > > > > resetting of the direct map permissions. It can be overridden
> > > > > > by more efficient
> > > > > > architecture specific implementations.
> > > > > >
> > > > > > The default implementation uses architecture-agnostic
> > > > > > methods equivalent to what most callers already do before
> > > > > > calling vfree, so that logic is now just centralized here.
> > > > > >
> > > > > > This implementation derives from two sketches from Dave
> > > > > > Hansen and Andy
> > > > > > Lutomirski.
> > > > > >
> > > > > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > > > > Suggested-by: Andy Lutomirski <luto@kernel.org>
> > > > > > Suggested-by: Will Deacon <will.deacon@arm.com>
> > > > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > ---
> > > > > > include/linux/vmalloc.h |  2 ++
> > > > > > mm/vmalloc.c            | 73
> > > > > > +++++++++++++++++++++++++++++++++++++----
> > > > > > 2 files changed, 69 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/vmalloc.h
> > > > > > b/include/linux/vmalloc.h
> > > > > > index 398e9c95cd61..872bcde17aca 100644
> > > > > > --- a/include/linux/vmalloc.h
> > > > > > +++ b/include/linux/vmalloc.h
> > > > > > @@ -21,6 +21,8 @@ struct notifier_block;                /* in
> > > > > > notifier.h */
> > > > > > #define VM_UNINITIALIZED       0x00000020      /* vm_struct
> > > > > > is not fully initialized */
> > > > > > #define VM_NO_GUARD            0x00000040      /* don't add
> > > > > > guard page */
> > > > > > #define VM_KASAN               0x00000080      /* has
> > > > > > allocated kasan shadow memory */
> > > > > > +#define VM_IMMEDIATE_UNMAP     0x00000200      /* flush
> > > > > > before releasing pages */
> > > > > > +#define VM_HAS_SPECIAL_PERMS   0x00000400      /* may be
> > > > > > freed with special perms */
> > > > > > /* bits [20..32] reserved for arch specific ioremap internals
> > > > > > */
> > > > > >
> > > > > > /*
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 97d4b25d0373..02b284d2245a 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > > @@ -18,6 +18,7 @@
> > > > > > #include <linux/interrupt.h>
> > > > > > #include <linux/proc_fs.h>
> > > > > > #include <linux/seq_file.h>
> > > > > > +#include <linux/set_memory.h>
> > > > > > #include <linux/debugobjects.h>
> > > > > > #include <linux/kallsyms.h>
> > > > > > #include <linux/list.h>
> > > > > > @@ -38,6 +39,11 @@
> > > > > >
> > > > > > #include "internal.h"
> > > > > >
> > > > > > +struct vfree_work {
> > > > > > +       struct llist_node node;
> > > > > > +       void *addr;
> > > > > > +};
> > > > > > +
> > > > > > struct vfree_deferred {
> > > > > >        struct llist_head list;
> > > > > >        struct work_struct wq;
> > > > > > @@ -50,9 +56,13 @@ static void free_work(struct work_struct
> > > > > > *w)
> > > > > > {
> > > > > >        struct vfree_deferred *p = container_of(w, struct
> > > > > > vfree_deferred, wq);
> > > > > >        struct llist_node *t, *llnode;
> > > > > > +       struct vfree_work *cur;
> > > > > >
> > > > > > -       llist_for_each_safe(llnode, t, llist_del_all(&p-
> > > > > > >list))
> > > > > > -               __vunmap((void *)llnode, 1);
> > > > > > +       llist_for_each_safe(llnode, t, llist_del_all(&p-
> > > > > > >list)) {
> > > > > > +               cur = container_of(llnode, struct vfree_work,
> > > > > > node);
> > > > > > +               __vunmap(cur->addr, 1);
> > > > > > +               kfree(cur);
> > > > > > +       }
> > > > > > }
> > > > > >
> > > > > > /*** Page table manipulation functions ***/
> > > > > > @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const
> > > > > > void *addr)
> > > > > >        return NULL;
> > > > > > }
> > > > > >
> > > > > > +/*
> > > > > > + * This function handles unmapping and resetting the direct
> > > > > > map as efficiently
> > > > > > + * as it can with cross arch functions. The three categories
> > > > > > of architectures
> > > > > > + * are:
> > > > > > + *   1. Architectures with no set_memory implementations and
> > > > > > no direct map
> > > > > > + *      permissions.
> > > > > > + *   2. Architectures with set_memory implementations but no
> > > > > > direct map
> > > > > > + *      permissions
> > > > > > + *   3. Architectures with set_memory implementations and
> > > > > > direct map permissions
> > > > > > + */
> > > > > > +void __weak arch_vunmap(struct vm_struct *area, int
> > > > > > deallocate_pages)
> > > > >
> > > > > My general preference is to avoid __weak functions -- they
> > > > > don't
> > > > > optimize well.  Instead, I prefer either:
> > > > >
> > > > > #ifndef arch_vunmap
> > > > > void arch_vunmap(...);
> > > > > #endif
> > > > >
> > > > > or
> > > > >
> > > > > #ifdef CONFIG_HAVE_ARCH_VUNMAP
> > > > > ...
> > > > > #endif
> > > > >
> > > > >
> > > > > > +{
> > > > > > +       unsigned long addr = (unsigned long)area->addr;
> > > > > > +       int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> > > > > > +       int special = area->flags & VM_HAS_SPECIAL_PERMS;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * In case of 2 and 3, use this general way of
> > > > > > resetting the permissions
> > > > > > +        * on the directmap. Do NX before RW, in case of X,
> > > > > > so there is no W^X
> > > > > > +        * violation window.
> > > > > > +        *
> > > > > > +        * For case 1 these will be noops.
> > > > > > +        */
> > > > > > +       if (immediate)
> > > > > > +               set_memory_nx(addr, area->nr_pages);
> > > > > > +       if (deallocate_pages && special)
> > > > > > +               set_memory_rw(addr, area->nr_pages);
> > > > >
> > > > > Can you elaborate on the intent here?  VM_IMMEDIATE_UNMAP means
> > > > > "I
> > > > > want that alias gone before any deallocation happens".
> > > > > VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix
> > > > > it for
> > > > > me, please".  deallocate means "this was vfree -- please free
> > > > > the
> > > > > pages".  I'm not convinced that all the various combinations
> > > > > make
> > > > > sense.  Do we really need both flags?
> > > > >
> > > > > (VM_IMMEDIATE_UNMAP is a bit of a lie, since, if
> > > > > in_interrupt(), it's
> > > > > not immediate.)
> > > > >
> > > > > If we do keep both flags, maybe some restructuring would make
> > > > > sense,
> > > > > like this, perhaps.  Sorry about horrible whitespace damage.
> > > > >
> > > > > if (special) {
> > > > >  /* VM_HAS_SPECIAL_PERMS makes little sense without
> > > > > deallocate_pages. */
> > > > >  WARN_ON_ONCE(!deallocate_pages);
> > > > >
> > > > >  if (immediate) {
> > > > >    /* It's possible that the vmap alias is X and we're about to
> > > > > make
> > > > > the direct map RW.  To avoid a window where executable memory
> > > > > is
> > > > > writable, first mark the vmap alias NX.  This is silly, since
> > > > > we're
> > > > > about to *unmap* it, but this is the best we can do if all we
> > > > > have to
> > > > > work with is the set_memory_abc() APIs.  Architectures should
> > > > > override
> > > > > this whole function to get better behavior. */
> > > >
> > > > So can't we fix this first? Assuming that architectures that
> > > > bother to
> > > > implement them will not have executable mappings in the linear
> > > > region,
> > > > all we'd need is set_linear_range_ro/rw() routines that default
> > > > to
> > > > doing nothing, and encapsulate the existing code for x86 and
> > > > arm64.
> > > > That way, we can do things in the proper order, i.e.,
> > > > release
> > > > the vmalloc mapping (without caring about the permissions),
> > > > restore
> > > > the linear alias attributes, and finally release the pages.
> > >
> > > Seems reasonable, except that I think it should be
> > > set_linear_range_not_present() and set_linear_range_rw(), for three
> > > reasons:
> > >
> > > 1. It’s not at all clear to me that we need to keep the linear
> > > mapping
> > > around for modules.
> > >
> >
> > I'm pretty sure hibernate on arm64 will have to be fixed, since it
> > expects to be able to read all valid pages via the linear map. But we
> > can fix that.
> Hmm, now I wonder what else might be trying to access the entire direct
> map for some reason. Since the window where pages are not present is so
> small, issues could lurk undetected for some time. I guess that should
> show up with XPFO too, though.
>

I don't think there is usually a need to scan the entire address space
like that, unless you are trying to preserve the contents and write
them to disk, like in the hibernate case.

However, IIUC, hibernate on arm64 can already deal with
debug_pagealloc, which relies on clearing the access flag as well, so
if we stick with that we should be ok, I guess.

> > > 2. At least on x86, the obvious algorithm to do the free operation
> > > with a single flush requires it.  Someone should probably confirm
> > > that
> > > arm’s TLB works the same way, i.e. that no flush is needed when
> > > changing from not-present (or whatever ARM calls it) to RW.
> > >
> >
> > Good point. ARM is similar in this regard, although we'll probably
> > clear the access flag rather than unmap the page entirely (which is
> > treated the same way in terms of required TLB management).
> How about set_alias_nv (for "not valid") and set_alias_default for the
> names? That covers the general behavior of the mapping not being
> cacheable in the TLB.
>

Works for me

> Also, FYI for anyone who is following this - Nadav and I have discussed
> merging this with the text poke patchset because of the overlap. With
> the US holidays, I may not get this done and tested until the first
> week of January. I'll go back and make the efficient direct map
> permissions part arch-generic now, too.
>

Excellent!

> > > 3. Anyone playing with XPFO wants this facility anyway.  In fact,
> > > with
> > > this change, Rick’s series will more or less implement XPFO for
> > > vmalloc memory :)
> > >
> > > Does that seem reasonable to you?
> >
> > Absolutely.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-12-22 11:12 UTC | newest]

Thread overview: 26+ messages
2018-12-12  0:03 [PATCH v2 0/4] Don’t leave executable TLB entries to freed pages Rick Edgecombe
2018-12-12  0:03 ` [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms Rick Edgecombe
2018-12-12  2:20   ` Andy Lutomirski
2018-12-12 19:50     ` Edgecombe, Rick P
2018-12-12 19:57       ` Andy Lutomirski
2018-12-12 22:01         ` Edgecombe, Rick P
2018-12-15 18:52           ` Andy Lutomirski
2018-12-18  0:23             ` Edgecombe, Rick P
2018-12-18  1:02               ` Andy Lutomirski
2018-12-21 16:39     ` Ard Biesheuvel
2018-12-21 17:12       ` Andy Lutomirski
2018-12-21 17:25         ` Ard Biesheuvel
2018-12-21 19:57           ` Edgecombe, Rick P
2018-12-22 11:12             ` Ard Biesheuvel
2018-12-12  0:03 ` [PATCH v2 2/4] modules: Add new special vfree flags Rick Edgecombe
2018-12-12 23:40   ` Nadav Amit
2018-12-13 19:02     ` Edgecombe, Rick P
2018-12-13 19:27       ` Nadav Amit
2018-12-13 21:48         ` Edgecombe, Rick P
2018-12-12  0:03 ` [PATCH v2 3/4] bpf: switch to new vmalloc " Rick Edgecombe
2018-12-12  0:03 ` [PATCH v2 4/4] x86/vmalloc: Add TLB efficient x86 arch_vunmap Rick Edgecombe
2018-12-12  2:24   ` Andy Lutomirski
2018-12-12 19:51     ` Edgecombe, Rick P
2018-12-12  6:30   ` Nadav Amit
2018-12-12 21:05     ` Edgecombe, Rick P
2018-12-12 21:16       ` Nadav Amit
