* [PATCH v8 0/4] KASLR feature to randomize each loadable module
From: Rick Edgecombe @ 2018-11-02 19:25 UTC (permalink / raw)
  To: jeyu, akpm, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook
  Cc: kristen, dave.hansen, arjan, Rick Edgecombe

Hi,

This is V8 of the "KASLR feature to randomize each loadable module" patchset.
The purpose is to increase the randomization and also to randomize the modules
in relation to each other instead of just the base, so that if one module's
location leaks, the locations of the others can't be inferred.

This version gets rid of the more complex, higher-LOC new logic in vmalloc
that helped optimize the lazy free area case, and hopefully makes this
patchset more straightforward. Earlier versions were concerned with handling
that case efficiently, but I have learned such cases are actually not common
in real world module loader usage. So instead there are some smaller tweaks to
the existing vmalloc logic that allow an allocation to be tried without
triggering a purge_vmap_area_lazy() and retry when it encounters a real (non
lazy free) area. The kselftest simulations have been updated to model init
sections getting cleaned up as well.
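
For reference, the shape of that tweak (a simplified sketch of the change to
alloc_vmap_area() in patch 1, not the verbatim diff):

	int purged = try_purge & VMAP_NO_PURGE;	/* pre-set: skip the purge */
retry:
	/* ... search for a free area; if nothing fits: */
	if (!purged) {
		purge_vmap_area_lazy();	/* may trigger a TLB flush */
		purged = 1;
		goto retry;
	}
	return ERR_PTR(-EBUSY);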

There is a small allocation performance degradation versus v7 as a trade-off,
but it is still faster on average than the existing algorithm until >7000
modules.

Changes for V8:
 - Simplify code by removing logic for optimum handling of lazy free areas

Changes for V7:
 - More 0-day build fixes, readability improvements (Kees Cook)

Changes for V6:
 - 0-day build fixes by removing unneeded functional testing, more error
   handling

Changes for V5:
 - Add module_alloc test module

Changes for V4:
 - Fix issue caused by KASAN and kmemleak being provided different allocation
   lengths (padding).
 - Avoid kmalloc until sure it's needed in __vmalloc_node_try_addr.
 - Fixed issues reported by 0-day.

Changes for V3:
 - Code cleanup based on internal feedback. (thanks to Dave Hansen and Andriy
   Shevchenko)
 - Slight refactor of existing algorithm to more cleanly live alongside the
   new one.
 - BPF synthetic benchmark

Changes for V2:
 - New implementation of __vmalloc_node_try_addr based on the
   __vmalloc_node_range implementation, that only flushes TLB when needed.
 - Modified module loading algorithm to try to reduce the TLB flushes further.
 - Increase "random area" tries in order to increase the number of modules that
   can get high randomness.
 - Increase "random area" size to 2/3 of module area in order to increase the
   number of modules that can get high randomness.
 - Fix for 0day failures on other architectures.
 - Fix for wrong debugfs permissions. (thanks to Jann Horn)
 - Spelling fix. (thanks to Jann Horn)
 - Data on module_alloc performance and TLB flushes. (brought up by Kees Cook
   and Jann Horn)
 - Data on memory usage. (suggested by Jann)


Rick Edgecombe (4):
  vmalloc: Add __vmalloc_node_try_addr function
  x86/modules: Increase randomization for modules
  vmalloc: Add debugfs modfraginfo
  Kselftest for module text allocation benchmarking

 arch/x86/Kconfig                              |   3 +
 arch/x86/include/asm/kaslr_modules.h          |  38 ++
 arch/x86/include/asm/pgtable_64_types.h       |   7 +
 arch/x86/kernel/module.c                      | 111 ++++--
 include/linux/vmalloc.h                       |   3 +
 lib/Kconfig.debug                             |   9 +
 lib/Makefile                                  |   1 +
 lib/test_mod_alloc.c                          | 343 ++++++++++++++++++
 mm/vmalloc.c                                  | 228 ++++++++++--
 tools/testing/selftests/bpf/test_mod_alloc.sh |  29 ++
 10 files changed, 711 insertions(+), 61 deletions(-)
 create mode 100644 arch/x86/include/asm/kaslr_modules.h
 create mode 100644 lib/test_mod_alloc.c
 create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

-- 
2.17.1



* [PATCH v8 1/4] vmalloc: Add __vmalloc_node_try_addr function
From: Rick Edgecombe @ 2018-11-02 19:25 UTC (permalink / raw)
  To: jeyu, akpm, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook
  Cc: kristen, dave.hansen, arjan, Rick Edgecombe

Create a __vmalloc_node_try_addr function that tries to allocate at a specific
address without triggering any lazy purging. In order to support this behavior,
a try_purge argument was plugged into several of the static helpers.

This also changes logic in __get_vm_area_node to be faster in cases where
allocations fail due to no space, which is a lot more common when trying
specific addresses.
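
For illustration, a caller uses the new function roughly like this (a sketch
based on the x86 module_alloc rework in patch 2, not code in this patch):

	/* Try one specific address; no purge, no retry on failure */
	p = __vmalloc_node_try_addr(addr, size, GFP_KERNEL, PAGE_KERNEL_EXEC,
				0, NUMA_NO_NODE, __builtin_return_address(0));
	if (!p)
		/* Spot occupied or too small; fall back to a range search */
		p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
				MODULES_END, GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
				NUMA_NO_NODE, __builtin_return_address(0));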

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/vmalloc.h |   3 +
 mm/vmalloc.c            | 128 +++++++++++++++++++++++++++++-----------
 2 files changed, 95 insertions(+), 36 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..6eaa89612372 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+			gfp_t gfp_mask,	pgprot_t prot, unsigned long vm_flags,
+			int node, const void *caller);
 #ifndef CONFIG_MMU
 extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a728fc492557..8d01f503e20d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -326,6 +326,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 #define VM_LAZY_FREE	0x02
 #define VM_VM_AREA	0x04
 
+#define VMAP_MAY_PURGE	0x2
+#define VMAP_NO_PURGE	0x1
+
 static DEFINE_SPINLOCK(vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
@@ -402,12 +405,12 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
-				int node, gfp_t gfp_mask)
+				int node, gfp_t gfp_mask, int try_purge)
 {
 	struct vmap_area *va;
 	struct rb_node *n;
 	unsigned long addr;
-	int purged = 0;
+	int purged = try_purge & VMAP_NO_PURGE;
 	struct vmap_area *first;
 
 	BUG_ON(!size);
@@ -860,7 +863,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
-					node, gfp_mask);
+					node, gfp_mask, VMAP_MAY_PURGE);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -1170,8 +1173,9 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 		addr = (unsigned long)mem;
 	} else {
 		struct vmap_area *va;
-		va = alloc_vmap_area(size, PAGE_SIZE,
-				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+		va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START,
+					VMALLOC_END, node, GFP_KERNEL,
+					VMAP_MAY_PURGE);
 		if (IS_ERR(va))
 			return NULL;
 
@@ -1372,7 +1376,8 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm)
 
 static struct vm_struct *__get_vm_area_node(unsigned long size,
 		unsigned long align, unsigned long flags, unsigned long start,
-		unsigned long end, int node, gfp_t gfp_mask, const void *caller)
+		unsigned long end, int node, gfp_t gfp_mask, int try_purge,
+		const void *caller)
 {
 	struct vmap_area *va;
 	struct vm_struct *area;
@@ -1386,16 +1391,17 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 		align = 1ul << clamp_t(int, get_count_order_long(size),
 				       PAGE_SHIFT, IOREMAP_MAX_ORDER);
 
-	area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
-	if (unlikely(!area))
-		return NULL;
-
 	if (!(flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
 
-	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
-	if (IS_ERR(va)) {
-		kfree(area);
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask,
+				try_purge);
+	if (IS_ERR(va))
+		return NULL;
+
+	area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
+	if (unlikely(!area)) {
+		free_vmap_area(va);
 		return NULL;
 	}
 
@@ -1408,7 +1414,8 @@ struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
 				unsigned long start, unsigned long end)
 {
 	return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
-				  GFP_KERNEL, __builtin_return_address(0));
+				  GFP_KERNEL, VMAP_MAY_PURGE,
+				  __builtin_return_address(0));
 }
 EXPORT_SYMBOL_GPL(__get_vm_area);
 
@@ -1417,7 +1424,7 @@ struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
 				       const void *caller)
 {
 	return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
-				  GFP_KERNEL, caller);
+				  GFP_KERNEL, VMAP_MAY_PURGE, caller);
 }
 
 /**
@@ -1432,7 +1439,7 @@ struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
 struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
 {
 	return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
-				  NUMA_NO_NODE, GFP_KERNEL,
+				  NUMA_NO_NODE, GFP_KERNEL, VMAP_MAY_PURGE,
 				  __builtin_return_address(0));
 }
 
@@ -1440,7 +1447,8 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
 				const void *caller)
 {
 	return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
-				  NUMA_NO_NODE, GFP_KERNEL, caller);
+				  NUMA_NO_NODE, GFP_KERNEL, VMAP_MAY_PURGE,
+				  caller);
 }
 
 /**
@@ -1709,26 +1717,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	return NULL;
 }
 
-/**
- *	__vmalloc_node_range  -  allocate virtually contiguous memory
- *	@size:		allocation size
- *	@align:		desired alignment
- *	@start:		vm area range start
- *	@end:		vm area range end
- *	@gfp_mask:	flags for the page level allocator
- *	@prot:		protection mask for the allocated pages
- *	@vm_flags:	additional vm area flags (e.g. %VM_NO_GUARD)
- *	@node:		node to use for allocation or NUMA_NO_NODE
- *	@caller:	caller's return address
- *
- *	Allocate enough pages to cover @size from the page level
- *	allocator with @gfp_mask flags.  Map them into contiguous
- *	kernel virtual space, using a pagetable protection of @prot.
- */
-void *__vmalloc_node_range(unsigned long size, unsigned long align,
+static void *__vmalloc_node_range_opts(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
-			const void *caller)
+			int try_purge, const void *caller)
 {
 	struct vm_struct *area;
 	void *addr;
@@ -1739,7 +1731,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		goto fail;
 
 	area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
-				vm_flags, start, end, node, gfp_mask, caller);
+				vm_flags, start, end, node, gfp_mask,
+				try_purge, caller);
 	if (!area)
 		goto fail;
 
@@ -1764,6 +1757,69 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	return NULL;
 }
 
+/**
+ *	__vmalloc_node_range  -  allocate virtually contiguous memory
+ *	@size:		allocation size
+ *	@align:		desired alignment
+ *	@start:		vm area range start
+ *	@end:		vm area range end
+ *	@gfp_mask:	flags for the page level allocator
+ *	@prot:		protection mask for the allocated pages
+ *	@vm_flags:	additional vm area flags (e.g. %VM_NO_GUARD)
+ *	@node:		node to use for allocation or NUMA_NO_NODE
+ *	@caller:	caller's return address
+ *
+ *	Allocate enough pages to cover @size from the page level
+ *	allocator with @gfp_mask flags.  Map them into contiguous
+ *	kernel virtual space, using a pagetable protection of @prot.
+ */
+void *__vmalloc_node_range(unsigned long size, unsigned long align,
+			unsigned long start, unsigned long end, gfp_t gfp_mask,
+			pgprot_t prot, unsigned long vm_flags, int node,
+			const void *caller)
+{
+	return __vmalloc_node_range_opts(size, align, start, end, gfp_mask,
+					prot, vm_flags, node, VMAP_MAY_PURGE,
+					caller);
+}
+
+/**
+ *	__vmalloc_try_addr  -  try to alloc at a specific address
+ *	@addr:		address to try
+ *	@size:		size to try
+ *	@gfp_mask:	flags for the page level allocator
+ *	@prot:		protection mask for the allocated pages
+ *	@vm_flags:	additional vm area flags (e.g. %VM_NO_GUARD)
+ *	@node:		node to use for allocation or NUMA_NO_NODE
+ *	@caller:	caller's return address
+ *
+ *	Try to allocate at the specific address. If it succeeds the address is
+ *	returned. If it fails NULL is returned.  It will not try to purge lazy
+ *	free vmap areas in order to fit.
+ */
+void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+			gfp_t gfp_mask,	pgprot_t prot, unsigned long vm_flags,
+			int node, const void *caller)
+{
+	unsigned long addr_end;
+	unsigned long vsize = PAGE_ALIGN(size);
+
+	if (!vsize || (vsize >> PAGE_SHIFT) > totalram_pages)
+		return NULL;
+
+	if (!(vm_flags & VM_NO_GUARD))
+		vsize += PAGE_SIZE;
+
+	addr_end = addr + vsize;
+
+	if (addr > addr_end)
+		return NULL;
+
+	return __vmalloc_node_range_opts(size, 1, addr, addr_end,
+				gfp_mask | __GFP_NOWARN, prot, vm_flags, node,
+				VMAP_NO_PURGE, caller);
+}
+
 /**
  *	__vmalloc_node  -  allocate virtually contiguous memory
  *	@size:		allocation size
-- 
2.17.1



* [PATCH v8 2/4] x86/modules: Increase randomization for modules
From: Rick Edgecombe @ 2018-11-02 19:25 UTC (permalink / raw)
  To: jeyu, akpm, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook
  Cc: kristen, dave.hansen, arjan, Rick Edgecombe

This changes the behavior of the KASLR logic for allocating memory for the text
sections of loadable modules. It randomizes the location of each module text
section with about 17 bits of entropy in typical use. This is enabled on X86_64
only. For 32 bit, the behavior is unchanged.

It refactors the existing code around module randomization somewhat. There are
now three different behaviors for x86 module_alloc depending on config:
RANDOMIZE_BASE=n; RANDOMIZE_BASE=y with ARCH=x86_64; and RANDOMIZE_BASE=y with
ARCH=i386. The refactor aims to show clearly what those behaviors are without
keeping three separate versions or threading the behaviors through a bunch of
little spots. The reason it is not enabled on 32 bit yet is that the module
space there is much smaller and simulations haven't been run to see how it
performs.

The new algorithm breaks the module space in two: a random area and a backup
area. It first tries to allocate at a number of randomly located starting pages
inside the random area, without purging any lazy free vmap areas and triggering
the associated TLB flush. If this fails, it allocates in the backup area. The
backup area base is offset in the same way the current algorithm offsets the
module base: one of 1024 possible page locations.
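
A rough back-of-the-envelope check of the 17 bit figure (assuming 4 KiB
MODULE_ALIGN and the 1 GiB x86_64 module space):

	MODULES_LEN      = 1 GiB
	MODULES_RAND_LEN = PAGE_ALIGN((MODULES_LEN/3)*2) ~= 683 MiB
	start positions  ~= 683 MiB / 4 KiB ~= 174762 ~= 2^17.4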

Due to boot_params being defined with different types in different places,
placing the config helpers in modules.h or kaslr.h caused conflicts elsewhere,
so they are placed in a new file, kaslr_modules.h, instead.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/Kconfig                        |   3 +
 arch/x86/include/asm/kaslr_modules.h    |  38 ++++++++
 arch/x86/include/asm/pgtable_64_types.h |   7 ++
 arch/x86/kernel/module.c                | 111 +++++++++++++++++++-----
 4 files changed, 136 insertions(+), 23 deletions(-)
 create mode 100644 arch/x86/include/asm/kaslr_modules.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..32e1ac2e052d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2137,6 +2137,9 @@ config RANDOMIZE_BASE
 
 	  If unsure, say Y.
 
+config RANDOMIZE_FINE_MODULE
+	def_bool y if RANDOMIZE_BASE && X86_64 && !UML
+
 # Relocation on x86 needs some additional build support
 config X86_NEED_RELOCS
 	def_bool y
diff --git a/arch/x86/include/asm/kaslr_modules.h b/arch/x86/include/asm/kaslr_modules.h
new file mode 100644
index 000000000000..1da6eced4b47
--- /dev/null
+++ b/arch/x86/include/asm/kaslr_modules.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_KASLR_MODULES_H_
+#define _ASM_KASLR_MODULES_H_
+
+#ifdef CONFIG_RANDOMIZE_BASE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_base(void)
+{
+	return kaslr_enabled();
+}
+#else
+static inline int kaslr_mod_randomize_base(void)
+{
+	return 0;
+}
+#endif /* CONFIG_RANDOMIZE_BASE */
+
+#ifdef CONFIG_RANDOMIZE_FINE_MODULE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_each_module(void)
+{
+	return kaslr_enabled();
+}
+
+static inline unsigned long get_modules_rand_len(void)
+{
+	return MODULES_RAND_LEN;
+}
+#else
+static inline int kaslr_mod_randomize_each_module(void)
+{
+	return 0;
+}
+
+unsigned long get_modules_rand_len(void);
+#endif /* CONFIG_RANDOMIZE_FINE_MODULE */
+
+#endif
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..5e26369ab86c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -143,6 +143,13 @@ extern unsigned int ptrs_per_p4d;
 #define MODULES_END		_AC(0xffffffffff000000, UL)
 #define MODULES_LEN		(MODULES_END - MODULES_VADDR)
 
+/*
+ * Dedicate the first part of the module space to a randomized area when KASLR
+ * is in use.  Leave the remaining part for a fallback if we are unable to
+ * allocate in the random area.
+ */
+#define MODULES_RAND_LEN	PAGE_ALIGN((MODULES_LEN/3)*2)
+
 #define ESPFIX_PGD_ENTRY	_AC(-2, UL)
 #define ESPFIX_BASE_ADDR	(ESPFIX_PGD_ENTRY << P4D_SHIFT)
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f58336af095c..183f70730cda 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -36,6 +36,7 @@
 #include <asm/pgtable.h>
 #include <asm/setup.h>
 #include <asm/unwind.h>
+#include <asm/kaslr_modules.h>
 
 #if 0
 #define DEBUGP(fmt, ...)				\
@@ -48,34 +49,96 @@ do {							\
 } while (0)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_BASE
 static unsigned long module_load_offset;
+static const unsigned long NO_TRY_RAND = 10000;
 
 /* Mutex protects the module_load_offset. */
 static DEFINE_MUTEX(module_kaslr_mutex);
 
 static unsigned long int get_module_load_offset(void)
 {
-	if (kaslr_enabled()) {
-		mutex_lock(&module_kaslr_mutex);
-		/*
-		 * Calculate the module_load_offset the first time this
-		 * code is called. Once calculated it stays the same until
-		 * reboot.
-		 */
-		if (module_load_offset == 0)
-			module_load_offset =
-				(get_random_int() % 1024 + 1) * PAGE_SIZE;
-		mutex_unlock(&module_kaslr_mutex);
-	}
+	mutex_lock(&module_kaslr_mutex);
+	/*
+	 * Calculate the module_load_offset the first time this
+	 * code is called. Once calculated it stays the same until
+	 * reboot.
+	 */
+	if (module_load_offset == 0)
+		module_load_offset = (get_random_int() % 1024 + 1) * PAGE_SIZE;
+	mutex_unlock(&module_kaslr_mutex);
+
 	return module_load_offset;
 }
-#else
-static unsigned long int get_module_load_offset(void)
+
+static unsigned long get_module_vmalloc_start(void)
 {
-	return 0;
+	unsigned long addr = MODULES_VADDR;
+
+	if (kaslr_mod_randomize_base())
+		addr += get_module_load_offset();
+
+	if (kaslr_mod_randomize_each_module())
+		addr += get_modules_rand_len();
+
+	return addr;
+}
+
+static void *try_module_alloc(unsigned long addr, unsigned long size)
+{
+	const unsigned long vm_flags = 0;
+
+	return __vmalloc_node_try_addr(addr, size, GFP_KERNEL, PAGE_KERNEL_EXEC,
+					vm_flags, NUMA_NO_NODE,
+					__builtin_return_address(0));
+}
+
+/*
+ * Find a random address to try that won't obviously not fit. Random areas are
+ * allowed to overflow into the backup area
+ */
+static unsigned long get_rand_module_addr(unsigned long size)
+{
+	unsigned long nr_max_pos = (MODULES_LEN - size) / MODULE_ALIGN + 1;
+	unsigned long nr_rnd_pos = get_modules_rand_len() / MODULE_ALIGN;
+	unsigned long nr_pos = min(nr_max_pos, nr_rnd_pos);
+
+	unsigned long module_position_nr = get_random_long() % nr_pos;
+	unsigned long offset = module_position_nr * MODULE_ALIGN;
+
+	return MODULES_VADDR + offset;
+}
+
+/*
+ * Try to allocate in the random area, making up to NO_TRY_RAND attempts, all
+ * without purging. If they all fail, return NULL.
+ */
+static void *try_module_randomize_each(unsigned long size)
+{
+	void *p = NULL;
+	unsigned int i;
+
+	/* This will have a guard page */
+	unsigned long va_size = PAGE_ALIGN(size) + PAGE_SIZE;
+
+	if (!kaslr_mod_randomize_each_module())
+		return NULL;
+
+	/* Make sure there is at least one address that might fit. */
+	if (va_size < PAGE_ALIGN(size) || va_size > MODULES_LEN)
+		return NULL;
+
+	/* Try to find a spot that doesn't need a lazy purge */
+	for (i = 0; i < NO_TRY_RAND; i++) {
+		unsigned long addr = get_rand_module_addr(va_size);
+
+		p = try_module_alloc(addr, size);
+
+		if (p)
+			return p;
+	}
+
+	return NULL;
 }
-#endif
 
 void *module_alloc(unsigned long size)
 {
@@ -84,16 +147,18 @@ void *module_alloc(unsigned long size)
 	if (PAGE_ALIGN(size) > MODULES_LEN)
 		return NULL;
 
-	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				    MODULES_VADDR + get_module_load_offset(),
-				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	p = try_module_randomize_each(size);
+
+	if (!p)
+		p = __vmalloc_node_range(size, MODULE_ALIGN,
+				get_module_vmalloc_start(), MODULES_END,
+				GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
+				NUMA_NO_NODE, __builtin_return_address(0));
+
 	if (p && (kasan_module_alloc(p, size) < 0)) {
 		vfree(p);
 		return NULL;
 	}
-
 	return p;
 }
 
-- 
2.17.1



* [PATCH v8 3/4] vmalloc: Add debugfs modfraginfo
From: Rick Edgecombe @ 2018-11-02 19:25 UTC (permalink / raw)
  To: jeyu, akpm, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook
  Cc: kristen, dave.hansen, arjan, Rick Edgecombe

Add a debugfs file "modfraginfo" that provides info on module space
fragmentation. This can be used to determine whether loadable module
randomization is causing problems for extreme module loading situations, like
huge numbers of modules or extremely large modules.
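
With debugfs mounted in the usual place, the file can be read directly:

	cat /sys/kernel/debug/modfraginfo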

Sample output when KASLR is enabled and X86_64 is configured:
	Largest free space:	897912 kB
	  Total free space:	1025424 kB
Allocations in backup area:	0

Sample output when X86_64 is configured but KASLR is not enabled:
	Largest free space:	897912 kB
	  Total free space:	1025424 kB

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 mm/vmalloc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8d01f503e20d..8d91901ba8f0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
 #include <linux/interrupt.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/debugfs.h>
 #include <linux/debugobjects.h>
 #include <linux/kallsyms.h>
 #include <linux/list.h>
@@ -36,6 +37,12 @@
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
+#ifdef CONFIG_X86
+#include <asm/page_types.h>
+#include <asm/setup.h>
+#include <asm/kaslr_modules.h>
+#endif
+
 #include "internal.h"
 
 struct vfree_deferred {
@@ -2411,7 +2418,6 @@ void free_vm_area(struct vm_struct *area)
 }
 EXPORT_SYMBOL_GPL(free_vm_area);
 
-#ifdef CONFIG_SMP
 static struct vmap_area *node_to_va(struct rb_node *n)
 {
 	return rb_entry_safe(n, struct vmap_area, rb_node);
@@ -2459,6 +2465,7 @@ static bool pvm_find_next_prev(unsigned long end,
 	return true;
 }
 
+#ifdef CONFIG_SMP
 /**
  * pvm_determine_end - find the highest aligned address between two vmap_areas
  * @pnext: in/out arg for the next vmap_area
@@ -2800,7 +2807,96 @@ static int __init proc_vmalloc_init(void)
 		proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
 	return 0;
 }
-module_init(proc_vmalloc_init);
+#elif defined(CONFIG_DEBUG_FS)
+static int __init proc_vmalloc_init(void)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_RANDOMIZE_FINE_MODULE)
+static inline unsigned long is_in_backup(unsigned long addr)
+{
+	return addr >= MODULES_VADDR + get_modules_rand_len();
+}
+
+static int modulefraginfo_debug_show(struct seq_file *m, void *v)
+{
+	unsigned long last_end = MODULES_VADDR;
+	unsigned long total_free = 0;
+	unsigned long largest_free = 0;
+	unsigned long backup_cnt = 0;
+	unsigned long gap;
+	struct vmap_area *prev, *cur = NULL;
+
+	spin_lock(&vmap_area_lock);
+
+	if (!pvm_find_next_prev(MODULES_VADDR, &cur, &prev) || !cur)
+		goto done;
+
+	for (; cur->va_end <= MODULES_END; cur = list_next_entry(cur, list)) {
+		/* Don't count areas that are marked to be lazily freed */
+		if (!(cur->flags & VM_LAZY_FREE)) {
+			if (kaslr_mod_randomize_each_module())
+				backup_cnt += is_in_backup(cur->va_start);
+			gap = cur->va_start - last_end;
+			if (gap > largest_free)
+				largest_free = gap;
+			total_free += gap;
+			last_end = cur->va_end;
+		}
+
+		if (list_is_last(&cur->list, &vmap_area_list))
+			break;
+	}
+
+done:
+	gap = (MODULES_END - last_end);
+	if (gap > largest_free)
+		largest_free = gap;
+	total_free += gap;
 
+	spin_unlock(&vmap_area_lock);
+
+	seq_printf(m, "\tLargest free space:\t%lu kB\n", largest_free / 1024);
+	seq_printf(m, "\t  Total free space:\t%lu kB\n", total_free / 1024);
+
+	if (kaslr_mod_randomize_each_module())
+		seq_printf(m, "Allocations in backup area:\t%lu\n", backup_cnt);
+
+	return 0;
+}
+
+static int proc_module_frag_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, modulefraginfo_debug_show, NULL);
+}
+
+static const struct file_operations debug_module_frag_operations = {
+	.open       = proc_module_frag_debug_open,
+	.read       = seq_read,
+	.llseek     = seq_lseek,
+	.release    = single_release,
+};
+
+static void __init debug_modfrag_init(void)
+{
+	debugfs_create_file("modfraginfo", 0400, NULL, NULL,
+			&debug_module_frag_operations);
+}
+#elif defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static void __init debug_modfrag_init(void)
+{
+}
 #endif
 
+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static int __init info_vmalloc_init(void)
+{
+	proc_vmalloc_init();
+	debug_modfrag_init();
+	return 0;
+}
+
+module_init(info_vmalloc_init);
+#endif
-- 
2.17.1



* [PATCH v8 4/4] Kselftest for module text allocation benchmarking
From: Rick Edgecombe @ 2018-11-02 19:25 UTC (permalink / raw)
  To: jeyu, akpm, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook
  Cc: kristen, dave.hansen, arjan, Rick Edgecombe

This adds a test module in lib/, and a script in kselftest that does
benchmarking on the allocation of memory in the module space. Performance here
would have some small impact on kernel module insertions, BPF JIT insertions
and kprobes. In the case of KASLR features for the module space, this module
can be used to measure the allocation performance of different configurations.
This module needs to be compiled into the kernel because module_alloc is not
exported.

With some modification to the code, as explained in the comments, it can be
enabled to measure TLB flushes as well.
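
A sketch of what that modification looks like (my reading of the comment in
test_mod_alloc.c, not code included in this patch):

	/* mm/vmalloc.c: hypothetical instrumentation for this test only */
	unsigned long tlb_flushes_vmalloc;

	unsigned long get_tlb_flushes_vmalloc(void)
	{
		return tlb_flushes_vmalloc;
	}
	EXPORT_SYMBOL_GPL(get_tlb_flushes_vmalloc);

	/* in __purge_vmap_area_lazy(), next to flush_tlb_kernel_range(): */
	tlb_flushes_vmalloc++;

Then the static stub in test_mod_alloc.c is replaced with the extern
declaration given in its comment.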

There are two tests in the module. One allocates until failure in order to
test module capacity and the other times allocating space in the module area.
They both use module sizes that roughly approximate the distribution of in-tree
X86_64 modules.

You can control the number of modules used in the tests like this:
echo m1000>/sys/kernel/debug/mod_alloc_test

Run the test for module capacity like:
echo t1>/sys/kernel/debug/mod_alloc_test

The other test will measure the allocation time, and for CONFIG_X86_64 and
CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is used.

Run the test for allocation time and backup area usage like:
echo t2>/sys/kernel/debug/mod_alloc_test
The output will be something like this:
num		all(ns)		last(ns)
1000		1083		1099
Last module in backup count = 0
Total modules in backup     = 0
>1 module in backup count   = 0

To run a suite of allocation time tests for a collection of module counts, run:
tools/testing/selftests/bpf/test_mod_alloc.sh

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 lib/Kconfig.debug                             |   9 +
 lib/Makefile                                  |   1 +
 lib/test_mod_alloc.c                          | 343 ++++++++++++++++++
 tools/testing/selftests/bpf/test_mod_alloc.sh |  29 ++
 4 files changed, 382 insertions(+)
 create mode 100644 lib/test_mod_alloc.c
 create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4966c4fbe7f7..09273ef32be4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1883,6 +1883,15 @@ config TEST_BPF
 
 	  If unsure, say N.
 
+config TEST_MOD_ALLOC
+	bool "Tests for module allocator/vmalloc"
+	help
+	  This builds the "test_mod_alloc" module that performs performance
+	  tests on the module text section allocator. The module uses X86_64
+	  module text sizes for simulations.
+
+	  If unsure, say N.
+
 config FIND_BIT_BENCHMARK
 	tristate "Test find_bit functions"
 	help
diff --git a/lib/Makefile b/lib/Makefile
index 423876446810..a994240abf65 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -58,6 +58,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
+obj-$(CONFIG_TEST_MOD_ALLOC) += test_mod_alloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
 obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
 obj-$(CONFIG_TEST_SORT) += test_sort.o
diff --git a/lib/test_mod_alloc.c b/lib/test_mod_alloc.c
new file mode 100644
index 000000000000..afa13c29746f
--- /dev/null
+++ b/lib/test_mod_alloc.c
@@ -0,0 +1,343 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/random.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+
+struct mod { int filesize; int coresize; int initsize; };
+
+/* ==== Begin optional logging ==== */
+/*
+ * Note: In order to get an accurate count for the tlb flushes triggered in
+ * vmalloc, create a counter in vmalloc.c incremented in __purge_vmap_area_lazy
+ * with this method signature, export it, and replace the stub below with:
+ * extern unsigned long get_tlb_flushes_vmalloc(void);
+ */
+static unsigned long get_tlb_flushes_vmalloc(void)
+{
+	return 0;
+}
+
+/* ==== End optional logging ==== */
+
+
+#define MAX_ALLOC_CNT 20000
+#define ITERS 1000
+
+struct vm_alloc {
+	void *core;
+	unsigned long core_size;
+	void *init;
+};
+
+static struct vm_alloc *allocs_vm;
+static long mod_cnt;
+static DEFINE_MUTEX(test_mod_alloc_mutex);
+
+const static int core_hist[10] = {1, 5, 21, 46, 141, 245, 597, 2224, 1875, 0};
+const static int init_hist[10] = {0, 0, 0, 0, 10, 19, 70, 914, 3906, 236};
+const static int file_hist[10] = {6, 20, 55, 86, 286, 551, 918, 2024, 1028,
+					181};
+
+const static int bins[10] = {5000000, 2000000, 1000000, 500000, 200000, 100000,
+				50000, 20000, 10000, 5000};
+/*
+ * Rough approximation of the X86_64 module size distribution.
+ */
+static int get_mod_rand_size(const int *hist)
+{
+	int area_under = get_random_int() % 5155;
+	int i;
+	int last_bin = bins[0] + 1;
+	int sum = 0;
+
+	for (i = 0; i <= 9; i++) {
+		sum += hist[i];
+		if (area_under <= sum)
+			return bins[i]
+				+ (get_random_int() % (last_bin - bins[i]));
+		last_bin = bins[i];
+	}
+	return 4096;
+}
+
+static struct mod get_rand_module(void)
+{
+	struct mod ret;
+
+	ret.coresize = get_mod_rand_size(core_hist);
+	ret.initsize = get_mod_rand_size(init_hist);
+	ret.filesize = get_mod_rand_size(file_hist);
+	return ret;
+}
+
+static void do_test_alloc_fail(void)
+{
+	struct vm_alloc *cur_alloc;
+	struct mod cur_mod;
+	void *file;
+	int mod_n, free_mod_n;
+	unsigned long fail = 0;
+	int iter;
+
+	for (iter = 0; iter < ITERS; iter++) {
+		pr_info("Running iteration: %d\n", iter);
+		memset(allocs_vm, 0, mod_cnt * sizeof(struct vm_alloc));
+		vm_unmap_aliases();
+		for (mod_n = 0; mod_n < mod_cnt; mod_n++) {
+			cur_mod = get_rand_module();
+			cur_alloc = &allocs_vm[mod_n];
+
+			/* Allocate */
+			file = vmalloc(cur_mod.filesize);
+			cur_alloc->core = module_alloc(cur_mod.coresize);
+			cur_alloc->init = module_alloc(cur_mod.initsize);
+
+			/* Clean up everything except core */
+			if (!cur_alloc->core || !cur_alloc->init) {
+				fail++;
+				vfree(file);
+				if (cur_alloc->init) {
+					module_memfree(cur_alloc->init);
+					vm_unmap_aliases();
+				}
+				break;
+			}
+			module_memfree(cur_alloc->init);
+			vm_unmap_aliases();
+			vfree(file);
+		}
+
+		/* Clean up core sizes */
+		for (free_mod_n = 0; free_mod_n < mod_n; free_mod_n++) {
+			cur_alloc = &allocs_vm[free_mod_n];
+			if (cur_alloc->core)
+				module_memfree(cur_alloc->core);
+		}
+	}
+	pr_info("Failures(%ld modules): %lu\n", mod_cnt, fail);
+}
+
+#ifdef CONFIG_RANDOMIZE_FINE_MODULE
+static int is_in_backup(void *addr)
+{
+	return (unsigned long)addr >= MODULES_VADDR + MODULES_RAND_LEN;
+}
+#else
+static int is_in_backup(void *addr)
+{
+	return 0;
+}
+#endif
+
+static void do_test_last_perf(void)
+{
+	struct vm_alloc *cur_alloc;
+	struct mod cur_mod;
+	void *file;
+	int mod_n, mon_n_free;
+	unsigned long fail = 0;
+	int iter;
+	ktime_t start, diff;
+	ktime_t total_last = 0;
+	ktime_t total_all = 0;
+
+	/*
+	 * The number of last core allocations for each iteration that were
+	 * allocated in the backup area.
+	 */
+	int last_in_bk = 0;
+
+	/*
+	 * The total number of core allocations that were in the backup area for
+	 * all iterations.
+	 */
+	int total_in_bk = 0;
+
+	/* The number of iterations where the count was more than 1 */
+	int cnt_more_than_1 = 0;
+
+	/*
+	 * The number of core allocations that were in the backup area for the
+	 * current iteration.
+	 */
+	int cur_in_bk = 0;
+
+	unsigned long before_tlbs;
+	unsigned long tlb_cnt_total;
+	unsigned long tlb_cur;
+	unsigned long total_tlbs = 0;
+
+	pr_info("Starting %d iterations of %ld modules\n", ITERS, mod_cnt);
+
+	for (iter = 0; iter < ITERS; iter++) {
+		vm_unmap_aliases();
+		before_tlbs = get_tlb_flushes_vmalloc();
+		memset(allocs_vm, 0, mod_cnt * sizeof(struct vm_alloc));
+		tlb_cnt_total = 0;
+		cur_in_bk = 0;
+		for (mod_n = 0; mod_n < mod_cnt; mod_n++) {
+			/* allocate how the module allocator allocates */
+
+			cur_mod = get_rand_module();
+			cur_alloc = &allocs_vm[mod_n];
+			file = vmalloc(cur_mod.filesize);
+
+			tlb_cur = get_tlb_flushes_vmalloc();
+
+			start = ktime_get();
+			cur_alloc->core = module_alloc(cur_mod.coresize);
+			diff = ktime_get() - start;
+
+			cur_alloc->init = module_alloc(cur_mod.initsize);
+
+			/* Collect metrics */
+			if (is_in_backup(cur_alloc->core)) {
+				cur_in_bk++;
+				if (mod_n == mod_cnt - 1)
+					last_in_bk++;
+			}
+			total_all += diff;
+
+			if (mod_n == mod_cnt - 1)
+				total_last += diff;
+
+			tlb_cnt_total += get_tlb_flushes_vmalloc() - tlb_cur;
+
+			/* If there is a failure, quit. init/core freed later */
+			if (!cur_alloc->core || !cur_alloc->init) {
+				fail++;
+				vfree(file);
+				break;
+			}
+			/* Init sections do not last long so free here */
+			module_memfree(cur_alloc->init);
+			vm_unmap_aliases();
+			cur_alloc->init = NULL;
+			vfree(file);
+		}
+
+		/* Collect per iteration metrics */
+		total_in_bk += cur_in_bk;
+		if (cur_in_bk > 1)
+			cnt_more_than_1++;
+		total_tlbs += get_tlb_flushes_vmalloc() - before_tlbs;
+
+		/* Collect per iteration metrics */
+		for (mon_n_free = 0; mon_n_free < mod_cnt; mon_n_free++) {
+			cur_alloc = &allocs_vm[mon_n_free];
+			module_memfree(cur_alloc->init);
+			module_memfree(cur_alloc->core);
+		}
+	}
+
+	if (fail)
+		pr_info("There was an alloc failure, results invalid!\n");
+
+	pr_info("num\t\tall(ns)\t\tlast(ns)");
+	pr_info("%ld\t\t%llu\t\t%llu\n", mod_cnt,
+					div64_s64(total_all, ITERS * mod_cnt),
+					div64_s64(total_last, ITERS));
+
+	if (IS_ENABLED(CONFIG_RANDOMIZE_FINE_MODULE)) {
+		pr_info("Last module in backup count = %d\n", last_in_bk);
+		pr_info("Total modules in backup     = %d\n", total_in_bk);
+		pr_info(">1 module in backup count   = %d\n", cnt_more_than_1);
+	}
+	/*
+	 * This will usually hide info when the instrumentation is not in place.
+	 */
+	if (tlb_cnt_total)
+		pr_info("TLB Flushes: %lu\n", tlb_cnt_total);
+}
+
+static void do_test(int test)
+{
+	switch (test) {
+	case 1:
+		do_test_alloc_fail();
+		break;
+	case 2:
+		do_test_last_perf();
+		break;
+	default:
+		pr_info("Unknown test\n");
+	}
+}
+
+static ssize_t device_file_write(struct file *filp, const char __user *user_buf,
+				size_t count, loff_t *offp)
+{
+	char buf[100];
+	long input_num;
+
+	if (count >= sizeof(buf) - 1) {
+		pr_info("Command too long\n");
+		return count;
+	}
+
+	if (!mutex_trylock(&test_mod_alloc_mutex)) {
+		pr_info("test_mod_alloc busy\n");
+		return count;
+	}
+
+	if (copy_from_user(buf, user_buf, count))
+		goto error;
+
+	buf[count] = 0;
+
+	if (kstrtol(buf+1, 10, &input_num))
+		goto error;
+
+	switch (buf[0]) {
+	case 'm':
+		if (input_num > 0 && input_num <= MAX_ALLOC_CNT) {
+			pr_info("New module count: %ld\n", input_num);
+			mod_cnt = input_num;
+			if (allocs_vm)
+				vfree(allocs_vm);
+			allocs_vm = vmalloc(sizeof(struct vm_alloc) * mod_cnt);
+		} else
+			pr_info("more than %d not supported\n", MAX_ALLOC_CNT);
+		break;
+	case 't':
+		if (!mod_cnt) {
+			pr_info("Set module count first\n");
+			break;
+		}
+
+		do_test(input_num);
+		break;
+	default:
+		pr_info("Unknown command\n");
+	}
+	goto done;
+error:
+	pr_info("Could not process input\n");
+done:
+	mutex_unlock(&test_mod_alloc_mutex);
+	return count;
+}
+
+static const char *dv_name = "mod_alloc_test";
+const static struct file_operations test_mod_alloc_fops = {
+	.owner	= THIS_MODULE,
+	.write	= device_file_write,
+};
+
+static int __init mod_alloc_test_init(void)
+{
+	debugfs_create_file(dv_name, 0400, NULL, NULL, &test_mod_alloc_fops);
+
+	return 0;
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(mod_alloc_test_init);
diff --git a/tools/testing/selftests/bpf/test_mod_alloc.sh b/tools/testing/selftests/bpf/test_mod_alloc.sh
new file mode 100755
index 000000000000..e9aea570de78
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_mod_alloc.sh
@@ -0,0 +1,29 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+UNMOUNT_DEBUG_FS=0
+if ! mount | grep -q debugfs; then
+	if mount -t debugfs none /sys/kernel/debug/; then
+		UNMOUNT_DEBUG_FS=1
+	else
+		echo "Could not mount debug fs."
+		exit 1
+	fi
+fi
+
+if [ ! -e /sys/kernel/debug/mod_alloc_test ]; then
+	echo "Test module not found, did you build kernel with TEST_MOD_ALLOC?"
+	exit 1
+fi
+
+echo "Beginning module_alloc performance tests."
+
+for i in `seq 1000 1000 8000`; do
+	echo m$i>/sys/kernel/debug/mod_alloc_test
+	echo t2>/sys/kernel/debug/mod_alloc_test
+done
+
+echo "Module_alloc performance tests ended."
+
+if [ $UNMOUNT_DEBUG_FS -eq 1 ]; then
+	umount /sys/kernel/debug/
+fi
-- 
2.17.1



* Re: [PATCH v8 0/4] KASLR feature to randomize each loadable module
From: Andrew Morton @ 2018-11-06 21:04 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: jeyu, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook, kristen, dave.hansen,
	arjan

On Fri,  2 Nov 2018 12:25:16 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> This is V8 of the "KASLR feature to randomize each loadable module" patchset.
> The purpose is to increase the randomization and also to randomize the modules
> in relation to each other instead of just the base, so that if one module's
> location leaks, the locations of the others can't be inferred.

I'm not seeing any info here which explains why we should add this to
Linux.

What is the end-user value?  What problems does it solve?  Are those
problems real or theoretical?  What are the exploit scenarios and how
realistic are they?  etcetera, etcetera.  How are we to decide to buy
this thing if we aren't given a glossy brochure?

> There is a small allocation performance degradation versus v7 as a
> trade-off, but it is still faster on average than the existing
> algorithm until >7000 modules.

lol.  How did you test 7000 modules?  Using the selftest code?


* Re: [PATCH v8 1/4] vmalloc: Add __vmalloc_node_try_addr function
From: Andrew Morton @ 2018-11-06 21:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: jeyu, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook, kristen, dave.hansen,
	arjan

On Fri,  2 Nov 2018 12:25:17 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> Create a __vmalloc_node_try_addr function that tries to allocate at a specific
> address without triggering any lazy purging. In order to support this behavior,
> a try_purge argument was plugged into several of the static helpers.

Please explain (in the changelog) why lazy purging is considered to be
a problem.  Preferably with some form of measurements, or at least a
hand-wavy guesstimate of the cost.

> This also changes logic in __get_vm_area_node to be faster in cases where
> allocations fail due to no space, which is a lot more common when trying
> specific addresses.


* Re: [PATCH v8 2/4] x86/modules: Increase randomization for modules
From: Andrew Morton @ 2018-11-06 21:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: jeyu, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook, kristen, dave.hansen,
	arjan

On Fri,  2 Nov 2018 12:25:18 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> This changes the behavior of the KASLR logic for allocating memory for the text
> sections of loadable modules. It randomizes the location of each module text
> section with about 17 bits of entropy in typical use. This is enabled on X86_64
> only. For 32 bit, the behavior is unchanged.
> 
> It refactors the existing code around module randomization somewhat. There
> are now three different behaviors for x86 module_alloc depending on config:
> RANDOMIZE_BASE=n; RANDOMIZE_BASE=y with ARCH=x86_64; and RANDOMIZE_BASE=y
> with ARCH=i386.
>
> The refactor aims to show clearly what those behaviors are without keeping
> three separate versions or threading the behaviors through a bunch of little
> spots. The reason it is not enabled on 32 bit yet is that the module space
> there is much smaller and simulations haven't been run to see how it performs.
> 
> The new algorithm breaks the module space in two: a random area and a backup
> area. It first tries to allocate at a number of randomly located starting
> pages inside the random area, without purging any lazy free vmap areas and
> triggering the associated TLB flush.

Surprised.  Is one TLB flush per module loading sufficiently expensive
to justify any effort to avoid it?  IOW, please provide some
justification and explanation in the changelog.

> If this fails, it allocates in the backup area. The backup area base is
> offset in the same way the current algorithm offsets the module base: one of
> 1024 possible page locations.

So presumably the allocation effort in the randomized section is
somewhat likely to fail.  That's unfortunate.  Some discussion about
why these failures occur would be helpful.

Because it would be nice to do away with the backup area altogether. 
But this reader doesn't really understand why the backup area is
needed.

> Due to boot_params being defined with different types in different places,
> placing the config helpers in modules.h or kaslr.h caused conflicts elsewhere,
> so they are placed in a new file, kaslr_modules.h, instead.
> 



* Re: [PATCH v8 4/4] Kselftest for module text allocation benchmarking
From: Andrew Morton @ 2018-11-06 21:05 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: jeyu, willy, tglx, mingo, hpa, x86, linux-kernel, linux-mm,
	kernel-hardening, daniel, jannh, keescook, kristen, dave.hansen,
	arjan

On Fri,  2 Nov 2018 12:25:20 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> This adds a test module in lib/, and a script in kselftest that does
> benchmarking on the allocation of memory in the module space. Performance here
> would have some small impact on kernel module insertions, BPF JIT insertions
> and kprobes. In the case of KASLR features for the module space, this module
> can be used to measure the allocation performance of different configurations.
> This module needs to be compiled into the kernel because module_alloc is not
> exported.

Well, we could export module_alloc().  Would that be helpful at all?

> With some modification to the code, as explained in the comments, it can be
> enabled to measure TLB flushes as well.
> 
> There are two tests in the module. One allocates until failure in order to
> test module capacity and the other times allocating space in the module area.
> They both use module sizes that roughly approximate the distribution of in-tree
> X86_64 modules.
> 
> You can control the number of modules used in the tests like this:
> echo m1000>/sys/kernel/debug/mod_alloc_test
> 
> Run the test for module capacity like:
> echo t1>/sys/kernel/debug/mod_alloc_test
> 
> The other test will measure the allocation time, and for CONFIG_X86_64 and
> CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is
> 
> Run the test for allocation time and backup area usage like:
> echo t2>/dev/mod_alloc_test
> The output will be something like this:
> num		all(ns)		last(ns)
> 1000		1083		1099
> Last module in backup count = 0
> Total modules in backup     = 0
> >1 module in backup count   = 0

Are the above usage instructions captured in the kernel code somewhere?
I can't see it, and expecting people to trawl git changelogs isn't
very friendly.




* Re: [PATCH v8 0/4] KASLR feature to randomize each loadable module
From: Edgecombe, Rick P @ 2018-11-07 20:03 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, daniel, jeyu, keescook, jannh, willy, tglx,
	linux-mm, arjan, x86, kristen, hpa, mingo, kernel-hardening,
	Hansen, Dave

On Tue, 2018-11-06 at 13:04 -0800, Andrew Morton wrote:
> On Fri,  2 Nov 2018 12:25:16 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com>
> wrote:
> 
> > This is V8 of the "KASLR feature to randomize each loadable module"
> > patchset.
> > The purpose is to increase the randomization and also to randomize the
> > modules in relation to each other instead of just the base, so that if one
> > module's location leaks, the locations of the others can't be inferred.
> 
> I'm not seeing any info here which explains why we should add this to
> Linux.
> 
> What is the end-user value?  What problems does it solve?  Are those
> problems real or theoretical?  What are the exploit scenarios and how
> realistic are they?  etcetera, etcetera.  How are we to decide to buy
> this thing if we aren't given a glossy brochure?
Hi Andrew,

Thanks for taking a look! The first version had a proper write up, but now the
details are spread out over 8 versions. I'll send out another version with it
all in one place.

The short version is that today the RANDOMIZE_BASE feature randomizes the base
address where the module allocations begin, with 10 bits of entropy. From
there, a highly deterministic algorithm allocates space for the modules as
they are loaded and unloaded. If an attacker can predict the order and
identities of the modules that will be loaded, then a single text address leak
can give the attacker the locations of all the modules.

So this is trying to prevent the same class of attacks as the existing KASLR,
like control flow manipulation, and now also making it harder/longer to find
speculative execution gadgets. It increases the number of possible positions
128X, and provides that amount of randomness per module instead of a single
position for all modules.
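
Concretely (back-of-the-envelope from the figures above):

	existing: 2^10 = 1024 base offsets, shared by all modules
	new:      ~2^17 start positions, chosen per module
	ratio:    2^17 / 2^10 = 2^7 = 128X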

> > There is a small allocation performance degradation versus v7 as a
> > trade-off, but it is still faster on average than the existing
> > algorithm until >7000 modules.
> 
> lol.  How did you test 7000 modules?  Using the selftest code?

Yes, this is with simulations using the included kselftest code, with sizes
extracted from the in-tree x86_64 modules. Supporting 7000 kernel modules is
not the intention though; rather, it's trying to accommodate 7000 allocations
in the module space. So also eBPF JIT, classic BPF socket filter JIT, kprobes,
etc.



* Re: [PATCH v8 1/4] vmalloc: Add __vmalloc_node_try_addr function
From: Edgecombe, Rick P @ 2018-11-07 20:03 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, daniel, jeyu, keescook, jannh, willy, tglx,
	linux-mm, arjan, x86, kristen, hpa, mingo, kernel-hardening,
	Hansen, Dave

On Tue, 2018-11-06 at 13:05 -0800, Andrew Morton wrote:
> On Fri,  2 Nov 2018 12:25:17 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com>
> wrote:
> 
> > Create a __vmalloc_node_try_addr function that tries to allocate at a
> > specific address without triggering any lazy purging. In order to support
> > this behavior, a try_purge argument was plugged into several of the static
> > helpers.
> 
> Please explain (in the changelog) why lazy purging is considered to be
> a problem.  Preferably with some form of measurements, or at least a
> hand-wavy guesstimate of the cost.
Sure, I'll update it to be more clear. The problem is that when
__vmalloc_node_range fails to allocate (in this case it tries a single random
spot that doesn't fit), it triggers a purge_vmap_area_lazy and then retries the
allocation in the same spot. That doesn't make as much sense in this case, when
we are not trying over a large area. While it will usually not flush the TLB,
it does do extra work every time for the unlikely case of a lazy free area
blocking the allocation.

The average allocation time in ns for different versions measured by the
included kselftest:

Modules	Vmalloc optimization	No Vmalloc Optimization	Existing Module KASLR
1000	1433			1993			3821
2000	2295			3681			7830
3000	4424			7450			13012
4000	7746			13824			18106
5000	12721			21852			22572
6000	19724			33926			26443
7000	27638			47427			30473
8000	37745			64443			34200

The other optimization is not kmalloc-ing in __get_vm_area_node until after the
address was tried, which IIRC had a smaller but still noticeable performance
boost.

These allocations are not taking very long, but it may show up on systems with
very high usage of the module space (BPF JITs). If the trade-off of touching
vmalloc doesn't seem worth it to people, I'm happy to remove the optimizations.

> > This also changes logic in __get_vm_area_node to be faster in cases where
> > allocations fail due to no space, which is a lot more common when trying
> > specific addresses.


* Re: [PATCH v8 2/4] x86/modules: Increase randomization for modules
From: Edgecombe, Rick P @ 2018-11-07 20:03 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, daniel, jeyu, keescook, jannh, willy, tglx,
	linux-mm, arjan, x86, kristen, hpa, mingo, kernel-hardening,
	Hansen, Dave

On Tue, 2018-11-06 at 13:05 -0800, Andrew Morton wrote:
> On Fri,  2 Nov 2018 12:25:18 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com>
> wrote:
> 
> > This changes the behavior of the KASLR logic for allocating memory for the
> > text
> > sections of loadable modules. It randomizes the location of each module text
> > section with about 17 bits of entropy in typical use. This is enabled on
> > X86_64
> > only. For 32 bit, the behavior is unchanged.
> > 
> > It refactors the existing code around module randomization somewhat. There
> > are now three different behaviors for x86 module_alloc depending on config:
> > RANDOMIZE_BASE=n; RANDOMIZE_BASE=y with ARCH=x86_64; and RANDOMIZE_BASE=y
> > with ARCH=i386.
> > 
> > The refactor aims to show clearly what those behaviors are without keeping
> > three separate versions or threading the behaviors through a bunch of
> > little spots. The reason it is not enabled on 32 bit yet is that the module
> > space there is much smaller and simulations haven't been run to see how it
> > performs.
> > 
> > The new algorithm breaks the module space in two: a random area and a
> > backup area. It first tries to allocate at a number of randomly located
> > starting pages inside the random area, without purging any lazy free vmap
> > areas and triggering the associated TLB flush.
> Surprised.  Is one TLB flush per module loading sufficiently expensive
> to justify any effort to avoid it?  IOW, please provide some
> justification and explanation in the changelog.
I'll fix this. The text is accurate, but misleading in communicating the
intentions because it was left over from an earlier version. Sorry about that.

I think the performance benefit might come more from avoiding the retry in
vmalloc, than the TLB flush. In the other mail I posted some benchmarks.

> > If this fails, it allocates in the backup area. The backup area base is
> > offset in the same way the current algorithm offsets the module base: one
> > of 1024 possible page locations.
> 
> So presumably the allocation effort in the randomized section is
> somewhat likely to fail.  That's unfortunate.  Some discussion about
> why these failures occur would be helpful.
> 
> Because it would be nice to do away with the backup area altogether. 
> But this reader doesn't really understand why the backup area is
> needed.
With the high number of tries (10000), it usually only fails for larger
modules, until the module space starts to get full. A concrete use case for
the backup area is when the random area is fragmented with BPF filters and the
user then tries to load a large module.

For example, imagine you had the full 1GB of module space as the random area.
If you have 1000 BPF filters inserted, and they ended up evenly distributed,
1GB/1000 = 1MB, and so you couldn't load any module over 1MB. With 2000
filters, 0.5MB, etc.

The 2/3:1/3 split was chosen as my best guess at the trade-off between
fragmentation safety and entropy.

> > Due to boot_params being defined with different types in different places,
> > placing the config helpers in modules.h or kaslr.h caused conflicts
> > elsewhere, so they are placed in a new file, kaslr_modules.h, instead.
> > 
> 
> 


* Re: [PATCH v8 4/4] Kselftest for module text allocation benchmarking
From: Edgecombe, Rick P @ 2018-11-07 20:03 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, daniel, jeyu, keescook, jannh, willy, tglx,
	linux-mm, arjan, x86, kristen, hpa, mingo, kernel-hardening,
	Hansen, Dave

On Tue, 2018-11-06 at 13:05 -0800, Andrew Morton wrote:
> On Fri,  2 Nov 2018 12:25:20 -0700 Rick Edgecombe <rick.p.edgecombe@intel.com>
> wrote:
> 
> > This adds a test module in lib/, and a script in kselftest that does
> > benchmarking on the allocation of memory in the module space. Performance
> > here
> > would have some small impact on kernel module insertions, BPF JIT insertions
> > and kprobes. In the case of KASLR features for the module space, this module
> > can be used to measure the allocation performance of different
> > configurations.
> > This module needs to be compiled into the kernel because module_alloc is not
> > exported.
> 
> Well, we could export module_alloc().  Would that be helpful at all?
For me at least, it wasn't an issue to compile it into the kernel, since it's
just for development testing. Since it's controlled through debugfs, it doesn't
do anything until you write to it.

> > With some modification to the code, as explained in the comments, it can be
> > enabled to measure TLB flushes as well.
> > 
> > There are two tests in the module. One allocates until failure in order to
> > test module capacity and the other times allocating space in the module
> > area.
> > They both use module sizes that roughly approximate the distribution of in-
> > tree
> > X86_64 modules.
> > 
> > You can control the number of modules used in the tests like this:
> > echo m1000>/sys/kernel/debug/mod_alloc_test
> > 
> > Run the test for module capacity like:
> > echo t1>/sys/kernel/debug/mod_alloc_test
> > 
> > The other test will measure the allocation time, and for CONFIG_X86_64 and
> > CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is
> > used.
> > 
> > Run the test for allocation time and backup area usage like:
> > echo t2>/sys/kernel/debug/mod_alloc_test
> > The output will be something like this:
> > num		all(ns)		last(ns)
> > 1000		1083		1099
> > Last module in backup count = 0
> > Total modules in backup     = 0
> > > 1 module in backup count   = 0
> 
> Are the above usage instructions captured in the kernel code somewhere?
> I can't see it, and expecting people to trawl git changelogs isn't
> very friendly.
> 
Thanks. I'll add the instructions to the file. For the performance test, a
script is included that does everything needed.
