kernel-hardening.lists.openwall.com archive mirror
* [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately
@ 2016-05-05 22:13 Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 01/11] x86/boot: Clean up pointer casting Kees Cook
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Ingo Molnar, Baoquan He, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

This is v6 of the x86 KASLR improvement series from Yinghai, Baoquan,
and myself. The current branch lives here:
http://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=kaslr/highmem

***Background:
Bugs have been reported around kdump, kexec, and some netboot situations
that didn't work when KASLR was enabled. While discussing the bugs, it
was found that the current KASLR implementation had various limitations,
but most importantly that it can only randomize in a 1GB region of
physical memory.

The current KASLR implementation only randomizes the base physical
address of the kernel. If the delta between the build-time load address
and the KASLR run-time load address (i.e. the physical address where the
kernel actually decompressed) is not equal to 0, relocation handling is
performed using the delta. Though in principle the kernel can be randomized
to any physical address, the physical kernel text mapping address space
is limited to 1G and the virtual address is just offset by the same
amount. On x86_64 the result is the following range:
[0xffffffff80000000, 0xffffffffc0000000)

hpa and Vivek suggested we should change this by decoupling the physical
address and virtual address randomization of kernel text and let them work
separately. Then kernel text physical address can be randomized in region
[16M, 64T), and kernel text virtual address can be randomized in region
[0xffffffff80000000, 0xffffffffc0000000).
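
For illustration only (the helper below and the 2M alignment are assumptions
for this sketch, not code from this series), the decoupling amounts to two
independent draws:

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>

	/* rand() stands in for the decompressor's RDRAND/RDTSC entropy. */
	static uint64_t pick_aligned(uint64_t start, uint64_t end, uint64_t align)
	{
		uint64_t slots = (end - start) / align;
		return start + ((uint64_t)rand() % slots) * align;
	}

	int main(void)
	{
		/* Physical base: anywhere in [16M, 64T). */
		uint64_t phys = pick_aligned(16ULL << 20, 64ULL << 40, 2ULL << 20);
		/* Virtual base: within the 1G kernel text mapping window. */
		uint64_t virt = pick_aligned(0xffffffff80000000ULL,
					     0xffffffffc0000000ULL, 2ULL << 20);

		printf("phys base %#llx, virt base %#llx\n",
		       (unsigned long long)phys, (unsigned long long)virt);
		return 0;
	}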

***Problems that needed solving:
  - When booting via startup_32, only a 0~4G identity mapping is built.
    If the kernel can be randomly placed anywhere from 16M up to 64T,
    the cost of building identity mappings for all of that space up
    front is too high. We need to build the identity mappings on demand,
    not covering all of the physical address space.

  - Decouple the physical address and virtual address randomization of kernel
    text and let them work separately.

  - Kernels loaded high will not randomize into a lower memory region.

***Parts:
   - The 1st part prepares the code for more invasive changes.
     (Patches 01-02)
   - The 2nd part is Yinghai's building of identity mappings on demand,
     fixing problem #1 above.
     (Patches 03-04)
   - The 3rd part is Baoquan's new randomization slot management code for
     handling the much larger possible memory space.
     (Patches 05-06)
   - The 4th part is Baoquan's decoupling the physical address and virtual
     address randomization of kernel text and letting them work separately,
     based on Yinghai's ident mapping patches, fixing problem #2 above.
     (Patches 07-09)
   - The 5th part lifts the upper and lower limits on physical addresses,
     fixing problem #3 above.
     (Patches 10-11)

I've boot tested this a bunch on 32-bit and 64-bit, and things appear to
be working as expected. I've cleaned up the changelogs, improved some
comments, refactored a few things, split out things and merged others,
etc. Changes are noted in the individual changelogs.

Thanks!

-Kees

v5->v6:
- sent other clean-ups as separate patches
- squashed code removal into the patches that made them removable
- refactored slot calculation to avoid the confusing "if" statement
- protected slot_area_index in store_slot_info to be paranoid
- adjusted Kconfig language to be more complete but with hopefully less jargon
- fixed up as much of the unsigned long casts as possible
- fixed some coding style issues with brace usage
- made find_random_phys_addr look like find_random_virt_addr
- clarified various variable names

v4->v5:
- rewrote all the changelogs, and several comments.
- refactored e820 parser to use a while loop instead of goto.
- rearranged removal of CONFIG_RANDOMIZE_BASE_MAX_OFFSET to earlier.
- additionally dropped KERNEL_IMAGE_SIZE_DEFAULT
- refactored minimum address calculation
- refactored slot offset calculation for readability
- fixed 32-bit boot failures
- fixed CONFIG_RANDOMIZE_BASE=n boot failure
- improved debug reporting

[Baoquan's history]
v3->v4:
- Made changes according to Kees's comments.
  Added patch 20/20, as Kees suggested, to use KERNEL_IMAGE_SIZE as the
  maximum offset for virtual randomization, and cleaned up the unused
  CONFIG_RANDOM_OFFSET_MAX

    x86, kaslr: Use KERNEL_IMAGE_SIZE as the offset max for kernel virtual randomization

v2->v3:
- This round only takes care of the kaslr-related patches.
  For reviewers it's better to discuss only one issue in one thread.
    * I dropped the following patch from Yinghai's series because I think it's unnecessary:
       - Patch 05/19 x86, kaslr: rename output_size to output_run_size
         output_size is enough to represent the value:
            output_len > run_size ? output_len : run_size

    * I added Patch 04/19, a comment update patch. For the other patches, I just
      adjusted the patch logs and made changes in several places compared with the
      2nd round. Please check the change log below each patch log for details.

    * Adjusted the order of several patches to make review easier. This doesn't
      affect the code.

v1->v2:
- In the 2nd round Yinghai posted a big patchset including this kaslr fix and
  another setup_data related fix. The link is here:
   http://lists-archives.com/linux-kernel/28346903-x86-updated-patches-for-kaslr-and-setup_data-etc-for-v4-3.html
  You can get the code from Yinghai's git branch:
  git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-v4.3-next

v1:
- The first round can be found here:
    https://lwn.net/Articles/637115/


* [kernel-hardening] [PATCH v6 01/11] x86/boot: Clean up pointer casting
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 02/11] x86/KASLR: Consolidate mem_avoid entries Kees Cook
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Ingo Molnar, Baoquan He, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

Currently extract_kernel() defines the input and output buffer pointers
as "unsigned char *" since that's effectively what they are. It passes
these to the decompressor routine and to the ELF parser, which both
logically deal with buffer pointers too. There is some casting ("unsigned
long") done to validate the numerical value of the pointers, but it is
relatively limited.

However, choose_random_location() operates almost exclusively on the
numerical representation of these pointers, so it ended up carrying
a lot of "unsigned long" casts. With the future physical/virtual split
these casts were going to multiply, so this attempts to solve the
problem by doing all the casting at choose_random_location()'s entry
and return instead of throughout the code. This also adjusts the argument
names to be more meaningful, and changes one use of "choice" to "output"
to make the future physical/virtual split more clear (i.e. "choice" should
be strictly a function return value and not used as an intermediate).

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 20 ++++++++++++++------
 arch/x86/boot/compressed/misc.h  | 10 +++++-----
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index f1818d95d726..2072d82c1911 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -305,12 +305,21 @@ static unsigned long find_random_addr(unsigned long minimum,
 	return slots_fetch_random();
 }
 
-unsigned char *choose_random_location(unsigned char *input,
+unsigned char *choose_random_location(unsigned char *input_ptr,
 				      unsigned long input_size,
-				      unsigned char *output,
+				      unsigned char *output_ptr,
 				      unsigned long output_size)
 {
-	unsigned long choice = (unsigned long)output;
+	/*
+	 * The caller of choose_random_location() uses unsigned char * for
+	 * buffer pointers since it performs decompression, elf parsing, etc.
+	 * Since this code examines addresses much more numerically,
+	 * unsigned long is used internally here. Instead of sprinkling
+	 * more casts into extract_kernel, do them here and at return.
+	 */
+	unsigned long input = (unsigned long)input_ptr;
+	unsigned long output = (unsigned long)output_ptr;
+	unsigned long choice = output;
 	unsigned long random_addr;
 
 #ifdef CONFIG_HIBERNATION
@@ -328,11 +337,10 @@ unsigned char *choose_random_location(unsigned char *input,
 	boot_params->hdr.loadflags |= KASLR_FLAG;
 
 	/* Record the various known unsafe memory ranges. */
-	mem_avoid_init((unsigned long)input, input_size,
-		       (unsigned long)output, output_size);
+	mem_avoid_init(input, input_size, output, output_size);
 
 	/* Walk e820 and find a random address. */
-	random_addr = find_random_addr(choice, output_size);
+	random_addr = find_random_addr(output, output_size);
 	if (!random_addr) {
 		warn("KASLR disabled: could not find suitable E820 region!");
 		goto out;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 9887e0d4aaeb..1f23d022d241 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -67,20 +67,20 @@ int cmdline_find_option_bool(const char *option);
 
 #if CONFIG_RANDOMIZE_BASE
 /* kaslr.c */
-unsigned char *choose_random_location(unsigned char *input,
+unsigned char *choose_random_location(unsigned char *input_ptr,
 				      unsigned long input_size,
-				      unsigned char *output,
+				      unsigned char *output_ptr,
 				      unsigned long output_size);
 /* cpuflags.c */
 bool has_cpuflag(int flag);
 #else
 static inline
-unsigned char *choose_random_location(unsigned char *input,
+unsigned char *choose_random_location(unsigned char *input_ptr,
 				      unsigned long input_size,
-				      unsigned char *output,
+				      unsigned char *output_ptr,
 				      unsigned long output_size)
 {
-	return output;
+	return output_ptr;
 }
 #endif
 
-- 
2.6.3


* [kernel-hardening] [PATCH v6 02/11] x86/KASLR: Consolidate mem_avoid entries
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 01/11] x86/boot: Clean up pointer casting Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 03/11] x86/boot: Split out kernel_ident_mapping_init Kees Cook
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Yinghai Lu, Baoquan He, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Yinghai Lu <yinghai@kernel.org>

The mem_avoid array is used to track positions that should be avoided (like
the compressed kernel, decompression code, etc) when selecting a memory
position for the randomly relocated kernel. Since ZO is now at the end of
the decompression buffer and the decompression code (and its heap and
stack) are at the front, we can safely consolidate the decompression entry,
the heap entry, and the stack entry. The boot_params memory, however, could
be elsewhere, so it should be explicitly included.

Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
[kees: rewrote changelog, cleaned up code comment]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 77 +++++++++++++++++++++++++++++++---------
 1 file changed, 61 insertions(+), 16 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 2072d82c1911..6392f0041b8a 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -121,7 +121,7 @@ struct mem_vector {
 	unsigned long size;
 };
 
-#define MEM_AVOID_MAX 5
+#define MEM_AVOID_MAX 4
 static struct mem_vector mem_avoid[MEM_AVOID_MAX];
 
 static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
@@ -146,22 +146,71 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
 	return true;
 }
 
+/*
+ * In theory, KASLR can put the kernel anywhere in the area of [16M, 64T). The
+ * mem_avoid array is used to store the ranges that need to be avoided when
+ * KASLR searches for an appropriate random address. We must avoid any
+ * regions that are unsafe to overlap with during decompression, and other
+ * things like the initrd, cmdline and boot_params.
+ *
+ * How to calculate the unsafe areas is detailed here, and is informed by
+ * the decompression calculations in header.S, and the diagram in misc.c.
+ *
+ * The compressed vmlinux (ZO) plus relocs and the run space of ZO can't be
+ * overwritten by decompression output.
+ *
+ * ZO sits against the end of the decompression buffer, so we can calculate
+ * where text, data, bss, etc of ZO are positioned.
+ *
+ * The following are already enforced by the code:
+ *  - init_size >= kernel_total_size
+ *  - input + input_len >= output + output_len
+ *  - kernel_total_size could be >= or < output_len
+ *
+ * From this, we can make several observations, illustrated by a diagram:
+ *  - init_size >= kernel_total_size
+ *  - input + input_len > output + output_len
+ *  - kernel_total_size >= output_len
+ *
+ * 0   output            input            input+input_len    output+init_size
+ * |     |                 |                       |                       |
+ * |     |                 |                       |                       |
+ * |-----|--------|--------|------------------|----|------------|----------|
+ *                |                           |                 |
+ *                |                           |                 |
+ * output+init_size-ZO_INIT_SIZE   output+output_len  output+kernel_total_size
+ *
+ * [output, output+init_size) is for the buffer for decompressing the
+ * compressed kernel (ZO).
+ *
+ * [output, output+kernel_total_size) is for the uncompressed kernel (VO)
+ * and its bss, brk, etc.
+ * [output, output+output_len) is VO plus relocs
+ *
+ * [output+init_size-ZO_INIT_SIZE, output+init_size) is the copied ZO.
+ * [input, input+input_len) is the copied compressed (VO (vmlinux after
+ * objcopy) plus relocs), not the ZO.
+ *
+ * [input+input_len, output+init_size) is [_text, _end) for ZO. That was the
+ * first range in mem_avoid, which included ZO's heap and stack. Also
+ * [input, input+input_size) needs to be put in the mem_avoid array, but since it
+ * is adjacent to the first entry, they can be merged. This is how we get
+ * the first entry in mem_avoid[].
+ */
 static void mem_avoid_init(unsigned long input, unsigned long input_size,
-			   unsigned long output, unsigned long output_size)
+			   unsigned long output)
 {
+	unsigned long init_size = boot_params->hdr.init_size;
 	u64 initrd_start, initrd_size;
 	u64 cmd_line, cmd_line_size;
-	unsigned long unsafe, unsafe_len;
 	char *ptr;
 
 	/*
 	 * Avoid the region that is unsafe to overlap during
-	 * decompression (see calculations in ../header.S).
+	 * decompression.
 	 */
-	unsafe_len = (output_size >> 12) + 32768 + 18;
-	unsafe = (unsigned long)input + input_size - unsafe_len;
-	mem_avoid[0].start = unsafe;
-	mem_avoid[0].size = unsafe_len;
+	mem_avoid[0].start = input;
+	mem_avoid[0].size = (output + init_size) - input;
 
 	/* Avoid initrd. */
 	initrd_start  = (u64)boot_params->ext_ramdisk_image << 32;
@@ -181,13 +230,9 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 	mem_avoid[2].start = cmd_line;
 	mem_avoid[2].size = cmd_line_size;
 
-	/* Avoid heap memory. */
-	mem_avoid[3].start = (unsigned long)free_mem_ptr;
-	mem_avoid[3].size = BOOT_HEAP_SIZE;
-
-	/* Avoid stack memory. */
-	mem_avoid[4].start = (unsigned long)free_mem_end_ptr;
-	mem_avoid[4].size = BOOT_STACK_SIZE;
+	/* Avoid params */
+	mem_avoid[3].start = (unsigned long)boot_params;
+	mem_avoid[3].size = sizeof(*boot_params);
 }
 
 /* Does this memory vector overlap a known avoided area? */
@@ -337,7 +382,7 @@ unsigned char *choose_random_location(unsigned char *input_ptr,
 	boot_params->hdr.loadflags |= KASLR_FLAG;
 
 	/* Record the various known unsafe memory ranges. */
-	mem_avoid_init(input, input_size, output, output_size);
+	mem_avoid_init(input, input_size, output);
 
 	/* Walk e820 and find a random address. */
 	random_addr = find_random_addr(output, output_size);
-- 
2.6.3


* [kernel-hardening] [PATCH v6 03/11] x86/boot: Split out kernel_ident_mapping_init
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 01/11] x86/boot: Clean up pointer casting Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 02/11] x86/KASLR: Consolidate mem_avoid entries Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand Kees Cook
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Yinghai Lu, Ingo Molnar, Baoquan He, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Yinghai Lu <yinghai@kernel.org>

In order to support on-demand page table creation when moving the
kernel for KASLR, we need to use kernel_ident_mapping_init() in the
decompression code. This splits it out into its own file for use outside
of init_64.c. Additionally, checking for __pa/__va defines is added
since they need to be overridden in the decompression code.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/page.h |  5 +++
 arch/x86/mm/ident_map.c     | 74 +++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/init_64.c       | 74 +--------------------------------------------
 3 files changed, 80 insertions(+), 73 deletions(-)
 create mode 100644 arch/x86/mm/ident_map.c

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 802dde30c928..cf8f619b305f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -37,7 +37,10 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
+#ifndef __pa
 #define __pa(x)		__phys_addr((unsigned long)(x))
+#endif
+
 #define __pa_nodebug(x)	__phys_addr_nodebug((unsigned long)(x))
 /* __pa_symbol should be used for C visible symbols.
    This seems to be the official gcc blessed way to do such arithmetic. */
@@ -51,7 +54,9 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 #define __pa_symbol(x) \
 	__phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
 
+#ifndef __va
 #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
+#endif
 
 #define __boot_va(x)		__va(x)
 #define __boot_pa(x)		__pa(x)
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
new file mode 100644
index 000000000000..751ca920773a
--- /dev/null
+++ b/arch/x86/mm/ident_map.c
@@ -0,0 +1,74 @@
+
+static void ident_pmd_init(unsigned long pmd_flag, pmd_t *pmd_page,
+			   unsigned long addr, unsigned long end)
+{
+	addr &= PMD_MASK;
+	for (; addr < end; addr += PMD_SIZE) {
+		pmd_t *pmd = pmd_page + pmd_index(addr);
+
+		if (!pmd_present(*pmd))
+			set_pmd(pmd, __pmd(addr | pmd_flag));
+	}
+}
+static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
+			  unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+
+	for (; addr < end; addr = next) {
+		pud_t *pud = pud_page + pud_index(addr);
+		pmd_t *pmd;
+
+		next = (addr & PUD_MASK) + PUD_SIZE;
+		if (next > end)
+			next = end;
+
+		if (pud_present(*pud)) {
+			pmd = pmd_offset(pud, 0);
+			ident_pmd_init(info->pmd_flag, pmd, addr, next);
+			continue;
+		}
+		pmd = (pmd_t *)info->alloc_pgt_page(info->context);
+		if (!pmd)
+			return -ENOMEM;
+		ident_pmd_init(info->pmd_flag, pmd, addr, next);
+		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+	}
+
+	return 0;
+}
+
+int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
+			      unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	int result;
+	int off = info->kernel_mapping ? pgd_index(__PAGE_OFFSET) : 0;
+
+	for (; addr < end; addr = next) {
+		pgd_t *pgd = pgd_page + pgd_index(addr) + off;
+		pud_t *pud;
+
+		next = (addr & PGDIR_MASK) + PGDIR_SIZE;
+		if (next > end)
+			next = end;
+
+		if (pgd_present(*pgd)) {
+			pud = pud_offset(pgd, 0);
+			result = ident_pud_init(info, pud, addr, next);
+			if (result)
+				return result;
+			continue;
+		}
+
+		pud = (pud_t *)info->alloc_pgt_page(info->context);
+		if (!pud)
+			return -ENOMEM;
+		result = ident_pud_init(info, pud, addr, next);
+		if (result)
+			return result;
+		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+	}
+
+	return 0;
+}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 214afda97911..65cfbeefbec4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -58,79 +58,7 @@
 
 #include "mm_internal.h"
 
-static void ident_pmd_init(unsigned long pmd_flag, pmd_t *pmd_page,
-			   unsigned long addr, unsigned long end)
-{
-	addr &= PMD_MASK;
-	for (; addr < end; addr += PMD_SIZE) {
-		pmd_t *pmd = pmd_page + pmd_index(addr);
-
-		if (!pmd_present(*pmd))
-			set_pmd(pmd, __pmd(addr | pmd_flag));
-	}
-}
-static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
-			  unsigned long addr, unsigned long end)
-{
-	unsigned long next;
-
-	for (; addr < end; addr = next) {
-		pud_t *pud = pud_page + pud_index(addr);
-		pmd_t *pmd;
-
-		next = (addr & PUD_MASK) + PUD_SIZE;
-		if (next > end)
-			next = end;
-
-		if (pud_present(*pud)) {
-			pmd = pmd_offset(pud, 0);
-			ident_pmd_init(info->pmd_flag, pmd, addr, next);
-			continue;
-		}
-		pmd = (pmd_t *)info->alloc_pgt_page(info->context);
-		if (!pmd)
-			return -ENOMEM;
-		ident_pmd_init(info->pmd_flag, pmd, addr, next);
-		set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
-	}
-
-	return 0;
-}
-
-int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
-			      unsigned long addr, unsigned long end)
-{
-	unsigned long next;
-	int result;
-	int off = info->kernel_mapping ? pgd_index(__PAGE_OFFSET) : 0;
-
-	for (; addr < end; addr = next) {
-		pgd_t *pgd = pgd_page + pgd_index(addr) + off;
-		pud_t *pud;
-
-		next = (addr & PGDIR_MASK) + PGDIR_SIZE;
-		if (next > end)
-			next = end;
-
-		if (pgd_present(*pgd)) {
-			pud = pud_offset(pgd, 0);
-			result = ident_pud_init(info, pud, addr, next);
-			if (result)
-				return result;
-			continue;
-		}
-
-		pud = (pud_t *)info->alloc_pgt_page(info->context);
-		if (!pud)
-			return -ENOMEM;
-		result = ident_pud_init(info, pud, addr, next);
-		if (result)
-			return result;
-		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
-	}
-
-	return 0;
-}
+#include "ident_map.c"
 
 /*
  * NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the
-- 
2.6.3


* [kernel-hardening] [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (2 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 03/11] x86/boot: Split out kernel_ident_mapping_init Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-06  7:00   ` [kernel-hardening] " Ingo Molnar
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 05/11] x86/KASLR: Add slot_area to manage random_addr slots Kees Cook
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Yinghai Lu, Jiri Kosina, Borislav Petkov, Baoquan He,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, Vivek Goyal,
	Andy Lutomirski, lasse.collin, Andrew Morton, Dave Young,
	kernel-hardening, LKML

From: Yinghai Lu <yinghai@kernel.org>

Currently KASLR only supports relocation in a small physical range (from
16M to 1G), due to using the initial kernel page table identity mapping.
To support ranges above this, we need to have an identity mapping for the
desired memory range before we can decompress (and later run) the kernel.

32-bit kernels already have the needed identity mapping. This patch adds
identity mappings for the needed memory ranges on 64-bit kernels. This
happens in two possible boot paths:

If loaded via startup_32, we need to set up the required identity mappings
ourselves.

If loaded from a 64-bit bootloader, the bootloader will have
already set up an identity mapping, and we'll start via ZO's
(arch/x86/boot/compressed/vmlinux) startup_64. In this case, the
bootloader's page tables need to be avoided when we select the new VO
(vmlinux) location. If not, the decompressor could overwrite them during
decompression.

To accomplish avoiding the bootloader's page tables, we could walk the
pagetable and find every page that is used, and add them to mem_avoid,
but this needs extra code and will require increasing the size of the
mem_avoid array.

Instead, we can create a new identity mapping, and the pages for its page
tables will come from the _pgtable section of ZO, which means they are
already in the mem_avoid array. To do this, we reuse the code for the
kernel's identity mapping.

The _pgtable will be shared by the 32-bit and 64-bit paths to reduce
init_size, since ZO's _rodata to _end now contributes to init_size.

To handle the possible mappings, we need to increase the pgt buffer size:

When booting via startup_64, we need to cover the old VO, params, cmdline
and new VO. In an extreme case they could all cross a 512G boundary, which
needs (2+2)*4 pages with 2M mappings. We also need 2 pages for the first
2M (VGA RAM), and one more page for the level4 page table. This gets us
to 19 pages total.

When booting via startup_32, KASLR could move the new VO above 4G, so we
need to set up extra identity mappings for the VO, which should need at
most (2+2) pages even when it crosses a 512G boundary. So 19 pages is
sufficient for this case as well.
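
As a quick recount of the arithmetic above (macro names invented for
illustration; reading "(2+2)" as two PUD plus two PMD pages per region
that crosses a 512G boundary):

	#define PGT_PAGES_PER_REGION	(2 + 2)	/* PUD + PMD pages across a 512G boundary */
	#define PGT_NR_REGIONS		4	/* old VO, params, cmdline, new VO */
	#define PGT_PAGES_VGA		2	/* first 2M mapping for VGA RAM */
	#define PGT_PAGES_LEVEL4	1

	_Static_assert(PGT_PAGES_PER_REGION * PGT_NR_REGIONS +
		       PGT_PAGES_VGA + PGT_PAGES_LEVEL4 == 19,
		       "worst-case page table budget is 19 pages");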

Cc: Kees Cook <keescook@chromium.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/Makefile   |  3 ++
 arch/x86/boot/compressed/head_64.S  |  4 +-
 arch/x86/boot/compressed/kaslr.c    | 14 ++++++
 arch/x86/boot/compressed/misc.h     | 11 +++++
 arch/x86/boot/compressed/misc_pgt.c | 93 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/boot.h         | 19 ++++++++
 6 files changed, 142 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/boot/compressed/misc_pgt.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 77ce3a04d46e..0c5c0b50fa83 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -75,6 +75,9 @@ vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
 
 vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
+ifdef CONFIG_X86_64
+	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/misc_pgt.o
+endif
 
 $(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
 
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 7c047002950c..0d80a7ad65cd 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -134,7 +134,7 @@ ENTRY(startup_32)
 	/* Initialize Page tables to 0 */
 	leal	pgtable(%ebx), %edi
 	xorl	%eax, %eax
-	movl	$((4096*6)/4), %ecx
+	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
 	/* Build Level 4 */
@@ -486,4 +486,4 @@ boot_stack_end:
 	.section ".pgtable","a",@nobits
 	.balign 4096
 pgtable:
-	.fill 6*4096, 1, 0
+	.fill BOOT_PGT_SIZE, 1, 0
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 6392f0041b8a..36356fb6d4cf 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -211,6 +211,7 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 	 */
 	mem_avoid[0].start = input;
 	mem_avoid[0].size = (output + init_size) - input;
+	fill_pagetable(mem_avoid[0].start, mem_avoid[0].size);
 
 	/* Avoid initrd. */
 	initrd_start  = (u64)boot_params->ext_ramdisk_image << 32;
@@ -219,6 +220,7 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 	initrd_size |= boot_params->hdr.ramdisk_size;
 	mem_avoid[1].start = initrd_start;
 	mem_avoid[1].size = initrd_size;
+	/* No need to set mapping for initrd */
 
 	/* Avoid kernel command line. */
 	cmd_line  = (u64)boot_params->ext_cmd_line_ptr << 32;
@@ -229,10 +231,19 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 		;
 	mem_avoid[2].start = cmd_line;
 	mem_avoid[2].size = cmd_line_size;
+	fill_pagetable(mem_avoid[2].start, mem_avoid[2].size);
 
 	/* Avoid params */
 	mem_avoid[3].start = (unsigned long)boot_params;
 	mem_avoid[3].size = sizeof(*boot_params);
+	fill_pagetable(mem_avoid[3].start, mem_avoid[3].size);
+
+	/* don't need to set mapping for setup_data */
+
+#ifdef CONFIG_X86_VERBOSE_BOOTUP
+	/* for video ram */
+	fill_pagetable(0, PMD_SIZE);
+#endif
 }
 
 /* Does this memory vector overlap a known avoided area? */
@@ -396,6 +407,9 @@ unsigned char *choose_random_location(unsigned char *input_ptr,
 		goto out;
 
 	choice = random_addr;
+
+	fill_pagetable(choice, output_size);
+	switch_pagetable();
 out:
 	return (unsigned char *)choice;
 }
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 1f23d022d241..ffacded0e66a 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -84,6 +84,17 @@ unsigned char *choose_random_location(unsigned char *input_ptr,
 }
 #endif
 
+#ifdef CONFIG_X86_64
+void fill_pagetable(unsigned long start, unsigned long size);
+void switch_pagetable(void);
+extern unsigned char _pgtable[];
+#else
+static inline void fill_pagetable(unsigned long start, unsigned long size)
+{ }
+static inline void switch_pagetable(void)
+{ }
+#endif
+
 #ifdef CONFIG_EARLY_PRINTK
 /* early_serial_console.c */
 extern int early_serial_base;
diff --git a/arch/x86/boot/compressed/misc_pgt.c b/arch/x86/boot/compressed/misc_pgt.c
new file mode 100644
index 000000000000..816551d0327c
--- /dev/null
+++ b/arch/x86/boot/compressed/misc_pgt.c
@@ -0,0 +1,93 @@
+#define __pa(x)  ((unsigned long)(x))
+#define __va(x)  ((void *)((unsigned long)(x)))
+
+#include "misc.h"
+
+#include <asm/init.h>
+#include <asm/pgtable.h>
+
+#include "../../mm/ident_map.c"
+#include "../string.h"
+
+struct alloc_pgt_data {
+	unsigned char *pgt_buf;
+	unsigned long pgt_buf_size;
+	unsigned long pgt_buf_offset;
+};
+
+static void *alloc_pgt_page(void *context)
+{
+	struct alloc_pgt_data *d = (struct alloc_pgt_data *)context;
+	unsigned char *p = (unsigned char *)d->pgt_buf;
+
+	if (d->pgt_buf_offset >= d->pgt_buf_size) {
+		debug_putstr("out of pgt_buf in misc.c\n");
+		debug_putaddr(d->pgt_buf_offset);
+		debug_putaddr(d->pgt_buf_size);
+		return NULL;
+	}
+
+	p += d->pgt_buf_offset;
+	d->pgt_buf_offset += PAGE_SIZE;
+
+	return p;
+}
+
+/*
+ * Use a normal definition of memset() from string.c. There are already
+ * included header files which expect a definition of memset() and by
+ * the time we define memset macro, it is too late.
+ */
+#undef memset
+
+unsigned long __force_order;
+static struct alloc_pgt_data pgt_data;
+static struct x86_mapping_info mapping_info;
+static pgd_t *level4p;
+
+void fill_pagetable(unsigned long start, unsigned long size)
+{
+	unsigned long end = start + size;
+
+	if (!level4p) {
+		pgt_data.pgt_buf_offset = 0;
+		mapping_info.alloc_pgt_page = alloc_pgt_page;
+		mapping_info.context = &pgt_data;
+		mapping_info.pmd_flag = __PAGE_KERNEL_LARGE_EXEC;
+
+		/*
+		 * come from startup_32 ?
+		 * then cr3 is _pgtable, we can reuse it.
+		 */
+		level4p = (pgd_t *)read_cr3();
+		if ((unsigned long)level4p == (unsigned long)_pgtable) {
+			pgt_data.pgt_buf = (unsigned char *)_pgtable +
+						 BOOT_INIT_PGT_SIZE;
+			pgt_data.pgt_buf_size = BOOT_PGT_SIZE -
+						 BOOT_INIT_PGT_SIZE;
+			memset((unsigned char *)pgt_data.pgt_buf, 0,
+				pgt_data.pgt_buf_size);
+			debug_putstr("boot via startup_32\n");
+		} else {
+			pgt_data.pgt_buf = (unsigned char *)_pgtable;
+			pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
+			memset((unsigned char *)pgt_data.pgt_buf, 0,
+				pgt_data.pgt_buf_size);
+			debug_putstr("boot via startup_64\n");
+			level4p = (pgd_t *)alloc_pgt_page(&pgt_data);
+		}
+	}
+
+	/* align boundary to 2M */
+	start = round_down(start, PMD_SIZE);
+	end = round_up(end, PMD_SIZE);
+	if (start >= end)
+		return;
+
+	kernel_ident_mapping_init(&mapping_info, level4p, start, end);
+}
+
+void switch_pagetable(void)
+{
+	write_cr3((unsigned long)level4p);
+}
diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
index 6b8d6e8cd449..52a9cbc1419f 100644
--- a/arch/x86/include/asm/boot.h
+++ b/arch/x86/include/asm/boot.h
@@ -32,7 +32,26 @@
 #endif /* !CONFIG_KERNEL_BZIP2 */
 
 #ifdef CONFIG_X86_64
+
 #define BOOT_STACK_SIZE	0x4000
+
+#define BOOT_INIT_PGT_SIZE (6*4096)
+#ifdef CONFIG_RANDOMIZE_BASE
+/*
+ * 1 page for level4, 2 pages for first 2M.
+ * (2+2)*4 pages for kernel, param, cmd_line, random kernel
+ * if all cross 512G boundary.
+ * So total will be 19 pages.
+ */
+#ifdef CONFIG_X86_VERBOSE_BOOTUP
+#define BOOT_PGT_SIZE (19*4096)
+#else
+#define BOOT_PGT_SIZE (17*4096)
+#endif
+#else
+#define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
+#endif
+
 #else
 #define BOOT_STACK_SIZE	0x1000
 #endif
-- 
2.6.3


* [kernel-hardening] [PATCH v6 05/11] x86/KASLR: Add slot_area to manage random_addr slots
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (3 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 06/11] x86/KASLR: Return earliest overlap when avoiding regions Kees Cook
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Baoquan He, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Baoquan He <bhe@redhat.com>

In order to support KASLR moving the kernel anywhere in physical memory
(which could be up to 64TB), we need to handle counting the potential
randomization locations in a more efficient manner.

In the worst case with 64TB, there could be roughly 32 * 1024 * 1024
randomization slots if CONFIG_PHYSICAL_ALIGN is 0x1000000. Currently
the starting address of each candidate position is stored into the slots[]
array, one at a time. This would cost too much memory, and it is also
very inefficient to gather and store the slot information one entry at
a time.

This patch introduces struct slot_area to manage each contiguous region
of randomization slots. Each slot_area will contain the starting address
and how many available slots are in this area. As with the original code,
the slot_areas will avoid the mem_avoid regions.

Since setup_data is a linked list, it could contain an unknown number
of memory regions to be avoided, which could cause us to fragment
the contiguous memory that the slot_area array is tracking. In normal
operation this level of fragmentation will be extremely rare, but we
choose a suitably large value (100) for the array. If setup_data forces
the slot_area array to become highly fragmented and there are more
slots available beyond the first 100 found, the rest will be ignored
for KASLR selection.

The function store_slot_info() is used to calculate the number of slots
available in the passed-in memory region and stores it into slot_areas[]
after adjusting for alignment and size requirements.
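
A worked example with assumed numbers (not taken from the patch): a 1G
usable region holding a 64M image with CONFIG_PHYSICAL_ALIGN of 16M would
be recorded as a single slot_area with

	num = (1024M - 64M) / 16M + 1 = 61

candidate slots, rather than 61 separate entries in slots[].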

Signed-off-by: Baoquan He <bhe@redhat.com>
[kees: rewrote changelog, squash with new functions]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 36356fb6d4cf..dc9ffceb556b 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -275,8 +275,37 @@ static bool mem_avoid_overlap(struct mem_vector *img)
 }
 
 static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
+
+struct slot_area {
+	unsigned long addr;
+	int num;
+};
+
+#define MAX_SLOT_AREA 100
+
+static struct slot_area slot_areas[MAX_SLOT_AREA];
+
 static unsigned long slot_max;
 
+static unsigned long slot_area_index;
+
+static void store_slot_info(struct mem_vector *region, unsigned long image_size)
+{
+	struct slot_area slot_area;
+
+	if (slot_area_index == MAX_SLOT_AREA)
+		return;
+
+	slot_area.addr = region->start;
+	slot_area.num = (region->size - image_size) /
+			CONFIG_PHYSICAL_ALIGN + 1;
+
+	if (slot_area.num > 0) {
+		slot_areas[slot_area_index++] = slot_area;
+		slot_max += slot_area.num;
+	}
+}
+
 static void slots_append(unsigned long addr)
 {
 	/* Overflowing the slots list should be impossible. */
-- 
2.6.3


* [kernel-hardening] [PATCH v6 06/11] x86/KASLR: Return earliest overlap when avoiding regions
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (4 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 05/11] x86/KASLR: Add slot_area to manage random_addr slots Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 07/11] x86/KASLR: Add virtual address choosing function Kees Cook
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Ingo Molnar, Baoquan He, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

In preparation for being able to detect where to split up contiguous
memory regions that overlap with memory regions to avoid, we need to
pass back what the earliest overlapping region was. This modifies the
overlap checker to return that information.

Based on a separate mem_min_overlap() implementation by Baoquan He.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index dc9ffceb556b..391960e2cb08 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -246,15 +246,24 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 #endif
 }
 
-/* Does this memory vector overlap a known avoided area? */
-static bool mem_avoid_overlap(struct mem_vector *img)
+/*
+ * Does this memory vector overlap a known avoided area? If so, record the
+ * overlap region with the lowest address.
+ */
+static bool mem_avoid_overlap(struct mem_vector *img,
+			      struct mem_vector *overlap)
 {
 	int i;
 	struct setup_data *ptr;
+	unsigned long earliest = img->start + img->size;
+	bool is_overlapping = false;
 
 	for (i = 0; i < MEM_AVOID_MAX; i++) {
-		if (mem_overlaps(img, &mem_avoid[i]))
-			return true;
+		if (mem_overlaps(img, &mem_avoid[i]) &&
+		    mem_avoid[i].start < earliest) {
+			*overlap = mem_avoid[i];
+			is_overlapping = true;
+		}
 	}
 
 	/* Avoid all entries in the setup_data linked list. */
@@ -265,13 +274,15 @@ static bool mem_avoid_overlap(struct mem_vector *img)
 		avoid.start = (unsigned long)ptr;
 		avoid.size = sizeof(*ptr) + ptr->len;
 
-		if (mem_overlaps(img, &avoid))
-			return true;
+		if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+			*overlap = avoid;
+			is_overlapping = true;
+		}
 
 		ptr = (struct setup_data *)(unsigned long)ptr->next;
 	}
 
-	return false;
+	return is_overlapping;
 }
 
 static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
@@ -328,7 +339,7 @@ static void process_e820_entry(struct e820entry *entry,
 			       unsigned long minimum,
 			       unsigned long image_size)
 {
-	struct mem_vector region, img;
+	struct mem_vector region, img, overlap;
 
 	/* Skip non-RAM entries. */
 	if (entry->type != E820_RAM)
@@ -367,7 +378,7 @@ static void process_e820_entry(struct e820entry *entry,
 	for (img.start = region.start, img.size = image_size ;
 	     mem_contains(&region, &img) ;
 	     img.start += CONFIG_PHYSICAL_ALIGN) {
-		if (mem_avoid_overlap(&img))
+		if (mem_avoid_overlap(&img, &overlap))
 			continue;
 		slots_append(img.start);
 	}
-- 
2.6.3


* [kernel-hardening] [PATCH v6 07/11] x86/KASLR: Add virtual address choosing function
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (5 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 06/11] x86/KASLR: Return earliest overlap when avoiding regions Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 08/11] x86/KASLR: Clarify purpose of each get_random_long Kees Cook
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Baoquan He, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Baoquan He <bhe@redhat.com>

To support randomizing the kernel virtual address separately from the
physical address, this patch adds find_random_virt_addr() to choose
a slot anywhere between LOAD_PHYSICAL_ADDR and KERNEL_IMAGE_SIZE.
Since this address is virtual, not physical, we can place the kernel
anywhere in this region, as long as it is aligned and (in the case of the
kernel being larger than the slot size) placed with enough room to load
the entire kernel image.
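
As a rough worked example (all values assumed for illustration): with
minimum = LOAD_PHYSICAL_ADDR = 16M, image_size = 64M, CONFIG_PHYSICAL_ALIGN
= 2M and KERNEL_IMAGE_SIZE = 1G, the function would compute

	slots = (1024M - 16M - 64M) / 2M + 1 = 473

and return 16M plus a randomly chosen slot index (0 to 472) times 2M.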

For clarity and readability, find_random_addr() is renamed to
find_random_phys_addr() and has "size" renamed to "image_size" for
clarity and to match find_random_virt_addr().

Signed-off-by: Baoquan He <bhe@redhat.com>
[kees: rewrote changelog, refactor slot calculation for readability]
[kees: renamed find_random_phys_addr() and size argument]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 391960e2cb08..05f564b6c70e 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -384,8 +384,8 @@ static void process_e820_entry(struct e820entry *entry,
 	}
 }
 
-static unsigned long find_random_addr(unsigned long minimum,
-				      unsigned long size)
+static unsigned long find_random_phys_addr(unsigned long minimum,
+					   unsigned long image_size)
 {
 	int i;
 	unsigned long addr;
@@ -395,12 +395,36 @@ static unsigned long find_random_addr(unsigned long minimum,
 
 	/* Verify potential e820 positions, appending to slots list. */
 	for (i = 0; i < boot_params->e820_entries; i++) {
-		process_e820_entry(&boot_params->e820_map[i], minimum, size);
+		process_e820_entry(&boot_params->e820_map[i], minimum,
+				   image_size);
 	}
 
 	return slots_fetch_random();
 }
 
+static unsigned long find_random_virt_addr(unsigned long minimum,
+					   unsigned long image_size)
+{
+	unsigned long slots, random_addr;
+
+	/* Make sure minimum is aligned. */
+	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
+	/* Align image_size for easy slot calculations. */
+	image_size = ALIGN(image_size, CONFIG_PHYSICAL_ALIGN);
+
+	/*
+	 * There are how many CONFIG_PHYSICAL_ALIGN-sized slots
+	 * that can hold image_size within the range of minimum to
+	 * KERNEL_IMAGE_SIZE?
+	 */
+	slots = (KERNEL_IMAGE_SIZE - minimum - image_size) /
+		 CONFIG_PHYSICAL_ALIGN + 1;
+
+	random_addr = get_random_long() % slots;
+
+	return random_addr * CONFIG_PHYSICAL_ALIGN + minimum;
+}
+
 unsigned char *choose_random_location(unsigned char *input_ptr,
 				      unsigned long input_size,
 				      unsigned char *output_ptr,
@@ -436,7 +460,7 @@ unsigned char *choose_random_location(unsigned char *input_ptr,
 	mem_avoid_init(input, input_size, output);
 
 	/* Walk e820 and find a random address. */
-	random_addr = find_random_addr(output, output_size);
+	random_addr = find_random_phys_addr(output, output_size);
 	if (!random_addr) {
 		warn("KASLR disabled: could not find suitable E820 region!");
 		goto out;
-- 
2.6.3


* [kernel-hardening] [PATCH v6 08/11] x86/KASLR: Clarify purpose of each get_random_long
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (6 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 07/11] x86/KASLR: Add virtual address choosing function Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 09/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Ingo Molnar, Baoquan He, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

KASLR will be calling get_random_long() twice, but the debug output
won't distinguish between them. This patch adds a report of whether it
is fetching the physical or the virtual address. With this, once the virtual
offset is separate, the report changes from:

KASLR using RDTSC...
KASLR using RDTSC...

into:

Physical KASLR using RDTSC...
Virtual KASLR using RDTSC...

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 05f564b6c70e..3672c7388e5a 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -72,7 +72,7 @@ static unsigned long get_random_boot(void)
 	return hash;
 }
 
-static unsigned long get_random_long(void)
+static unsigned long get_random_long(const char *purpose)
 {
 #ifdef CONFIG_X86_64
 	const unsigned long mix_const = 0x5d6008cbf3848dd3UL;
@@ -82,7 +82,8 @@ static unsigned long get_random_long(void)
 	unsigned long raw, random = get_random_boot();
 	bool use_i8254 = true;
 
-	debug_putstr("KASLR using");
+	debug_putstr(purpose);
+	debug_putstr(" KASLR using");
 
 	if (has_cpuflag(X86_FEATURE_RDRAND)) {
 		debug_putstr(" RDRAND");
@@ -332,7 +333,7 @@ static unsigned long slots_fetch_random(void)
 	if (slot_max == 0)
 		return 0;
 
-	return slots[get_random_long() % slot_max];
+	return slots[get_random_long("Physical") % slot_max];
 }
 
 static void process_e820_entry(struct e820entry *entry,
@@ -420,7 +421,7 @@ static unsigned long find_random_virt_addr(unsigned long minimum,
 	slots = (KERNEL_IMAGE_SIZE - minimum - image_size) /
 		 CONFIG_PHYSICAL_ALIGN + 1;
 
-	random_addr = get_random_long() % slots;
+	random_addr = get_random_long("Virtual") % slots;
 
 	return random_addr * CONFIG_PHYSICAL_ALIGN + minimum;
 }
-- 
2.6.3


* [kernel-hardening] [PATCH v6 09/11] x86/KASLR: Randomize virtual address separately
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (7 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 08/11] x86/KASLR: Clarify purpose of each get_random_long Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G Kees Cook
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 11/11] x86/KASLR: Allow randomization below load address Kees Cook
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Baoquan He, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Baoquan He <bhe@redhat.com>

The current KASLR implementation randomizes the physical and virtual
addresses of the kernel together (both are offset by the same amount). It
calculates the delta between the physical address where vmlinux was linked
to load and where it was actually loaded. If the delta is not equal to 0
(i.e. the kernel was relocated), relocation handling needs to be done.

On 64-bit, this patch randomizes both the physical address where the
kernel is decompressed and the virtual address where the kernel text is
mapped and will execute from. We now have two values being chosen, so
the function arguments are reorganized to be passed by pointer so they
can be updated directly. Since relocation handling only depends on the
virtual address, we must check the virtual delta, not the physical delta,
when processing kernel relocations. This also populates the page table
for the new virtual address range. 32-bit does not support a separate
virtual address, so it continues to use the physical offset for its
virtual offset.

This also updates the sanity checks done on the resulting kernel
addresses, since they may now be different from each other.

Signed-off-by: Baoquan He <bhe@redhat.com>
[kees: rewrote changelog, limited virtual split to 64-bit only, update checks]
[kees: fix CONFIG_RANDOMIZE_BASE=n boot failure]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 42 +++++++++++++++++++----------------
 arch/x86/boot/compressed/misc.c  | 47 +++++++++++++++++++++++++---------------
 arch/x86/boot/compressed/misc.h  | 22 ++++++++++---------
 3 files changed, 64 insertions(+), 47 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 3672c7388e5a..58d0871be037 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -426,10 +426,11 @@ static unsigned long find_random_virt_addr(unsigned long minimum,
 	return random_addr * CONFIG_PHYSICAL_ALIGN + minimum;
 }
 
-unsigned char *choose_random_location(unsigned char *input_ptr,
-				      unsigned long input_size,
-				      unsigned char *output_ptr,
-				      unsigned long output_size)
+void choose_random_location(unsigned char *input_ptr,
+				unsigned long input_size,
+				unsigned char **output_ptr,
+				unsigned long output_size,
+				unsigned char **virt_addr)
 {
 	/*
 	 * The caller of choose_random_location() uses unsigned char * for
@@ -439,19 +440,21 @@ unsigned char *choose_random_location(unsigned char *input_ptr,
 	 * more casts into extract_kernel, do them here and at return.
 	 */
 	unsigned long input = (unsigned long)input_ptr;
-	unsigned long output = (unsigned long)output_ptr;
-	unsigned long choice = output;
+	unsigned long output = (unsigned long)*output_ptr;
 	unsigned long random_addr;
 
+	/* By default, keep output position unchanged. */
+	*virt_addr = *output_ptr;
+
 #ifdef CONFIG_HIBERNATION
 	if (!cmdline_find_option_bool("kaslr")) {
 		warn("KASLR disabled: 'kaslr' not on cmdline (hibernation selected).");
-		goto out;
+		return;
 	}
 #else
 	if (cmdline_find_option_bool("nokaslr")) {
 		warn("KASLR disabled: 'nokaslr' on cmdline.");
-		goto out;
+		return;
 	}
 #endif
 
@@ -464,17 +467,18 @@ unsigned char *choose_random_location(unsigned char *input_ptr,
 	random_addr = find_random_phys_addr(output, output_size);
 	if (!random_addr) {
 		warn("KASLR disabled: could not find suitable E820 region!");
-		goto out;
+	} else {
+		/* Update the new physical address location. */
+		if (output != random_addr) {
+			fill_pagetable(random_addr, output_size);
+			switch_pagetable();
+			*output_ptr = (unsigned char *)random_addr;
+		}
 	}
 
-	/* Always enforce the minimum. */
-	if (random_addr < choice)
-		goto out;
-
-	choice = random_addr;
-
-	fill_pagetable(choice, output_size);
-	switch_pagetable();
-out:
-	return (unsigned char *)choice;
+	/* Pick random virtual address starting from LOAD_PHYSICAL_ADDR. */
+	if (IS_ENABLED(CONFIG_X86_64))
+		random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR,
+						 output_size);
+	*virt_addr = (unsigned char *)random_addr;
 }
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 9536d778149e..e2b2fbf08ed9 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -170,7 +170,8 @@ void __puthex(unsigned long value)
 }
 
 #if CONFIG_X86_NEED_RELOCS
-static void handle_relocations(void *output, unsigned long output_len)
+static void handle_relocations(void *output, unsigned long output_len,
+			       void *virt_addr)
 {
 	int *reloc;
 	unsigned long delta, map, ptr;
@@ -182,11 +183,6 @@ static void handle_relocations(void *output, unsigned long output_len)
 	 * and where it was actually loaded.
 	 */
 	delta = min_addr - LOAD_PHYSICAL_ADDR;
-	if (!delta) {
-		debug_putstr("No relocation needed... ");
-		return;
-	}
-	debug_putstr("Performing relocations... ");
 
 	/*
 	 * The kernel contains a table of relocation addresses. Those
@@ -198,6 +194,20 @@ static void handle_relocations(void *output, unsigned long output_len)
 	map = delta - __START_KERNEL_map;
 
 	/*
+	 * 32-bit always performs relocations. 64-bit relocations are only
+	 * needed if KASLR has chosen a different starting address offset
+	 * from __START_KERNEL_map.
+	 */
+	if (IS_ENABLED(CONFIG_X86_64))
+		delta = (unsigned long)virt_addr - LOAD_PHYSICAL_ADDR;
+
+	if (!delta) {
+		debug_putstr("No relocation needed... ");
+		return;
+	}
+	debug_putstr("Performing relocations... ");
+
+	/*
 	 * Process relocations: 32 bit relocations first then 64 bit after.
 	 * Three sets of binary relocations are added to the end of the kernel
 	 * before compression. Each relocation table entry is the kernel
@@ -250,7 +260,8 @@ static void handle_relocations(void *output, unsigned long output_len)
 #endif
 }
 #else
-static inline void handle_relocations(void *output, unsigned long output_len)
+static inline void handle_relocations(void *output, unsigned long output_len,
+				      void *virt_addr)
 { }
 #endif
 
@@ -327,7 +338,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 				  unsigned long output_len)
 {
 	const unsigned long kernel_total_size = VO__end - VO__text;
-	unsigned char *output_orig = output;
+	unsigned char *virt_addr = output;
 
 	/* Retain x86 boot parameters pointer passed from startup_32/64. */
 	boot_params = rmode;
@@ -366,12 +377,15 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	 * the entire decompressed kernel plus relocation table, or the
 	 * entire decompressed kernel plus .bss and .brk sections.
 	 */
-	output = choose_random_location(input_data, input_len, output,
-					max(output_len, kernel_total_size));
+	choose_random_location(input_data, input_len, &output,
+				max(output_len, kernel_total_size),
+				&virt_addr);
 
 	/* Validate memory location choices. */
 	if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
-		error("Destination address inappropriately aligned");
+		error("Destination physical address inappropriately aligned");
+	if ((unsigned long)virt_addr & (MIN_KERNEL_ALIGN - 1))
+		error("Destination virtual address inappropriately aligned");
 #ifdef CONFIG_X86_64
 	if (heap > 0x3fffffffffffUL)
 		error("Destination address too large");
@@ -381,19 +395,16 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 #ifndef CONFIG_RELOCATABLE
 	if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
-		error("Wrong destination address");
+		error("Destination address does not match LOAD_PHYSICAL_ADDR");
+	if (output != virt_addr)
+		error("Destination virtual address changed when not relocatable");
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	parse_elf(output);
-	/*
-	 * 32-bit always performs relocations. 64-bit relocations are only
-	 * needed if kASLR has chosen a different load address.
-	 */
-	if (!IS_ENABLED(CONFIG_X86_64) || output != output_orig)
-		handle_relocations(output, output_len);
+	handle_relocations(output, output_len, virt_addr);
 	debug_putstr("done.\nBooting the kernel.\n");
 	return output;
 }
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index ffacded0e66a..1a58fce66458 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -67,20 +67,22 @@ int cmdline_find_option_bool(const char *option);
 
 #if CONFIG_RANDOMIZE_BASE
 /* kaslr.c */
-unsigned char *choose_random_location(unsigned char *input_ptr,
-				      unsigned long input_size,
-				      unsigned char *output_ptr,
-				      unsigned long output_size);
+void choose_random_location(unsigned char *input_ptr,
+			    unsigned long input_size,
+			    unsigned char **output_ptr,
+			    unsigned long output_size,
+			    unsigned char **virt_addr);
 /* cpuflags.c */
 bool has_cpuflag(int flag);
 #else
-static inline
-unsigned char *choose_random_location(unsigned char *input_ptr,
-				      unsigned long input_size,
-				      unsigned char *output_ptr,
-				      unsigned long output_size)
+static inline void choose_random_location(unsigned char *input_ptr,
+					  unsigned long input_size,
+					  unsigned char **output_ptr,
+					  unsigned long output_size,
+					  unsigned char **virt_addr)
 {
-	return output_ptr;
+	/* No change from existing output location. */
+	*virt_addr = *output_ptr;
 }
 #endif
 
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [kernel-hardening] [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (8 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 09/11] x86/KASLR: Randomize virtual address separately Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  2016-05-06  8:27   ` [kernel-hardening] " Baoquan He
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 11/11] x86/KASLR: Allow randomization below load address Kees Cook
  10 siblings, 1 reply; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Baoquan He, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Baoquan He <bhe@redhat.com>

This patch exchanges the prior slots[] array for the new slot_areas[]
array, and lifts the KERNEL_IMAGE_SIZE limit on the kernel's physical
address for 64-bit. As before, process_e820_entry() walks memory and
populates slot_areas[], splitting on any detected mem_avoid
collisions.

Finally, since the slots[] array and its associated functions are no
longer needed, they are removed.
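
As a rough illustration of the new bookkeeping (a sketch only, assuming the
store_slot_info()/slot_areas[] definitions introduced by the earlier slot
management patches in this series), each surviving region contributes one
slot per CONFIG_PHYSICAL_ALIGN step that can still hold the image:

	static void store_slot_info(struct mem_vector *region,
				    unsigned long image_size)
	{
		struct slot_area slot_area;

		if (slot_area_index == MAX_SLOT_AREA)
			return;

		slot_area.addr = region->start;
		/* Every aligned step that still fits the image is a slot. */
		slot_area.num = (region->size - image_size) /
				CONFIG_PHYSICAL_ALIGN + 1;

		if (slot_area.num > 0) {
			slot_areas[slot_area_index++] = slot_area;
			slot_max += slot_area.num;
		}
	}

slots_fetch_random() below then draws one random index out of slot_max and
walks slot_areas[] to translate it back into a physical address.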

Signed-off-by: Baoquan He <bhe@redhat.com>
[kees: rewrote changelog, refactored goto into while]
[kees: limit 32-bit to KERNEL_IMAGE_SIZE]
[kees: squash dead-code removal into this patch]
[kees: refactored to use new mem_overlap code, renamed variables]
[kees: updated Kconfig to reflect updated implementation]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig                 |  27 +++++----
 arch/x86/boot/compressed/kaslr.c | 115 +++++++++++++++++++++++----------------
 2 files changed, 85 insertions(+), 57 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5892d549596d..39be0f7b49ef 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1943,21 +1943,26 @@ config RANDOMIZE_BASE
 	  attempts relying on knowledge of the location of kernel
 	  code internals.
 
-	  The kernel physical and virtual address can be randomized
-	  from 16MB up to 1GB on 64-bit and 512MB on 32-bit. (Note that
-	  using RANDOMIZE_BASE reduces the memory space available to
-	  kernel modules from 1.5GB to 1GB.)
+	  On 64-bit, the kernel physical and virtual addresses are
+	  randomized separately. The physical address will be anywhere
+	  between 16MB and the top of physical memory (up to 64TB). The
+	  virtual address will be randomized from 16MB up to 1GB (9 bits
+	  of entropy). Note that this also reduces the memory space
+	  available to kernel modules from 1.5GB to 1GB.
+
+	  On 32-bit, the kernel physical and virtual addresses are
+	  randomized together. They will be randomized from 16MB up to
+	  512MB (8 bits of entropy).
 
 	  Entropy is generated using the RDRAND instruction if it is
 	  supported. If RDTSC is supported, its value is mixed into
 	  the entropy pool as well. If neither RDRAND nor RDTSC are
-	  supported, then entropy is read from the i8254 timer.
-
-	  Since the kernel is built using 2GB addressing, and
-	  PHYSICAL_ALIGN must be at a minimum of 2MB, only 10 bits of
-	  entropy is theoretically possible. Currently, with the
-	  default value for PHYSICAL_ALIGN and due to page table
-	  layouts, 64-bit uses 9 bits of entropy and 32-bit uses 8 bits.
+	  supported, then entropy is read from the i8254 timer. The
+	  usable entropy is limited by the kernel being built using
+	  2GB addressing, and that PHYSICAL_ALIGN must be at a
+	  minimum of 2MB. As a result, only 10 bits of entropy is
+	  theoretically possible, but the implementations are further
+	  limited due to memory layouts.
 
 	  If CONFIG_HIBERNATE is also enabled, KASLR is disabled at boot
 	  time. To enable it, boot with "kaslr" on the kernel command
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 58d0871be037..8cf705826bdc 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -125,17 +125,6 @@ struct mem_vector {
 #define MEM_AVOID_MAX 4
 static struct mem_vector mem_avoid[MEM_AVOID_MAX];
 
-static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
-{
-	/* Item at least partially before region. */
-	if (item->start < region->start)
-		return false;
-	/* Item at least partially after region. */
-	if (item->start + item->size > region->start + region->size)
-		return false;
-	return true;
-}
-
 static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
 {
 	/* Item one is entirely before item two. */
@@ -286,8 +275,6 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 	return is_overlapping;
 }
 
-static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
-
 struct slot_area {
 	unsigned long addr;
 	int num;
@@ -318,36 +305,44 @@ static void store_slot_info(struct mem_vector *region, unsigned long image_size)
 	}
 }
 
-static void slots_append(unsigned long addr)
-{
-	/* Overflowing the slots list should be impossible. */
-	if (slot_max >= KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN)
-		return;
-
-	slots[slot_max++] = addr;
-}
-
 static unsigned long slots_fetch_random(void)
 {
+	unsigned long slot;
+	int i;
+
 	/* Handle case of no slots stored. */
 	if (slot_max == 0)
 		return 0;
 
-	return slots[get_random_long("Physical") % slot_max];
+	slot = get_random_long("Physical") % slot_max;
+
+	for (i = 0; i < slot_area_index; i++) {
+		if (slot >= slot_areas[i].num) {
+			slot -= slot_areas[i].num;
+			continue;
+		}
+		return slot_areas[i].addr + slot * CONFIG_PHYSICAL_ALIGN;
+	}
+
+	if (i == slot_area_index)
+		debug_putstr("slots_fetch_random() failed!?\n");
+	return 0;
 }
 
 static void process_e820_entry(struct e820entry *entry,
 			       unsigned long minimum,
 			       unsigned long image_size)
 {
-	struct mem_vector region, img, overlap;
+	struct mem_vector region, overlap;
+	struct slot_area slot_area;
+	unsigned long start_orig;
 
 	/* Skip non-RAM entries. */
 	if (entry->type != E820_RAM)
 		return;
 
-	/* Ignore entries entirely above our maximum. */
-	if (entry->addr >= KERNEL_IMAGE_SIZE)
+	/* On 32-bit, ignore entries entirely above our maximum. */
+	if (IS_ENABLED(CONFIG_X86_32) && entry->addr >= KERNEL_IMAGE_SIZE)
 		return;
 
 	/* Ignore entries entirely below our minimum. */
@@ -357,31 +352,55 @@ static void process_e820_entry(struct e820entry *entry,
 	region.start = entry->addr;
 	region.size = entry->size;
 
-	/* Potentially raise address to minimum location. */
-	if (region.start < minimum)
-		region.start = minimum;
+	/* Give up if slot area array is full. */
+	while (slot_area_index < MAX_SLOT_AREA) {
+		start_orig = region.start;
 
-	/* Potentially raise address to meet alignment requirements. */
-	region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
+		/* Potentially raise address to minimum location. */
+		if (region.start < minimum)
+			region.start = minimum;
 
-	/* Did we raise the address above the bounds of this e820 region? */
-	if (region.start > entry->addr + entry->size)
-		return;
+		/* Potentially raise address to meet alignment needs. */
+		region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
 
-	/* Reduce size by any delta from the original address. */
-	region.size -= region.start - entry->addr;
+		/* Did we raise the address above this e820 region? */
+		if (region.start > entry->addr + entry->size)
+			return;
 
-	/* Reduce maximum size to fit end of image within maximum limit. */
-	if (region.start + region.size > KERNEL_IMAGE_SIZE)
-		region.size = KERNEL_IMAGE_SIZE - region.start;
+		/* Reduce size by any delta from the original address. */
+		region.size -= region.start - start_orig;
 
-	/* Walk each aligned slot and check for avoided areas. */
-	for (img.start = region.start, img.size = image_size ;
-	     mem_contains(&region, &img) ;
-	     img.start += CONFIG_PHYSICAL_ALIGN) {
-		if (mem_avoid_overlap(&img, &overlap))
-			continue;
-		slots_append(img.start);
+		/* On 32-bit, reduce region size to fit within max size. */
+		if (IS_ENABLED(CONFIG_X86_32) &&
+		    region.start + region.size > KERNEL_IMAGE_SIZE)
+			region.size = KERNEL_IMAGE_SIZE - region.start;
+
+		/* Return if region can't contain decompressed kernel */
+		if (region.size < image_size)
+			return;
+
+		/* If nothing overlaps, store the region and return. */
+		if (!mem_avoid_overlap(&region, &overlap)) {
+			store_slot_info(&region, image_size);
+			return;
+		}
+
+		/* Store beginning of region if holds at least image_size. */
+		if (overlap.start > region.start + image_size) {
+			struct mem_vector beginning;
+
+			beginning.start = region.start;
+			beginning.size = overlap.start - region.start;
+			store_slot_info(&beginning, image_size);
+		}
+
+		/* Return if overlap extends to or past end of region. */
+		if (overlap.start + overlap.size >= region.start + region.size)
+			return;
+
+		/* Clip off the overlapping region and start over. */
+		region.size -= overlap.start - region.start + overlap.size;
+		region.start = overlap.start + overlap.size;
 	}
 }
 
@@ -398,6 +417,10 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
 	for (i = 0; i < boot_params->e820_entries; i++) {
 		process_e820_entry(&boot_params->e820_map[i], minimum,
 				   image_size);
+		if (slot_area_index == MAX_SLOT_AREA) {
+			debug_putstr("Aborted e820 scan (slot_areas full)!\n");
+			break;
+		}
 	}
 
 	return slots_fetch_random();
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [kernel-hardening] [PATCH v6 11/11] x86/KASLR: Allow randomization below load address
  2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
                   ` (9 preceding siblings ...)
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G Kees Cook
@ 2016-05-05 22:13 ` Kees Cook
  10 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-05 22:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kees Cook, Yinghai Lu, Ingo Molnar, Baoquan He, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

From: Yinghai Lu <yinghai@kernel.org>

Currently the physical randomization's lower boundary is the load
address. For bootloaders that load kernels into very high memory
(e.g. kexec), this means randomization takes place in a very small window
at the top of memory, ignoring the large region of physical memory below
the load address.

Since mem_avoid is already correctly tracking the regions that must be
avoided, this patch changes the minimum address to whichever is lower:
512M (to conservatively avoid unknown things in lower memory) or the
load address. For example, if the kernel is loaded at 8G, [512M, 8G)
is now added to the set of possible physical memory positions.
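
(Illustration only, not part of the patch: with the hunk below, a
hypothetical kexec placement at 8G would behave roughly like this.)

	/* Sketch: kernel loaded high by the bootloader, e.g. at 8GB. */
	unsigned long output   = 8UL << 30;
	unsigned long min_addr = min(output, 512UL << 20);	/* -> 512MB */

	/*
	 * find_random_phys_addr() now starts its e820 walk at 512MB rather
	 * than at 8GB, so [512MB, 8GB) joins the candidate slots; mem_avoid
	 * still excludes the compressed image, boot_params, cmdline, etc.
	 */
	random_addr = find_random_phys_addr(min_addr, output_size);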

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
[kees: rewrote changelog, refactor to use min()]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/boot/compressed/kaslr.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 8cf705826bdc..ace41822d16f 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -464,7 +464,7 @@ void choose_random_location(unsigned char *input_ptr,
 	 */
 	unsigned long input = (unsigned long)input_ptr;
 	unsigned long output = (unsigned long)*output_ptr;
-	unsigned long random_addr;
+	unsigned long random_addr, min_addr;
 
 	/* By default, keep output position unchanged. */
 	*virt_addr = *output_ptr;
@@ -486,8 +486,11 @@ void choose_random_location(unsigned char *input_ptr,
 	/* Record the various known unsafe memory ranges. */
 	mem_avoid_init(input, input_size, output);
 
+	/* Low end should be the smaller of 512M or initial location. */
+	min_addr = min(output, 512UL << 20);
+
 	/* Walk e820 and find a random address. */
-	random_addr = find_random_phys_addr(output, output_size);
+	random_addr = find_random_phys_addr(min_addr, output_size);
 	if (!random_addr) {
 		warn("KASLR disabled: could not find suitable E820 region!");
 	} else {
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [kernel-hardening] Re: [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand Kees Cook
@ 2016-05-06  7:00   ` Ingo Molnar
  2016-05-06 17:44     ` Kees Cook
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2016-05-06  7:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Yinghai Lu, Jiri Kosina, Borislav Petkov, Baoquan He,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, Vivek Goyal,
	Andy Lutomirski, lasse.collin, Andrew Morton, Dave Young,
	kernel-hardening, LKML


* Kees Cook <keescook@chromium.org> wrote:

> From: Yinghai Lu <yinghai@kernel.org>
> 
> Currently KASLR only supports relocation in a small physical range (from
> 16M to 1G), due to using the initial kernel page table identity mapping.
> To support ranges above this, we need to have an identity mapping for the
> desired memory range before we can decompress (and later run) the kernel.
> 
> 32-bit kernels already have the needed identity mapping. This patch adds
> identity mappings for the needed memory ranges on 64-bit kernels. This
> happens in two possible boot paths:
> 
> If loaded via startup_32, we need to set up the identity mapping we need.
> 
> If loaded from a 64-bit bootloader, the bootloader will have
> already set up an identity mapping, and we'll start via ZO's
> (arch/x86/boot/compressed/vmlinux) startup_64. In this case, the
> bootloader's page tables need to be avoided when we select the new VO
> (vmlinux) location. If not, the decompressor could overwrite them during
> decompression.
> 
> To accomplish avoiding the bootloader's page tables, we could walk the
> pagetable and find every page that is used, and add them to mem_avoid,
> but this needs extra code and will require increasing the size of the
> mem_avoid array.
> 
> Instead, we can create a new ident mapping instead, and pages for the
> pagetable will come from the _pagetable section of ZO, which means they
> are already in mem_avoid array. To do this, we reuse the code for the
> kernel's identity mapping.
> 
> The _pgtable will be shared by 32-bit and 64-bit path to reduce init_size,
> as now ZO _rodata to _end will contribute init_size.
> 
> To handle the possible mappings, we need to increase the pgt buffer size:
> 
> When booting via startup_64, as we need to cover the old VO, params,
> cmdline and new VO. In an extreme case we could have them all beyond the
> 512G boundary, which needs (2+2)*4 pages with 2M mappings. And we'll
> need 2 for first 2M for VGA RAM. One more is needed for level4. This
> gets us to 19 pages total.
> 
> When booting via startup_32, KASLR could move the new VO above 4G, so we
> need to set extra identity mappings for the VO, which should only need
> (2+2) pages at most when it is beyond the 512G boundary. So 19 pages is
> sufficient for this case as well.

In changelogs and comments please refer to C functions and symbols that point to 
executable code via '()', i.e. startup_64(), etc.

> +#ifdef CONFIG_X86_VERBOSE_BOOTUP
> +	/* for video ram */

Please capitalize RAM and generally free flowing comment sentences, i.e.:

> +	/* For video RAM: */

> +#ifdef CONFIG_X86_64
> +void fill_pagetable(unsigned long start, unsigned long size);
> +void switch_pagetable(void);

These are very ambiguous function names. Which pagetables are these? Kernel or 
user? Is it boot time or final page tables? etc.

Also what does the switching do?

> --- /dev/null
> +++ b/arch/x86/boot/compressed/misc_pgt.c
> @@ -0,0 +1,93 @@
> +#define __pa(x)  ((unsigned long)(x))
> +#define __va(x)  ((void *)((unsigned long)(x)))
> +
> +#include "misc.h"
> +
> +#include <asm/init.h>
> +#include <asm/pgtable.h>
> +
> +#include "../../mm/ident_map.c"
> +#include "../string.h"

Again a new .c file with no comments whatsoever :-(

Also, we just decided that 'misc.c' was a bad name. Is there really no better name 
than misc_pgt.c?

> +struct alloc_pgt_data {
> +	unsigned char *pgt_buf;
> +	unsigned long pgt_buf_size;
> +	unsigned long pgt_buf_offset;
> +};
> +
> +static void *alloc_pgt_page(void *context)

Non-obvious functions with no comments describing them.

> +unsigned long __force_order;
> +static struct alloc_pgt_data pgt_data;
> +static struct x86_mapping_info mapping_info;
> +static pgd_t *level4p;

What's this __force_order flag? Why does it have double underscores?

> +{
> +	unsigned long end = start + size;
> +
> +	if (!level4p) {
> +		pgt_data.pgt_buf_offset = 0;
> +		mapping_info.alloc_pgt_page = alloc_pgt_page;
> +		mapping_info.context = &pgt_data;
> +		mapping_info.pmd_flag = __PAGE_KERNEL_LARGE_EXEC;
> +
> +		/*
> +		 * come from startup_32 ?
> +		 * then cr3 is _pgtable, we can reuse it.

come what? What does this comment mean??

> +		 */
> +		level4p = (pgd_t *)read_cr3();

Argh, another type cast. A quick check shows:

+static pgd_t *level4p;
+       if (!level4p) {
+               level4p = (pgd_t *)read_cr3();
+               if ((unsigned long)level4p == (unsigned long)_pgtable) {
+                       level4p = (pgd_t *)alloc_pgt_page(&pgt_data);
+       kernel_ident_mapping_init(&mapping_info, level4p, start, end);
+       write_cr3((unsigned long)level4p);

... that the natural type for level4p would be unsigned long, not pgd_t * ...
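
(A minimal sketch of that suggestion, purely illustrative and not from the
patch:)

	static unsigned long level4p;	/* CR3 value kept as a plain integer */
	...
	level4p = read_cr3();
	if (level4p == (unsigned long)_pgtable)
		level4p = (unsigned long)alloc_pgt_page(&pgt_data);
	...
	kernel_ident_mapping_init(&mapping_info, (pgd_t *)level4p, start, end);
	write_cr3(level4p);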


> +		if ((unsigned long)level4p == (unsigned long)_pgtable) {
> +			pgt_data.pgt_buf = (unsigned char *)_pgtable +
> +						 BOOT_INIT_PGT_SIZE;
> +			pgt_data.pgt_buf_size = BOOT_PGT_SIZE -
> +						 BOOT_INIT_PGT_SIZE;
> +			memset((unsigned char *)pgt_data.pgt_buf, 0,
> +				pgt_data.pgt_buf_size);

Is that (unsigned char *) type cast really necessary??

Type casts are the absolute exception, we avoid them whenever possible - and this 
code is using them like candy :-(

Also, for heaven's sake, either ignore checkpatch.pl whining, or avoid the 
excessive indentation by using helper functions (or some other technique).

> +			debug_putstr("boot via startup_32\n");
> +		} else {
> +			pgt_data.pgt_buf = (unsigned char *)_pgtable;
> +			pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
> +			memset((unsigned char *)pgt_data.pgt_buf, 0,
> +				pgt_data.pgt_buf_size);
> +			debug_putstr("boot via startup_64\n");
> +			level4p = (pgd_t *)alloc_pgt_page(&pgt_data);
> +		}
> +	}
> +
> +	/* align boundary to 2M */
> +	start = round_down(start, PMD_SIZE);
> +	end = round_up(end, PMD_SIZE);
> +	if (start >= end)
> +		return;
> +
> +	kernel_ident_mapping_init(&mapping_info, level4p, start, end);
> +}
> +
> +void switch_pagetable(void)
> +{
> +	write_cr3((unsigned long)level4p);
> +}
> diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
> index 6b8d6e8cd449..52a9cbc1419f 100644
> --- a/arch/x86/include/asm/boot.h
> +++ b/arch/x86/include/asm/boot.h
> @@ -32,7 +32,26 @@
>  #endif /* !CONFIG_KERNEL_BZIP2 */
>  
>  #ifdef CONFIG_X86_64
> +
>  #define BOOT_STACK_SIZE	0x4000
> +
> +#define BOOT_INIT_PGT_SIZE (6*4096)
> +#ifdef CONFIG_RANDOMIZE_BASE
> +/*
> + * 1 page for level4, 2 pages for first 2M.
> + * (2+2)*4 pages for kernel, param, cmd_line, random kernel
> + * if all cross 512G boundary.
> + * So total will be 19 pages.
> + */
> +#ifdef CONFIG_X86_VERBOSE_BOOTUP
> +#define BOOT_PGT_SIZE (19*4096)
> +#else
> +#define BOOT_PGT_SIZE (17*4096)
> +#endif
> +#else
> +#define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
> +#endif

Please use proper nesting of defines to make it more readable, i.e. something 
like:

#define BOOT_INIT_PGT_SIZE (6*4096)

#ifdef CONFIG_RANDOMIZE_BASE
/*
 * 1 page for level4, 2 pages for first 2M.
 * (2+2)*4 pages for kernel, param, cmd_line, random kernel
 * if all cross 512G boundary.
 * So total will be 19 pages.
 */
# ifdef CONFIG_X86_VERBOSE_BOOTUP
#  define BOOT_PGT_SIZE (19*4096)
# else
#  define BOOT_PGT_SIZE (17*4096)
# endif
#else
# define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
#endif

but more importantly, BOOT_PGT_SIZE is really a bad name for this. Since it's used 
by an allocator it's clear that this is a _maximum_. Why not put that fact into 
the name??

Basically you took a butt-ugly patch from Yinghai that shat all over the kernel 
and rewrote its changelog - but that's not enough to make it acceptable! As a 
general rule, most Yinghai patches I've seen in the past couple of years were 
butt-ugly. If you take a patch from Yinghai you have to completely rewrite it in 
most cases.

So I'm really annoyed, I see myself repeating the same kind of review feedback I 
gave to Yinghai again and again, and now to you?

If it's less work for you then please rewrite Yinghai's patches from scratch - and 
possibly split them up as well wherever possible, as they are risky as hell.

I've applied the first two patches because they look OK, but _please_ do a proper 
job with the rest of the series as well!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [kernel-hardening] Re: [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G
  2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G Kees Cook
@ 2016-05-06  8:27   ` Baoquan He
  2016-05-06 15:31     ` Kees Cook
  0 siblings, 1 reply; 17+ messages in thread
From: Baoquan He @ 2016-05-06  8:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ingo Molnar, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, lasse.collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

Hi Kees,

On 05/05/16 at 03:13pm, Kees Cook wrote:
> From: Baoquan He <bhe@redhat.com>
> 
> This patch exchanges the prior slots[] array for the new slot_areas[]
> array, and lifts the limitation of KERNEL_IMAGE_SIZE on the physical
> address offset for 64-bit. As before, process_e820_entry() walks

Sorry, I didn't get what you said about lifting the limitation for
64-bit. Do you mean 32-bit?

Thanks
Baoquan


> memory and populates slot_areas[], splitting on any detected mem_avoid
> collisions.
> 
> Finally, since the slots[] array and its associated functions are not
> needed any more, so they are removed.
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> [kees: rewrote changelog, refactored goto into while]
> [kees: limit 32-bit to KERNEL_IMAGE_SIZE]
> [kees: squash dead-code removal into this patch]
> [kees: refactored to use new mem_overlap code, renamed variables]
> [kees: updated Kconfig to reflect updated implementation]
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/Kconfig                 |  27 +++++----
>  arch/x86/boot/compressed/kaslr.c | 115 +++++++++++++++++++++++----------------
>  2 files changed, 85 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5892d549596d..39be0f7b49ef 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1943,21 +1943,26 @@ config RANDOMIZE_BASE
>  	  attempts relying on knowledge of the location of kernel
>  	  code internals.
>  
> -	  The kernel physical and virtual address can be randomized
> -	  from 16MB up to 1GB on 64-bit and 512MB on 32-bit. (Note that
> -	  using RANDOMIZE_BASE reduces the memory space available to
> -	  kernel modules from 1.5GB to 1GB.)
> +	  On 64-bit, the kernel physical and virtual addresses are
> +	  randomized separately. The physical address will be anywhere
> +	  between 16MB and the top of physical memory (up to 64TB). The
> +	  virtual address will be randomized from 16MB up to 1GB (9 bits
> +	  of entropy). Note that this also reduces the memory space
> +	  available to kernel modules from 1.5GB to 1GB.
> +
> +	  On 32-bit, the kernel physical and virtual addresses are
> +	  randomized together. They will be randomized from 16MB up to
> +	  512MB (8 bits of entropy).
>  
>  	  Entropy is generated using the RDRAND instruction if it is
>  	  supported. If RDTSC is supported, its value is mixed into
>  	  the entropy pool as well. If neither RDRAND nor RDTSC are
> -	  supported, then entropy is read from the i8254 timer.
> -
> -	  Since the kernel is built using 2GB addressing, and
> -	  PHYSICAL_ALIGN must be at a minimum of 2MB, only 10 bits of
> -	  entropy is theoretically possible. Currently, with the
> -	  default value for PHYSICAL_ALIGN and due to page table
> -	  layouts, 64-bit uses 9 bits of entropy and 32-bit uses 8 bits.
> +	  supported, then entropy is read from the i8254 timer. The
> +	  usable entropy is limited by the kernel being built using
> +	  2GB addressing, and that PHYSICAL_ALIGN must be at a
> +	  minimum of 2MB. As a result, only 10 bits of entropy is
> +	  theoretically possible, but the implementations are further
> +	  limited due to memory layouts.
>  
>  	  If CONFIG_HIBERNATE is also enabled, KASLR is disabled at boot
>  	  time. To enable it, boot with "kaslr" on the kernel command
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 58d0871be037..8cf705826bdc 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -125,17 +125,6 @@ struct mem_vector {
>  #define MEM_AVOID_MAX 4
>  static struct mem_vector mem_avoid[MEM_AVOID_MAX];
>  
> -static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
> -{
> -	/* Item at least partially before region. */
> -	if (item->start < region->start)
> -		return false;
> -	/* Item at least partially after region. */
> -	if (item->start + item->size > region->start + region->size)
> -		return false;
> -	return true;
> -}
> -
>  static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
>  {
>  	/* Item one is entirely before item two. */
> @@ -286,8 +275,6 @@ static bool mem_avoid_overlap(struct mem_vector *img,
>  	return is_overlapping;
>  }
>  
> -static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
> -
>  struct slot_area {
>  	unsigned long addr;
>  	int num;
> @@ -318,36 +305,44 @@ static void store_slot_info(struct mem_vector *region, unsigned long image_size)
>  	}
>  }
>  
> -static void slots_append(unsigned long addr)
> -{
> -	/* Overflowing the slots list should be impossible. */
> -	if (slot_max >= KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN)
> -		return;
> -
> -	slots[slot_max++] = addr;
> -}
> -
>  static unsigned long slots_fetch_random(void)
>  {
> +	unsigned long slot;
> +	int i;
> +
>  	/* Handle case of no slots stored. */
>  	if (slot_max == 0)
>  		return 0;
>  
> -	return slots[get_random_long("Physical") % slot_max];
> +	slot = get_random_long("Physical") % slot_max;
> +
> +	for (i = 0; i < slot_area_index; i++) {
> +		if (slot >= slot_areas[i].num) {
> +			slot -= slot_areas[i].num;
> +			continue;
> +		}
> +		return slot_areas[i].addr + slot * CONFIG_PHYSICAL_ALIGN;
> +	}
> +
> +	if (i == slot_area_index)
> +		debug_putstr("slots_fetch_random() failed!?\n");
> +	return 0;
>  }
>  
>  static void process_e820_entry(struct e820entry *entry,
>  			       unsigned long minimum,
>  			       unsigned long image_size)
>  {
> -	struct mem_vector region, img, overlap;
> +	struct mem_vector region, overlap;
> +	struct slot_area slot_area;
> +	unsigned long start_orig;
>  
>  	/* Skip non-RAM entries. */
>  	if (entry->type != E820_RAM)
>  		return;
>  
> -	/* Ignore entries entirely above our maximum. */
> -	if (entry->addr >= KERNEL_IMAGE_SIZE)
> +	/* On 32-bit, ignore entries entirely above our maximum. */
> +	if (IS_ENABLED(CONFIG_X86_32) && entry->addr >= KERNEL_IMAGE_SIZE)
>  		return;
>  
>  	/* Ignore entries entirely below our minimum. */
> @@ -357,31 +352,55 @@ static void process_e820_entry(struct e820entry *entry,
>  	region.start = entry->addr;
>  	region.size = entry->size;
>  
> -	/* Potentially raise address to minimum location. */
> -	if (region.start < minimum)
> -		region.start = minimum;
> +	/* Give up if slot area array is full. */
> +	while (slot_area_index < MAX_SLOT_AREA) {
> +		start_orig = region.start;
>  
> -	/* Potentially raise address to meet alignment requirements. */
> -	region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
> +		/* Potentially raise address to minimum location. */
> +		if (region.start < minimum)
> +			region.start = minimum;
>  
> -	/* Did we raise the address above the bounds of this e820 region? */
> -	if (region.start > entry->addr + entry->size)
> -		return;
> +		/* Potentially raise address to meet alignment needs. */
> +		region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
>  
> -	/* Reduce size by any delta from the original address. */
> -	region.size -= region.start - entry->addr;
> +		/* Did we raise the address above this e820 region? */
> +		if (region.start > entry->addr + entry->size)
> +			return;
>  
> -	/* Reduce maximum size to fit end of image within maximum limit. */
> -	if (region.start + region.size > KERNEL_IMAGE_SIZE)
> -		region.size = KERNEL_IMAGE_SIZE - region.start;
> +		/* Reduce size by any delta from the original address. */
> +		region.size -= region.start - start_orig;
>  
> -	/* Walk each aligned slot and check for avoided areas. */
> -	for (img.start = region.start, img.size = image_size ;
> -	     mem_contains(&region, &img) ;
> -	     img.start += CONFIG_PHYSICAL_ALIGN) {
> -		if (mem_avoid_overlap(&img, &overlap))
> -			continue;
> -		slots_append(img.start);
> +		/* On 32-bit, reduce region size to fit within max size. */
> +		if (IS_ENABLED(CONFIG_X86_32) &&
> +		    region.start + region.size > KERNEL_IMAGE_SIZE)
> +			region.size = KERNEL_IMAGE_SIZE - region.start;
> +
> +		/* Return if region can't contain decompressed kernel */
> +		if (region.size < image_size)
> +			return;
> +
> +		/* If nothing overlaps, store the region and return. */
> +		if (!mem_avoid_overlap(&region, &overlap)) {
> +			store_slot_info(&region, image_size);
> +			return;
> +		}
> +
> +		/* Store beginning of region if holds at least image_size. */
> +		if (overlap.start > region.start + image_size) {
> +			struct mem_vector beginning;
> +
> +			beginning.start = region.start;
> +			beginning.size = overlap.start - region.start;
> +			store_slot_info(&beginning, image_size);
> +		}
> +
> +		/* Return if overlap extends to or past end of region. */
> +		if (overlap.start + overlap.size >= region.start + region.size)
> +			return;
> +
> +		/* Clip off the overlapping region and start over. */
> +		region.size -= overlap.start - region.start + overlap.size;
> +		region.start = overlap.start + overlap.size;
>  	}
>  }
>  
> @@ -398,6 +417,10 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
>  	for (i = 0; i < boot_params->e820_entries; i++) {
>  		process_e820_entry(&boot_params->e820_map[i], minimum,
>  				   image_size);
> +		if (slot_area_index == MAX_SLOT_AREA) {
> +			debug_putstr("Aborted e820 scan (slot_areas full)!\n");
> +			break;
> +		}
>  	}
>  
>  	return slots_fetch_random();
> -- 
> 2.6.3
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [kernel-hardening] Re: [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G
  2016-05-06  8:27   ` [kernel-hardening] " Baoquan He
@ 2016-05-06 15:31     ` Kees Cook
  2016-05-08  9:17       ` Baoquan He
  0 siblings, 1 reply; 17+ messages in thread
From: Kees Cook @ 2016-05-06 15:31 UTC (permalink / raw)
  To: Baoquan He
  Cc: Ingo Molnar, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, Lasse Collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

On Fri, May 6, 2016 at 1:27 AM, Baoquan He <bhe@redhat.com> wrote:
> Hi Kees,
>
> On 05/05/16 at 03:13pm, Kees Cook wrote:
>> From: Baoquan He <bhe@redhat.com>
>>
>> This patch exchanges the prior slots[] array for the new slot_areas[]
>> array, and lifts the limitation of KERNEL_IMAGE_SIZE on the physical
>> address offset for 64-bit. As before, process_e820_entry() walks
>
> Sorry, I didn't get what you said about lifting the limitation for
> 64-bit. Do you mean 32-bit?

Prior to this patch, physical address randomization was limited to
KERNEL_IMAGE_SIZE on x86_64. After this patch, that limit is lifted
for x86_64, while 32-bit still uses KERNEL_IMAGE_SIZE as the upper
bound on the physical position.

>
> Thanks
> Baoquan
>
>
>> memory and populates slot_areas[], splitting on any detected mem_avoid
>> collisions.
>>
>> Finally, since the slots[] array and its associated functions are not
>> needed any more, so they are removed.
>>
>> Signed-off-by: Baoquan He <bhe@redhat.com>
>> [kees: rewrote changelog, refactored goto into while]
>> [kees: limit 32-bit to KERNEL_IMAGE_SIZE]
>> [kees: squash dead-code removal into this patch]
>> [kees: refactored to use new mem_overlap code, renamed variables]
>> [kees: updated Kconfig to reflect updated implementation]
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> ---
>>  arch/x86/Kconfig                 |  27 +++++----
>>  arch/x86/boot/compressed/kaslr.c | 115 +++++++++++++++++++++++----------------
>>  2 files changed, 85 insertions(+), 57 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 5892d549596d..39be0f7b49ef 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1943,21 +1943,26 @@ config RANDOMIZE_BASE
>>         attempts relying on knowledge of the location of kernel
>>         code internals.
>>
>> -       The kernel physical and virtual address can be randomized
>> -       from 16MB up to 1GB on 64-bit and 512MB on 32-bit. (Note that
>> -       using RANDOMIZE_BASE reduces the memory space available to
>> -       kernel modules from 1.5GB to 1GB.)
>> +       On 64-bit, the kernel physical and virtual addresses are
>> +       randomized separately. The physical address will be anywhere
>> +       between 16MB and the top of physical memory (up to 64TB). The
>> +       virtual address will be randomized from 16MB up to 1GB (9 bits
>> +       of entropy). Note that this also reduces the memory space
>> +       available to kernel modules from 1.5GB to 1GB.
>> +
>> +       On 32-bit, the kernel physical and virtual addresses are
>> +       randomized together. They will be randomized from 16MB up to
>> +       512MB (8 bits of entropy).
>>
>>         Entropy is generated using the RDRAND instruction if it is
>>         supported. If RDTSC is supported, its value is mixed into
>>         the entropy pool as well. If neither RDRAND nor RDTSC are
>> -       supported, then entropy is read from the i8254 timer.
>> -
>> -       Since the kernel is built using 2GB addressing, and
>> -       PHYSICAL_ALIGN must be at a minimum of 2MB, only 10 bits of
>> -       entropy is theoretically possible. Currently, with the
>> -       default value for PHYSICAL_ALIGN and due to page table
>> -       layouts, 64-bit uses 9 bits of entropy and 32-bit uses 8 bits.
>> +       supported, then entropy is read from the i8254 timer. The
>> +       usable entropy is limited by the kernel being built using
>> +       2GB addressing, and that PHYSICAL_ALIGN must be at a
>> +       minimum of 2MB. As a result, only 10 bits of entropy is
>> +       theoretically possible, but the implementations are further
>> +       limited due to memory layouts.
>>
>>         If CONFIG_HIBERNATE is also enabled, KASLR is disabled at boot
>>         time. To enable it, boot with "kaslr" on the kernel command
>> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
>> index 58d0871be037..8cf705826bdc 100644
>> --- a/arch/x86/boot/compressed/kaslr.c
>> +++ b/arch/x86/boot/compressed/kaslr.c
>> @@ -125,17 +125,6 @@ struct mem_vector {
>>  #define MEM_AVOID_MAX 4
>>  static struct mem_vector mem_avoid[MEM_AVOID_MAX];
>>
>> -static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
>> -{
>> -     /* Item at least partially before region. */
>> -     if (item->start < region->start)
>> -             return false;
>> -     /* Item at least partially after region. */
>> -     if (item->start + item->size > region->start + region->size)
>> -             return false;
>> -     return true;
>> -}
>> -
>>  static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
>>  {
>>       /* Item one is entirely before item two. */
>> @@ -286,8 +275,6 @@ static bool mem_avoid_overlap(struct mem_vector *img,
>>       return is_overlapping;
>>  }
>>
>> -static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
>> -
>>  struct slot_area {
>>       unsigned long addr;
>>       int num;
>> @@ -318,36 +305,44 @@ static void store_slot_info(struct mem_vector *region, unsigned long image_size)
>>       }
>>  }
>>
>> -static void slots_append(unsigned long addr)
>> -{
>> -     /* Overflowing the slots list should be impossible. */
>> -     if (slot_max >= KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN)
>> -             return;
>> -
>> -     slots[slot_max++] = addr;
>> -}
>> -
>>  static unsigned long slots_fetch_random(void)
>>  {
>> +     unsigned long slot;
>> +     int i;
>> +
>>       /* Handle case of no slots stored. */
>>       if (slot_max == 0)
>>               return 0;
>>
>> -     return slots[get_random_long("Physical") % slot_max];
>> +     slot = get_random_long("Physical") % slot_max;
>> +
>> +     for (i = 0; i < slot_area_index; i++) {
>> +             if (slot >= slot_areas[i].num) {
>> +                     slot -= slot_areas[i].num;
>> +                     continue;
>> +             }
>> +             return slot_areas[i].addr + slot * CONFIG_PHYSICAL_ALIGN;
>> +     }
>> +
>> +     if (i == slot_area_index)
>> +             debug_putstr("slots_fetch_random() failed!?\n");
>> +     return 0;
>>  }
>>
>>  static void process_e820_entry(struct e820entry *entry,
>>                              unsigned long minimum,
>>                              unsigned long image_size)
>>  {
>> -     struct mem_vector region, img, overlap;
>> +     struct mem_vector region, overlap;
>> +     struct slot_area slot_area;
>> +     unsigned long start_orig;
>>
>>       /* Skip non-RAM entries. */
>>       if (entry->type != E820_RAM)
>>               return;
>>
>> -     /* Ignore entries entirely above our maximum. */
>> -     if (entry->addr >= KERNEL_IMAGE_SIZE)
>> +     /* On 32-bit, ignore entries entirely above our maximum. */
>> +     if (IS_ENABLED(CONFIG_X86_32) && entry->addr >= KERNEL_IMAGE_SIZE)
>>               return;

Since physical and virtual addresses are not randomized separately on
32-bit, the physical address selection effectively becomes the virtual
address selection, and as such it cannot be above KERNEL_IMAGE_SIZE.

-Kees

>>
>>       /* Ignore entries entirely below our minimum. */
>> @@ -357,31 +352,55 @@ static void process_e820_entry(struct e820entry *entry,
>>       region.start = entry->addr;
>>       region.size = entry->size;
>>
>> -     /* Potentially raise address to minimum location. */
>> -     if (region.start < minimum)
>> -             region.start = minimum;
>> +     /* Give up if slot area array is full. */
>> +     while (slot_area_index < MAX_SLOT_AREA) {
>> +             start_orig = region.start;
>>
>> -     /* Potentially raise address to meet alignment requirements. */
>> -     region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
>> +             /* Potentially raise address to minimum location. */
>> +             if (region.start < minimum)
>> +                     region.start = minimum;
>>
>> -     /* Did we raise the address above the bounds of this e820 region? */
>> -     if (region.start > entry->addr + entry->size)
>> -             return;
>> +             /* Potentially raise address to meet alignment needs. */
>> +             region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
>>
>> -     /* Reduce size by any delta from the original address. */
>> -     region.size -= region.start - entry->addr;
>> +             /* Did we raise the address above this e820 region? */
>> +             if (region.start > entry->addr + entry->size)
>> +                     return;
>>
>> -     /* Reduce maximum size to fit end of image within maximum limit. */
>> -     if (region.start + region.size > KERNEL_IMAGE_SIZE)
>> -             region.size = KERNEL_IMAGE_SIZE - region.start;
>> +             /* Reduce size by any delta from the original address. */
>> +             region.size -= region.start - start_orig;
>>
>> -     /* Walk each aligned slot and check for avoided areas. */
>> -     for (img.start = region.start, img.size = image_size ;
>> -          mem_contains(&region, &img) ;
>> -          img.start += CONFIG_PHYSICAL_ALIGN) {
>> -             if (mem_avoid_overlap(&img, &overlap))
>> -                     continue;
>> -             slots_append(img.start);
>> +             /* On 32-bit, reduce region size to fit within max size. */
>> +             if (IS_ENABLED(CONFIG_X86_32) &&
>> +                 region.start + region.size > KERNEL_IMAGE_SIZE)
>> +                     region.size = KERNEL_IMAGE_SIZE - region.start;
>> +
>> +             /* Return if region can't contain decompressed kernel */
>> +             if (region.size < image_size)
>> +                     return;
>> +
>> +             /* If nothing overlaps, store the region and return. */
>> +             if (!mem_avoid_overlap(&region, &overlap)) {
>> +                     store_slot_info(&region, image_size);
>> +                     return;
>> +             }
>> +
>> +             /* Store beginning of region if holds at least image_size. */
>> +             if (overlap.start > region.start + image_size) {
>> +                     struct mem_vector beginning;
>> +
>> +                     beginning.start = region.start;
>> +                     beginning.size = overlap.start - region.start;
>> +                     store_slot_info(&beginning, image_size);
>> +             }
>> +
>> +             /* Return if overlap extends to or past end of region. */
>> +             if (overlap.start + overlap.size >= region.start + region.size)
>> +                     return;
>> +
>> +             /* Clip off the overlapping region and start over. */
>> +             region.size -= overlap.start - region.start + overlap.size;
>> +             region.start = overlap.start + overlap.size;
>>       }
>>  }
>>
>> @@ -398,6 +417,10 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
>>       for (i = 0; i < boot_params->e820_entries; i++) {
>>               process_e820_entry(&boot_params->e820_map[i], minimum,
>>                                  image_size);
>> +             if (slot_area_index == MAX_SLOT_AREA) {
>> +                     debug_putstr("Aborted e820 scan (slot_areas full)!\n");
>> +                     break;
>> +             }
>>       }
>>
>>       return slots_fetch_random();
>> --
>> 2.6.3
>>



-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [kernel-hardening] Re: [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand
  2016-05-06  7:00   ` [kernel-hardening] " Ingo Molnar
@ 2016-05-06 17:44     ` Kees Cook
  0 siblings, 0 replies; 17+ messages in thread
From: Kees Cook @ 2016-05-06 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yinghai Lu, Jiri Kosina, Borislav Petkov, Baoquan He,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, Vivek Goyal,
	Andy Lutomirski, Lasse Collin, Andrew Morton, Dave Young,
	kernel-hardening, LKML

On Fri, May 6, 2016 at 12:00 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Kees Cook <keescook@chromium.org> wrote:
>
>> From: Yinghai Lu <yinghai@kernel.org>
>>
>> Currently KASLR only supports relocation in a small physical range (from
>> 16M to 1G), due to using the initial kernel page table identity mapping.
>> To support ranges above this, we need to have an identity mapping for the
>> desired memory range before we can decompress (and later run) the kernel.
>>
>> 32-bit kernels already have the needed identity mapping. This patch adds
>> identity mappings for the needed memory ranges on 64-bit kernels. This
>> happens in two possible boot paths:
>>
>> If loaded via startup_32, we need to set up the identity mapping we need.
>>
>> If loaded from a 64-bit bootloader, the bootloader will have
>> already set up an identity mapping, and we'll start via ZO's
>> (arch/x86/boot/compressed/vmlinux) startup_64. In this case, the
>> bootloader's page tables need to be avoided when we select the new VO
>> (vmlinux) location. If not, the decompressor could overwrite them during
>> decompression.
>>
>> To accomplish avoiding the bootloader's page tables, we could walk the
>> pagetable and find every page that is used, and add them to mem_avoid,
>> but this needs extra code and will require increasing the size of the
>> mem_avoid array.
>>
>> Instead, we can create a new ident mapping instead, and pages for the
>> pagetable will come from the _pagetable section of ZO, which means they
>> are already in mem_avoid array. To do this, we reuse the code for the
>> kernel's identity mapping.
>>
>> The _pgtable will be shared by 32-bit and 64-bit path to reduce init_size,
>> as now ZO _rodata to _end will contribute init_size.
>>
>> To handle the possible mappings, we need to increase the pgt buffer size:
>>
>> When booting via startup_64, as we need to cover the old VO, params,
>> cmdline and new VO. In an extreme case we could have them all beyond the
>> 512G boundary, which needs (2+2)*4 pages with 2M mappings. And we'll
>> need 2 for first 2M for VGA RAM. One more is needed for level4. This
>> gets us to 19 pages total.
>>
>> When booting via startup_32, KASLR could move the new VO above 4G, so we
>> need to set extra identity mappings for the VO, which should only need
>> (2+2) pages at most when it is beyond the 512G boundary. So 19 pages is
>> sufficient for this case as well.
>
> In changelogs and comments please refer to C functions and symbols that point to
> executable code via '()', i.e. startup_64(), etc.
>
>> +#ifdef CONFIG_X86_VERBOSE_BOOTUP
>> +     /* for video ram */
>
> Please capitalize RAM and generally free flowing comment sentences, i.e.:
>
>> +     /* For video RAM: */
>
>> +#ifdef CONFIG_X86_64
>> +void fill_pagetable(unsigned long start, unsigned long size);
>> +void switch_pagetable(void);
>
> These are very ambiguous function names. Which pagetables are these? Kernel or
> user? Is it boot time or final page tables? etc.
>
> Also what does the switching do?
>
>> --- /dev/null
>> +++ b/arch/x86/boot/compressed/misc_pgt.c
>> @@ -0,0 +1,93 @@
>> +#define __pa(x)  ((unsigned long)(x))
>> +#define __va(x)  ((void *)((unsigned long)(x)))
>> +
>> +#include "misc.h"
>> +
>> +#include <asm/init.h>
>> +#include <asm/pgtable.h>
>> +
>> +#include "../../mm/ident_map.c"
>> +#include "../string.h"
>
> Again a new .c file with no comments whatsoever :-(
>
> Also, we just decided that 'misc.c' was a bad name. Is there really no better name
> than misc_pgt.c?
>
>> +struct alloc_pgt_data {
>> +     unsigned char *pgt_buf;
>> +     unsigned long pgt_buf_size;
>> +     unsigned long pgt_buf_offset;
>> +};
>> +
>> +static void *alloc_pgt_page(void *context)
>
> Non-obvious functions with no comments describing them.
>
>> +unsigned long __force_order;
>> +static struct alloc_pgt_data pgt_data;
>> +static struct x86_mapping_info mapping_info;
>> +static pgd_t *level4p;
>
> What's this __force_order flag? Why does it have double underscores?
>
>> +{
>> +     unsigned long end = start + size;
>> +
>> +     if (!level4p) {
>> +             pgt_data.pgt_buf_offset = 0;
>> +             mapping_info.alloc_pgt_page = alloc_pgt_page;
>> +             mapping_info.context = &pgt_data;
>> +             mapping_info.pmd_flag = __PAGE_KERNEL_LARGE_EXEC;
>> +
>> +             /*
>> +              * come from startup_32 ?
>> +              * then cr3 is _pgtable, we can reuse it.
>
> come what? What does this comment mean??
>
>> +              */
>> +             level4p = (pgd_t *)read_cr3();
>
> Argh, another type cast. A quick check shows:
>
> +static pgd_t *level4p;
> +       if (!level4p) {
> +               level4p = (pgd_t *)read_cr3();
> +               if ((unsigned long)level4p == (unsigned long)_pgtable) {
> +                       level4p = (pgd_t *)alloc_pgt_page(&pgt_data);
> +       kernel_ident_mapping_init(&mapping_info, level4p, start, end);
> +       write_cr3((unsigned long)level4p);
>
> ... that the natural type for level4p would be unsigned long, not pgt_t ...
>
>
>> +             if ((unsigned long)level4p == (unsigned long)_pgtable) {
>> +                     pgt_data.pgt_buf = (unsigned char *)_pgtable +
>> +                                              BOOT_INIT_PGT_SIZE;
>> +                     pgt_data.pgt_buf_size = BOOT_PGT_SIZE -
>> +                                              BOOT_INIT_PGT_SIZE;
>> +                     memset((unsigned char *)pgt_data.pgt_buf, 0,
>> +                             pgt_data.pgt_buf_size);
>
> Is that (unsigned char *) type cast really necessary??
>
> Type casts are the absolute exception, we avoid them whenever possible - and this
> code is using them like candy :-(
>
> Also, for heaven's sake, either ignore checkpatch.pl whining, or avoid the
> excessive indentation by using helper functions (or some other technique).
>
>> +                     debug_putstr("boot via startup_32\n");
>> +             } else {
>> +                     pgt_data.pgt_buf = (unsigned char *)_pgtable;
>> +                     pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
>> +                     memset((unsigned char *)pgt_data.pgt_buf, 0,
>> +                             pgt_data.pgt_buf_size);
>> +                     debug_putstr("boot via startup_64\n");
>> +                     level4p = (pgd_t *)alloc_pgt_page(&pgt_data);
>> +             }
>> +     }
>> +
>> +     /* align boundary to 2M */
>> +     start = round_down(start, PMD_SIZE);
>> +     end = round_up(end, PMD_SIZE);
>> +     if (start >= end)
>> +             return;
>> +
>> +     kernel_ident_mapping_init(&mapping_info, level4p, start, end);
>> +}
>> +
>> +void switch_pagetable(void)
>> +{
>> +     write_cr3((unsigned long)level4p);
>> +}
>> diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
>> index 6b8d6e8cd449..52a9cbc1419f 100644
>> --- a/arch/x86/include/asm/boot.h
>> +++ b/arch/x86/include/asm/boot.h
>> @@ -32,7 +32,26 @@
>>  #endif /* !CONFIG_KERNEL_BZIP2 */
>>
>>  #ifdef CONFIG_X86_64
>> +
>>  #define BOOT_STACK_SIZE      0x4000
>> +
>> +#define BOOT_INIT_PGT_SIZE (6*4096)
>> +#ifdef CONFIG_RANDOMIZE_BASE
>> +/*
>> + * 1 page for level4, 2 pages for first 2M.
>> + * (2+2)*4 pages for kernel, param, cmd_line, random kernel
>> + * if all cross 512G boundary.
>> + * So total will be 19 pages.
>> + */
>> +#ifdef CONFIG_X86_VERBOSE_BOOTUP
>> +#define BOOT_PGT_SIZE (19*4096)
>> +#else
>> +#define BOOT_PGT_SIZE (17*4096)
>> +#endif
>> +#else
>> +#define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
>> +#endif
>
> Please use proper nesting of defines to make it more readable, i.e. something
> like:
>
> #define BOOT_INIT_PGT_SIZE (6*4096)
>
> #ifdef CONFIG_RANDOMIZE_BASE
> /*
>  * 1 page for level4, 2 pages for first 2M.
>  * (2+2)*4 pages for kernel, param, cmd_line, random kernel
>  * if all cross 512G boundary.
>  * So total will be 19 pages.
>  */
> # ifdef CONFIG_X86_VERBOSE_BOOTUP
> #  define BOOT_PGT_SIZE (19*4096)
> # else
> #  define BOOT_PGT_SIZE (17*4096)
> # endif
> #else
> # define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
> #endif
>
> but more importantly, BOOT_PGT_SIZE is really a bad name for this. Since it's used
> by an allocator it's clear that this is a _maximum_. Why not put that fact into
> the name??
>
> Basically you took a butt-ugly patch from Yinghai that shat all over the kernel
> and rewrote its changelog - but that's not enough to make it acceptable! As a
> general rule, most Yinghai patches I've seen in the past couple of years were
> butt-ugly. If you take a patch from Yinghai you have to completely rewrite it in
> most cases.
>
> So I'm really annoyed, I see myself repeating the same kind of review feedback I
> gave to Yinghai again and again, and now to you?
>
> If it's less work for you then please rewrite Yinghai's patches from scratch - and
> possibly split them up as well wherever possible, as they are risky as hell.

Okay, noted. I'll rewrite them. And really, at this point, I've mostly
rewritten half of them as it is, so easier to just keep going with it.

> I've applied the first two patches because they look OK, but _please_ do a proper
> job with the rest of the series as well!

Okay, I'll give it a shot!

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [kernel-hardening] Re: [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G
  2016-05-06 15:31     ` Kees Cook
@ 2016-05-08  9:17       ` Baoquan He
  0 siblings, 0 replies; 17+ messages in thread
From: Baoquan He @ 2016-05-08  9:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ingo Molnar, Ingo Molnar, Yinghai Lu, H. Peter Anvin,
	Borislav Petkov, Vivek Goyal, Andy Lutomirski, Lasse Collin,
	Andrew Morton, Dave Young, kernel-hardening, LKML

On 05/06/16 at 08:31am, Kees Cook wrote:
> On Fri, May 6, 2016 at 1:27 AM, Baoquan He <bhe@redhat.com> wrote:
> > Hi Kees,
> >
> > On 05/05/16 at 03:13pm, Kees Cook wrote:
> >> From: Baoquan He <bhe@redhat.com>
> >>
> >> This patch exchanges the prior slots[] array for the new slot_areas[]
> >> array, and lifts the limitation of KERNEL_IMAGE_SIZE on the physical
> >> address offset for 64-bit. As before, process_e820_entry() walks
> >
> > Sorry, I didn't get what you said about lifting the limitation for
> > 64-bit. Do you mean 32-bit?
> 
> Prior to this patch, physical address randomization was limited to
> KERNEL_IMAGE_SIZE on x86_64. After this patch, that limit is lifted
> for x86_64, while 32-bit remains limited to KERNEL_IMAGE_SIZE as the
> upper bound on the physical position.
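
(Editorial aside: a minimal sketch of the bound change described above,
assuming a hypothetical helper top_of_usable_ram() standing in for "the
highest usable address found in the e820 map"; this is not code from the
patch.)

    /* Upper bound on the randomized physical placement, per the text above. */
    static unsigned long phys_rand_limit(void)
    {
    #ifdef CONFIG_X86_64
            return top_of_usable_ram();     /* hypothetical: e820-derived, up to ~64TB */
    #else
            return KERNEL_IMAGE_SIZE;       /* 32-bit: phys selection doubles as virt */
    #endif
    }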

Ah, yes. This lifts the limit on the physical address from
KERNEL_IMAGE_SIZE to the maximum available physical address. I must
have been dizzy, please forgive my noise.

> 
> >
> > Thanks
> > Baoquan
> >
> >
> >> memory and populates slot_areas[], splitting on any detected mem_avoid
> >> collisions.
> >>
> >>  Finally, since the slots[] array and its associated functions are no
> >>  longer needed, they are removed.
> >>
> >> Signed-off-by: Baoquan He <bhe@redhat.com>
> >> [kees: rewrote changelog, refactored goto into while]
> >> [kees: limit 32-bit to KERNEL_IMAGE_SIZE]
> >> [kees: squash dead-code removal into this patch]
> >> [kees: refactored to use new mem_overlap code, renamed variables]
> >> [kees: updated Kconfig to reflect updated implementation]
> >> Signed-off-by: Kees Cook <keescook@chromium.org>
> >> ---
> >>  arch/x86/Kconfig                 |  27 +++++----
> >>  arch/x86/boot/compressed/kaslr.c | 115 +++++++++++++++++++++++----------------
> >>  2 files changed, 85 insertions(+), 57 deletions(-)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index 5892d549596d..39be0f7b49ef 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -1943,21 +1943,26 @@ config RANDOMIZE_BASE
> >>         attempts relying on knowledge of the location of kernel
> >>         code internals.
> >>
> >> -       The kernel physical and virtual address can be randomized
> >> -       from 16MB up to 1GB on 64-bit and 512MB on 32-bit. (Note that
> >> -       using RANDOMIZE_BASE reduces the memory space available to
> >> -       kernel modules from 1.5GB to 1GB.)
> >> +       On 64-bit, the kernel physical and virtual addresses are
> >> +       randomized separately. The physical address will be anywhere
> >> +       between 16MB and the top of physical memory (up to 64TB). The
> >> +       virtual address will be randomized from 16MB up to 1GB (9 bits
> >> +       of entropy). Note that this also reduces the memory space
> >> +       available to kernel modules from 1.5GB to 1GB.
> >> +
> >> +       On 32-bit, the kernel physical and virtual addresses are
> >> +       randomized together. They will be randomized from 16MB up to
> >> +       512MB (8 bits of entropy).
> >>
> >>         Entropy is generated using the RDRAND instruction if it is
> >>         supported. If RDTSC is supported, its value is mixed into
> >>         the entropy pool as well. If neither RDRAND nor RDTSC are
> >> -       supported, then entropy is read from the i8254 timer.
> >> -
> >> -       Since the kernel is built using 2GB addressing, and
> >> -       PHYSICAL_ALIGN must be at a minimum of 2MB, only 10 bits of
> >> -       entropy is theoretically possible. Currently, with the
> >> -       default value for PHYSICAL_ALIGN and due to page table
> >> -       layouts, 64-bit uses 9 bits of entropy and 32-bit uses 8 bits.
> >> +       supported, then entropy is read from the i8254 timer. The
> >> +       usable entropy is limited by the kernel being built using
> >> +       2GB addressing and by PHYSICAL_ALIGN having a minimum of
> >> +       2MB. As a result, only 10 bits of entropy are theoretically
> >> +       possible, but the implementations are further limited by
> >> +       memory layouts.
> >>
> >>         If CONFIG_HIBERNATE is also enabled, KASLR is disabled at boot
> >>         time. To enable it, boot with "kaslr" on the kernel command
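
(Editorial aside: a back-of-the-envelope check of the entropy figures in the
help text above; the slot counts are derived from the stated ranges and the
default 2MB CONFIG_PHYSICAL_ALIGN, and are ours rather than part of the
patch.)

    /*
     *   2GB addressing / 2MB alignment        = 1024 slots -> 10 bits (theoretical max)
     *   64-bit virtual: (1GB - 16MB) / 2MB    =  504 slots -> ~9 bits
     *   32-bit:         (512MB - 16MB) / 2MB  =  248 slots -> ~8 bits
     */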
> >> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> >> index 58d0871be037..8cf705826bdc 100644
> >> --- a/arch/x86/boot/compressed/kaslr.c
> >> +++ b/arch/x86/boot/compressed/kaslr.c
> >> @@ -125,17 +125,6 @@ struct mem_vector {
> >>  #define MEM_AVOID_MAX 4
> >>  static struct mem_vector mem_avoid[MEM_AVOID_MAX];
> >>
> >> -static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
> >> -{
> >> -     /* Item at least partially before region. */
> >> -     if (item->start < region->start)
> >> -             return false;
> >> -     /* Item at least partially after region. */
> >> -     if (item->start + item->size > region->start + region->size)
> >> -             return false;
> >> -     return true;
> >> -}
> >> -
> >>  static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
> >>  {
> >>       /* Item one is entirely before item two. */
> >> @@ -286,8 +275,6 @@ static bool mem_avoid_overlap(struct mem_vector *img,
> >>       return is_overlapping;
> >>  }
> >>
> >> -static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
> >> -
> >>  struct slot_area {
> >>       unsigned long addr;
> >>       int num;
> >> @@ -318,36 +305,44 @@ static void store_slot_info(struct mem_vector *region, unsigned long image_size)
> >>       }
> >>  }
> >>
> >> -static void slots_append(unsigned long addr)
> >> -{
> >> -     /* Overflowing the slots list should be impossible. */
> >> -     if (slot_max >= KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN)
> >> -             return;
> >> -
> >> -     slots[slot_max++] = addr;
> >> -}
> >> -
> >>  static unsigned long slots_fetch_random(void)
> >>  {
> >> +     unsigned long slot;
> >> +     int i;
> >> +
> >>       /* Handle case of no slots stored. */
> >>       if (slot_max == 0)
> >>               return 0;
> >>
> >> -     return slots[get_random_long("Physical") % slot_max];
> >> +     slot = get_random_long("Physical") % slot_max;
> >> +
> >> +     for (i = 0; i < slot_area_index; i++) {
> >> +             if (slot >= slot_areas[i].num) {
> >> +                     slot -= slot_areas[i].num;
> >> +                     continue;
> >> +             }
> >> +             return slot_areas[i].addr + slot * CONFIG_PHYSICAL_ALIGN;
> >> +     }
> >> +
> >> +     if (i == slot_area_index)
> >> +             debug_putstr("slots_fetch_random() failed!?\n");
> >> +     return 0;
> >>  }
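
(Editorial aside: a small worked example of the flat-index-to-address mapping
in slots_fetch_random() above; the concrete numbers are invented for
illustration.)

    /*
     * Suppose two slot areas were stored:
     *   slot_areas[0] = { .addr = 0x01000000, .num = 100 }
     *   slot_areas[1] = { .addr = 0x40000000, .num =  50 }
     * so slot_max = 150, and the random draw is slot = 120.
     *
     * The loop subtracts 100 while skipping area 0, leaving slot = 20 in
     * area 1, so the chosen address is:
     *   0x40000000 + 20 * CONFIG_PHYSICAL_ALIGN  (= 0x42800000 with 2MB align)
     */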
> >>
> >>  static void process_e820_entry(struct e820entry *entry,
> >>                              unsigned long minimum,
> >>                              unsigned long image_size)
> >>  {
> >> -     struct mem_vector region, img, overlap;
> >> +     struct mem_vector region, overlap;
> >> +     struct slot_area slot_area;
> >> +     unsigned long start_orig;
> >>
> >>       /* Skip non-RAM entries. */
> >>       if (entry->type != E820_RAM)
> >>               return;
> >>
> >> -     /* Ignore entries entirely above our maximum. */
> >> -     if (entry->addr >= KERNEL_IMAGE_SIZE)
> >> +     /* On 32-bit, ignore entries entirely above our maximum. */
> >> +     if (IS_ENABLED(CONFIG_X86_32) && entry->addr >= KERNEL_IMAGE_SIZE)
> >>               return;
> 
> Since physical and virtual are not separated on 32-bit, the physical
> address selection effectively becomes the virtual address selection
> as well, and as such it cannot be above KERNEL_IMAGE_SIZE.
> 
> -Kees
> 
> >>
> >>       /* Ignore entries entirely below our minimum. */
> >> @@ -357,31 +352,55 @@ static void process_e820_entry(struct e820entry *entry,
> >>       region.start = entry->addr;
> >>       region.size = entry->size;
> >>
> >> -     /* Potentially raise address to minimum location. */
> >> -     if (region.start < minimum)
> >> -             region.start = minimum;
> >> +     /* Give up if slot area array is full. */
> >> +     while (slot_area_index < MAX_SLOT_AREA) {
> >> +             start_orig = region.start;
> >>
> >> -     /* Potentially raise address to meet alignment requirements. */
> >> -     region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
> >> +             /* Potentially raise address to minimum location. */
> >> +             if (region.start < minimum)
> >> +                     region.start = minimum;
> >>
> >> -     /* Did we raise the address above the bounds of this e820 region? */
> >> -     if (region.start > entry->addr + entry->size)
> >> -             return;
> >> +             /* Potentially raise address to meet alignment needs. */
> >> +             region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
> >>
> >> -     /* Reduce size by any delta from the original address. */
> >> -     region.size -= region.start - entry->addr;
> >> +             /* Did we raise the address above this e820 region? */
> >> +             if (region.start > entry->addr + entry->size)
> >> +                     return;
> >>
> >> -     /* Reduce maximum size to fit end of image within maximum limit. */
> >> -     if (region.start + region.size > KERNEL_IMAGE_SIZE)
> >> -             region.size = KERNEL_IMAGE_SIZE - region.start;
> >> +             /* Reduce size by any delta from the original address. */
> >> +             region.size -= region.start - start_orig;
> >>
> >> -     /* Walk each aligned slot and check for avoided areas. */
> >> -     for (img.start = region.start, img.size = image_size ;
> >> -          mem_contains(&region, &img) ;
> >> -          img.start += CONFIG_PHYSICAL_ALIGN) {
> >> -             if (mem_avoid_overlap(&img, &overlap))
> >> -                     continue;
> >> -             slots_append(img.start);
> >> +             /* On 32-bit, reduce region size to fit within max size. */
> >> +             if (IS_ENABLED(CONFIG_X86_32) &&
> >> +                 region.start + region.size > KERNEL_IMAGE_SIZE)
> >> +                     region.size = KERNEL_IMAGE_SIZE - region.start;
> >> +
> >> +             /* Return if region can't contain decompressed kernel */
> >> +             if (region.size < image_size)
> >> +                     return;
> >> +
> >> +             /* If nothing overlaps, store the region and return. */
> >> +             if (!mem_avoid_overlap(&region, &overlap)) {
> >> +                     store_slot_info(&region, image_size);
> >> +                     return;
> >> +             }
> >> +
> >> +             /* Store beginning of region if holds at least image_size. */
> >> +             if (overlap.start > region.start + image_size) {
> >> +                     struct mem_vector beginning;
> >> +
> >> +                     beginning.start = region.start;
> >> +                     beginning.size = overlap.start - region.start;
> >> +                     store_slot_info(&beginning, image_size);
> >> +             }
> >> +
> >> +             /* Return if overlap extends to or past end of region. */
> >> +             if (overlap.start + overlap.size >= region.start + region.size)
> >> +                     return;
> >> +
> >> +             /* Clip off the overlapping region and start over. */
> >> +             region.size -= overlap.start - region.start + overlap.size;
> >> +             region.start = overlap.start + overlap.size;
> >>       }
> >>  }
> >>
> >> @@ -398,6 +417,10 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
> >>       for (i = 0; i < boot_params->e820_entries; i++) {
> >>               process_e820_entry(&boot_params->e820_map[i], minimum,
> >>                                  image_size);
> >> +             if (slot_area_index == MAX_SLOT_AREA) {
> >> +                     debug_putstr("Aborted e820 scan (slot_areas full)!\n");
> >> +                     break;
> >> +             }
> >>       }
> >>
> >>       return slots_fetch_random();
> >> --
> >> 2.6.3
> >>
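
(Editorial aside: a hedged walk-through of the region-splitting loop in
process_e820_entry() above, with invented numbers.)

    /*
     * Example: an e820 RAM region [16MB, 1GB), image_size = 32MB, and one
     * mem_avoid entry (say the initrd) occupying [200MB, 220MB):
     *
     *  1. region = [16MB, 1GB) overlaps the avoided area; the overlap
     *     starts more than image_size past region.start, so the beginning
     *     [16MB, 200MB) is handed to store_slot_info().
     *  2. The overlap ends before the end of the region, so the region is
     *     clipped to [220MB, 1GB) and the loop runs again.
     *  3. Nothing else overlaps, so [220MB, 1GB) is stored via
     *     store_slot_info() and the function returns.
     */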
> 
> 
> 
> -- 
> Kees Cook
> Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-05-08  9:17 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-05 22:13 [kernel-hardening] [PATCH v6 0/11] x86/KASLR: Randomize virtual address separately Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 01/11] x86/boot: Clean up pointer casting Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 02/11] x86/KASLR: Consolidate mem_avoid entries Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 03/11] x86/boot: Split out kernel_ident_mapping_init Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 04/11] x86/KASLR: Build identity mappings on demand Kees Cook
2016-05-06  7:00   ` [kernel-hardening] " Ingo Molnar
2016-05-06 17:44     ` Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 05/11] x86/KASLR: Add slot_area to manage random_addr slots Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 06/11] x86/KASLR: Return earliest overlap when avoiding regions Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 07/11] x86/KASLR: Add virtual address choosing function Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 08/11] x86/KASLR: Clarify purpose of each get_random_long Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 09/11] x86/KASLR: Randomize virtual address separately Kees Cook
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 10/11] x86/KASLR: Add physical address randomization >4G Kees Cook
2016-05-06  8:27   ` [kernel-hardening] " Baoquan He
2016-05-06 15:31     ` Kees Cook
2016-05-08  9:17       ` Baoquan He
2016-05-05 22:13 ` [kernel-hardening] [PATCH v6 11/11] x86/KASLR: Allow randomization below load address Kees Cook
