* [PATCH v6 0/3] virtually mapped stacks
@ 2016-08-11  9:35 Andy Lutomirski
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-11  9:35 UTC (permalink / raw)
  To: x86; +Cc: Borislav Petkov, linux-kernel, Brian Gerst, Andy Lutomirski

Hi all-

Since the dawn of time, a kernel stack overflow has been a real PITA
to debug, has caused nondeterministic crashes some time after the
actual overflow, and has generally been easy to exploit for root.

With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
that enable it (just x86 for now) get virtually mapped stacks with
guard pages.  This causes reliable faults when the stack overflows.

If the arch implements it well, we get a nice OOPS on stack overflow
(as opposed to panicking directly or otherwise exploding badly).  On
x86, the OOPS is nice, has a usable call trace, and the overflowing
task is killed cleanly.

This does not address interrupt stacks.

It's worth noting that s390 has an arch-specific gcc feature that
detects stack overflows by adjusting function prologues.  Arches
with features like that may wish to avoid using vmapped stacks to
avoid any possible performance hit.

Ingo, I think it may make sense to apply this in its own branch in
-tip.  By itself, it hurts performance a bit, but the next series
(moving thread_info into task_struct and caching stacks) appears to
mostly eliminate the slowdown and to actually speed up my benchmark
compared to the status quo.  I'm sending this separately because it's
logically separate and this may ease the cognitive load a bit.

The 0day bot is chewing on this as we speak.

Known issues:
- virtio_console and wusb will have issues.  Michael
  Tsirkin says he'll fix virtio_console.  Herbert Xu or I will fix wusb,
  although I'm not convinced that wusb hardware exists.

Changes from v5:
 - Rebase to 4.8-rc1
 - Separate out the vmapped stack bit, which can be applied by itself
   now that most of the preparatory work has landed in 4.8-rc1.
 - Default to Y (Ingo)

Changes from v4:
 - Fix kthread (Oleg)
 - Tidy up some changelogs and fold some patches (Borislav, Josh)
 - Add "x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm"
 - Make VMAP_STACK depend on !KASAN (not worth waiting for the fix, I think)

Changes from v3:
 - Minor cleanups
 - Rebased onto Linus' tree
 - All the thread_info stuff is new

Changes from v2:
 - Delete kernel_unmap_pages_in_pgd rather than hardening it (Borislav)
 - Fix sub-page stack accounting better (Josh)

Changes from v1:
 - Fix rewind_stack_and_do_exit (Josh)
 - Fix deadlock under load
 - Clean up generic stack vmalloc code
 - Many other minor fixes

Andy Lutomirski (3):
  fork: Add generic vmalloced stack support
  dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  x86/mm/64: Enable vmapped stacks

 arch/Kconfig                        | 34 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 arch/x86/Kconfig                    |  1 +
 arch/x86/include/asm/switch_to.h    | 28 ++++++++++-
 arch/x86/kernel/traps.c             | 62 ++++++++++++++++++++++++
 arch/x86/mm/tlb.c                   | 15 ++++++
 include/linux/sched.h               | 15 ++++++
 kernel/fork.c                       | 96 +++++++++++++++++++++++++++++--------
 lib/dma-debug.c                     | 39 ++++++++++++---
 9 files changed, 264 insertions(+), 28 deletions(-)

-- 
2.7.4


* [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-11  9:35 [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
@ 2016-08-11  9:35 ` Andy Lutomirski
  2016-08-15 11:55   ` Michal Hocko
                     ` (3 more replies)
  2016-08-11  9:35 ` [PATCH v6 2/3] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-11  9:35 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Andy Lutomirski,
	Oleg Nesterov

If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
vmalloc_node.

grsecurity has had a similar feature (called
GRKERNSEC_KSTACKOVERFLOW) for a long time.

Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/Kconfig                        | 34 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 ++++++
 kernel/fork.c                       | 96 +++++++++++++++++++++++++++++--------
 4 files changed, 126 insertions(+), 21 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index bd8056b5b246..2a7f5b62e716 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -698,4 +698,38 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to and switch_mm,
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	default y
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK && !KASAN
+	---help---
+	  Enable this if you want to use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
+	  This is presently incompatible with KASAN because KASAN expects
+	  the stack to map directly to the KASAN shadow map using a formula
+	  that is incorrect if the stack is in vmalloc space.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index 29bd59790d6c..c7026429816b 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_stack_node(tsk, node)	((unsigned long *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_stack(ti)	/* nothing */
+#define free_thread_stack(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e513e39..20f9f47bcfd0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1923,6 +1923,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1939,6 +1942,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d4a866..05f7ef796fb4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,39 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
-						  int node)
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
+static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
+					   VMALLOC_START, VMALLOC_END,
+					   THREADINFO_GFP | __GFP_HIGHMEM,
+					   PAGE_KERNEL,
+					   0, node,
+					   __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_stack can be called in interrupt context, so cache
+	 * the vm_struct.
+	 */
+	if (stack)
+		tsk->stack_vm_area = find_vm_area(stack);
+	return stack;
+#else
 	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_stack(unsigned long *stack)
+static inline void free_thread_stack(struct task_struct *tsk)
 {
-	__free_pages(virt_to_page(stack), THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
@@ -181,9 +201,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_stack(unsigned long *stack)
+static void free_thread_stack(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_stack_cache, stack);
+	kmem_cache_free(thread_stack_cache, tsk->stack);
 }
 
 void thread_stack_cache_init(void)
@@ -213,24 +233,47 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(unsigned long *stack, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	/* All stack pages are in the same zone and belong to the same memcg. */
-	struct page *first_page = virt_to_page(stack);
+	void *stack = task_stack_page(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
+
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
+
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    NR_KERNEL_STACK_KB,
+					    PAGE_SIZE / 1024 * account);
+		}
 
-	mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(vm->pages[0], MEMCG_KERNEL_STACK_KB,
+					    account * (THREAD_SIZE / 1024));
+	} else {
+		/*
+		 * All stack pages are in the same zone and belong to the
+		 * same memcg.
+		 */
+		struct page *first_page = virt_to_page(stack);
+
+		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
 
-	memcg_kmem_update_page_stat(
-		first_page, MEMCG_KERNEL_STACK_KB,
-		account * (THREAD_SIZE / 1024));
+		memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB,
+					    account * (THREAD_SIZE / 1024));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_stack(tsk->stack);
-	free_thread_stack(tsk->stack);
+	free_thread_stack(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +385,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	unsigned long *stack;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +398,23 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!stack)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
+
+	/*
+	 * arch_dup_task_struct clobbers the stack-related fields.  Make
+	 * sure they're properly initialized before using any stack-related
+	 * functions again.
+	 */
+	tsk->stack = stack;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
+
 	if (err)
 		goto free_stack;
 
-	tsk->stack = stack;
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +446,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(stack, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_stack:
-	free_thread_stack(stack);
+	free_thread_stack(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
-- 
2.7.4


* [PATCH v6 2/3] dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  2016-08-11  9:35 [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-08-11  9:35 ` Andy Lutomirski
  2016-08-24 13:02   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2016-08-11  9:35 ` [PATCH v6 3/3] x86/mm/64: Enable " Andy Lutomirski
  2016-08-20 18:06 ` [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
  3 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-11  9:35 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Andy Lutomirski,
	Andrew Morton, Arnd Bergmann

If we're using CONFIG_VMAP_STACK and we manage to point an sg entry
at the stack, then either the sg page will be in highmem or sg_virt
will return the direct-map alias.  In neither case will the existing
check_for_stack() implementation realize that it's a stack page.

Fix it by explicitly checking for stack pages.

This has no effect by itself.  It's broken out for ease of review.
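
The vmapped-stack branch added here has to scan the stack's page array, because a vmalloced stack's backing pages are an arbitrary set with no address relationship to the stack's virtual range.  A hypothetical userspace analogue of that loop (names are mine, not the kernel's):

```c
#include <stddef.h>

/* Analogue of the vmapped branch of check_for_stack(): given the
 * stack's backing pages in stack order, return the probable byte
 * offset into the stack of a DMA target page, or -1 if the page is
 * not a stack page at all. */
static long stack_offset(void **stack_pages, size_t nr_pages,
			 void *page, size_t offset, size_t page_size)
{
	size_t i;

	for (i = 0; i < nr_pages; i++)
		if (stack_pages[i] == page)
			return (long)(i * page_size + offset);
	return -1;
}
```

The "probable" hedge matches the patch's own message text: a page could in principle be mapped in more than one place, so the reconstructed address is a best guess for the diagnostic.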

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 lib/dma-debug.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index fcfa1939ac41..ca56d7212e8b 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -22,6 +22,7 @@
 #include <linux/stacktrace.h>
 #include <linux/dma-debug.h>
 #include <linux/spinlock.h>
+#include <linux/vmalloc.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
 #include <linux/export.h>
@@ -1164,11 +1165,35 @@ static void check_unmap(struct dma_debug_entry *ref)
 	put_hash_bucket(bucket, &flags);
 }
 
-static void check_for_stack(struct device *dev, void *addr)
+static void check_for_stack(struct device *dev,
+			    struct page *page, size_t offset)
 {
-	if (object_is_on_stack(addr))
-		err_printk(dev, NULL, "DMA-API: device driver maps memory from "
-				"stack [addr=%p]\n", addr);
+	void *addr;
+	struct vm_struct *stack_vm_area = task_stack_vm_area(current);
+
+	if (!stack_vm_area) {
+		/* Stack is direct-mapped. */
+		if (PageHighMem(page))
+			return;
+		addr = page_address(page) + offset;
+		if (object_is_on_stack(addr))
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [addr=%p]\n",
+				   addr);
+	} else {
+		/* Stack is vmalloced. */
+		int i;
+
+		for (i = 0; i < stack_vm_area->nr_pages; i++) {
+			if (page != stack_vm_area->pages[i])
+				continue;
+
+			addr = (u8 *)current->stack + i * PAGE_SIZE +
+				offset;
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [probable addr=%p]\n",
+				   addr);
+			break;
+		}
+	}
 }
 
 static inline bool overlap(void *addr, unsigned long len, void *start, void *end)
@@ -1291,10 +1316,11 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
 	if (map_single)
 		entry->type = dma_debug_single;
 
+	check_for_stack(dev, page, offset);
+
 	if (!PageHighMem(page)) {
 		void *addr = page_address(page) + offset;
 
-		check_for_stack(dev, addr);
 		check_for_illegal_area(dev, addr, size);
 	}
 
@@ -1386,8 +1412,9 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		entry->sg_call_ents   = nents;
 		entry->sg_mapped_ents = mapped_ents;
 
+		check_for_stack(dev, sg_page(s), s->offset);
+
 		if (!PageHighMem(sg_page(s))) {
-			check_for_stack(dev, sg_virt(s));
 			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
 		}
 
-- 
2.7.4


* [PATCH v6 3/3] x86/mm/64: Enable vmapped stacks
  2016-08-11  9:35 [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
  2016-08-11  9:35 ` [PATCH v6 2/3] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
@ 2016-08-11  9:35 ` Andy Lutomirski
  2016-08-24 13:03   ` [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y) tip-bot for Andy Lutomirski
  2016-08-20 18:06 ` [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
  3 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-11  9:35 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Andy Lutomirski, Nadav Amit

This allows x86_64 kernels to enable vmapped stacks.  There are a
couple of interesting bits.

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.
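
The probe itself is just a single volatile read at the new stack pointer; touching the stack from a context where #PF is still survivable lets vmalloc_fault() populate the paging entries before the switch.  A userspace sketch of the same "touch it before you depend on it" pattern, generalized to one read per page:

```c
#include <stddef.h>

/* READ_ONCE-style probe: a volatile read the compiler cannot elide. */
#define PROBE(p) (*(volatile const unsigned char *)(p))

/* Walk from the stack pointer down to the stack base, touching one
 * byte per page, and return the number of pages probed.  The kernel
 * needs only the single read at sp (the fault handler fills in the
 * rest); probing every page is shown here just to make the effect
 * observable. */
static size_t probe_pages(const char *sp, const char *stack_base,
			  size_t page_size)
{
	const char *p = sp;
	size_t n = 0;

	for (;;) {
		PROBE(p);
		n++;
		if ((size_t)(p - stack_base) < page_size)
			break;
		p -= page_size;
	}
	return n;
}
```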

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and then oops without reliably triggering a
double-fault, and no_context doesn't know about stack overflows.
The next patch will improve that case.

Thank you to Nadav and Brian for helping me pay enough attention to
the SDM to hopefully get this right.
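
The double-fault handler below decides "stack overflow" by checking whether CR2 falls in the page just below the stack, using a single unsigned subtraction that wraps for any address at or above the stack base.  That comparison can be sketched and unit-tested in isolation (constant and function names here are mine):

```c
#include <stdbool.h>
#include <stdint.h>

#define GUARD_PAGE_SIZE 4096ul

/* Mirrors the check added to do_double_fault(): CR2 is in the guard
 * page iff (stack_base - 1 - cr2) is an unsigned value below the page
 * size.  For cr2 >= stack_base the subtraction wraps to a huge value,
 * so addresses inside or above the stack are rejected for free. */
static bool in_guard_page(uintptr_t stack_base, uintptr_t cr2)
{
	return stack_base - 1 - cr2 < GUARD_PAGE_SIZE;
}
```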

Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++-
 arch/x86/kernel/traps.c          | 62 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 ++++++++++
 4 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5c6e7471b732..1636ab3c5100 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1b03a1..14e4b20f0aaf 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF will
+	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b70ca12dd389..636d94681e02 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,50 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  Similarly, if we
+	 * take any non-IST exception while too close to the bottom of
+	 * the stack, the processor will get a page fault while
+	 * delivering the exception and will generate a double fault.
+	 *
+	 * According to the SDM (footnote in 6.15 under "Interrupt 14 -
+	 * Page-Fault Exception (#PF)"):
+	 *
+	 *   Processors update CR2 whenever a page fault is detected. If a
+	 *   second page fault occurs while an earlier page fault is being
+	 *   delivered, the faulting linear address of the second fault will
+	 *   overwrite the contents of CR2 (replacing the previous
+	 *   address). These updates to CR2 occur even if the page fault
+	 *   results in a double fault or occurs during the delivery of a
+	 *   double fault.
+	 *
+	 * The logic below has a small possibility of incorrectly diagnosing
+	 * some errors as stack overflows.  For example, if the IDT or GDT
+	 * gets corrupted such that #GP delivery fails due to a bad descriptor
+	 * causing #GP and we hit this condition while CR2 coincidentally
+	 * points to the stack guard page, we'll think we overflowed the
+	 * stack.  Given that we're going to panic one way or another
+	 * if this happens, this isn't necessarily worth fixing.
+	 *
+	 * If necessary, we could improve the test by only diagnosing
+	 * a stack overflow if the saved RSP points within 47 bytes of
+	 * the bottom of the stack: if RSP == tsk_stack + 48 and we
+	 * take an exception, the stack is already aligned and there
+	 * will be enough room for SS, RSP, RFLAGS, CS, RIP, and a
+	 * possible error code, so a stack overflow would *not* double
+	 * fault.  With any less space left, exception delivery could
+	 * fail, and, as a practical matter, we've overflowed the
+	 * stack even if the actual trigger for the double fault was
+	 * something else.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)task_stack_page(tsk) - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow(
+			"kernel stack overflow (double-fault)", regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4dbe65622810..43f89e085557 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index =
+				pgd_index(current_stack_pointer());
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*
-- 
2.7.4


* Re: [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
@ 2016-08-15 11:55   ` Michal Hocko
  2016-08-24 10:03   ` Ingo Molnar
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2016-08-15 11:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Oleg Nesterov

On Thu 11-08-16 02:35:21, Andy Lutomirski wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.
> 
> grsecurity has had a similar feature (called
> GRKERNSEC_KSTACKOVERFLOW) for a long time.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Looks good to me.
FWIW
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  arch/Kconfig                        | 34 +++++++++++++
>  arch/ia64/include/asm/thread_info.h |  2 +-
>  include/linux/sched.h               | 15 ++++++
>  kernel/fork.c                       | 96 +++++++++++++++++++++++++++++--------
>  4 files changed, 126 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index bd8056b5b246..2a7f5b62e716 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -698,4 +698,38 @@ config ARCH_NO_COHERENT_DMA_MMAP
>  config CPU_NO_EFFICIENT_FFS
>  	def_bool n
>  
> +config HAVE_ARCH_VMAP_STACK
> +	def_bool n
> +	help
> +	  An arch should select this symbol if it can support kernel stacks
> +	  in vmalloc space.  This means:
> +
> +	  - vmalloc space must be large enough to hold many kernel stacks.
> +	    This may rule out many 32-bit architectures.
> +
> +	  - Stacks in vmalloc space need to work reliably.  For example, if
> +	    vmap page tables are created on demand, either this mechanism
> +	    needs to work while the stack points to a virtual address with
> +	    unpopulated page tables or arch code (switch_to and switch_mm,
> +	    most likely) needs to ensure that the stack's page table entries
> +	    are populated before running on a possibly unpopulated stack.
> +
> +	  - If the stack overflows into a guard page, something reasonable
> +	    should happen.  The definition of "reasonable" is flexible, but
> +	    instantly rebooting without logging anything would be unfriendly.
> +
> +config VMAP_STACK
> +	default y
> +	bool "Use a virtually-mapped stack"
> +	depends on HAVE_ARCH_VMAP_STACK && !KASAN
> +	---help---
> +	  Enable this if you want to use virtually-mapped kernel stacks
> +	  with guard pages.  This causes kernel stack overflows to be
> +	  caught immediately rather than causing difficult-to-diagnose
> +	  corruption.
> +
> +	  This is presently incompatible with KASAN because KASAN expects
> +	  the stack to map directly to the KASAN shadow map using a formula
> +	  that is incorrect if the stack is in vmalloc space.
> +
>  source "kernel/gcov/Kconfig"
> diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
> index 29bd59790d6c..c7026429816b 100644
> --- a/arch/ia64/include/asm/thread_info.h
> +++ b/arch/ia64/include/asm/thread_info.h
> @@ -56,7 +56,7 @@ struct thread_info {
>  #define alloc_thread_stack_node(tsk, node)	((unsigned long *) 0)
>  #define task_thread_info(tsk)	((struct thread_info *) 0)
>  #endif
> -#define free_thread_stack(ti)	/* nothing */
> +#define free_thread_stack(tsk)	/* nothing */
>  #define task_stack_page(tsk)	((void *)(tsk))
>  
>  #define __HAVE_THREAD_FUNCTIONS
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 62c68e513e39..20f9f47bcfd0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1923,6 +1923,9 @@ struct task_struct {
>  #ifdef CONFIG_MMU
>  	struct task_struct *oom_reaper_list;
>  #endif
> +#ifdef CONFIG_VMAP_STACK
> +	struct vm_struct *stack_vm_area;
> +#endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
>  /*
> @@ -1939,6 +1942,18 @@ extern int arch_task_struct_size __read_mostly;
>  # define arch_task_struct_size (sizeof(struct task_struct))
>  #endif
>  
> +#ifdef CONFIG_VMAP_STACK
> +static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
> +{
> +	return t->stack_vm_area;
> +}
> +#else
> +static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
> +{
> +	return NULL;
> +}
> +#endif
> +
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 52e725d4a866..05f7ef796fb4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -158,19 +158,39 @@ void __weak arch_release_thread_stack(unsigned long *stack)
>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
>   * kmemcache based allocator.
>   */
> -# if THREAD_SIZE >= PAGE_SIZE
> -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
> -						  int node)
> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
> +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
>  {
> +#ifdef CONFIG_VMAP_STACK
> +	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
> +					   VMALLOC_START, VMALLOC_END,
> +					   THREADINFO_GFP | __GFP_HIGHMEM,
> +					   PAGE_KERNEL,
> +					   0, node,
> +					   __builtin_return_address(0));
> +
> +	/*
> +	 * We can't call find_vm_area() in interrupt context, and
> +	 * free_thread_stack can be called in interrupt context, so cache
> +	 * the vm_struct.
> +	 */
> +	if (stack)
> +		tsk->stack_vm_area = find_vm_area(stack);
> +	return stack;
> +#else
>  	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
>  					     THREAD_SIZE_ORDER);
>  
>  	return page ? page_address(page) : NULL;
> +#endif
>  }
>  
> -static inline void free_thread_stack(unsigned long *stack)
> +static inline void free_thread_stack(struct task_struct *tsk)
>  {
> -	__free_pages(virt_to_page(stack), THREAD_SIZE_ORDER);
> +	if (task_stack_vm_area(tsk))
> +		vfree(tsk->stack);
> +	else
> +		__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_stack_cache;
> @@ -181,9 +201,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
>  	return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
>  }
>  
> -static void free_thread_stack(unsigned long *stack)
> +static void free_thread_stack(struct task_struct *tsk)
>  {
> -	kmem_cache_free(thread_stack_cache, stack);
> +	kmem_cache_free(thread_stack_cache, tsk->stack);
>  }
>  
>  void thread_stack_cache_init(void)
> @@ -213,24 +233,47 @@ struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>  
> -static void account_kernel_stack(unsigned long *stack, int account)
> +static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
> -	/* All stack pages are in the same zone and belong to the same memcg. */
> -	struct page *first_page = virt_to_page(stack);
> +	void *stack = task_stack_page(tsk);
> +	struct vm_struct *vm = task_stack_vm_area(tsk);
> +
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
> +
> +	if (vm) {
> +		int i;
> +
> +		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
> +
> +		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
> +			mod_zone_page_state(page_zone(vm->pages[i]),
> +					    NR_KERNEL_STACK_KB,
> +					    PAGE_SIZE / 1024 * account);
> +		}
>  
> -	mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
> -			    THREAD_SIZE / 1024 * account);
> +		/* All stack pages belong to the same memcg. */
> +		memcg_kmem_update_page_stat(vm->pages[0], MEMCG_KERNEL_STACK_KB,
> +					    account * (THREAD_SIZE / 1024));
> +	} else {
> +		/*
> +		 * All stack pages are in the same zone and belong to the
> +		 * same memcg.
> +		 */
> +		struct page *first_page = virt_to_page(stack);
> +
> +		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
> +				    THREAD_SIZE / 1024 * account);
>  
> -	memcg_kmem_update_page_stat(
> -		first_page, MEMCG_KERNEL_STACK_KB,
> -		account * (THREAD_SIZE / 1024));
> +		memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB,
> +					    account * (THREAD_SIZE / 1024));
> +	}
>  }
>  
>  void free_task(struct task_struct *tsk)
>  {
> -	account_kernel_stack(tsk->stack, -1);
> +	account_kernel_stack(tsk, -1);
>  	arch_release_thread_stack(tsk->stack);
> -	free_thread_stack(tsk->stack);
> +	free_thread_stack(tsk);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
>  	put_seccomp_filter(tsk);
> @@ -342,6 +385,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  {
>  	struct task_struct *tsk;
>  	unsigned long *stack;
> +	struct vm_struct *stack_vm_area;
>  	int err;
>  
>  	if (node == NUMA_NO_NODE)
> @@ -354,11 +398,23 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	if (!stack)
>  		goto free_tsk;
>  
> +	stack_vm_area = task_stack_vm_area(tsk);
> +
>  	err = arch_dup_task_struct(tsk, orig);
> +
> +	/*
> +	 * arch_dup_task_struct clobbers the stack-related fields.  Make
> +	 * sure they're properly initialized before using any stack-related
> +	 * functions again.
> +	 */
> +	tsk->stack = stack;
> +#ifdef CONFIG_VMAP_STACK
> +	tsk->stack_vm_area = stack_vm_area;
> +#endif
> +
>  	if (err)
>  		goto free_stack;
>  
> -	tsk->stack = stack;
>  #ifdef CONFIG_SECCOMP
>  	/*
>  	 * We must handle setting up seccomp filters once we're under
> @@ -390,14 +446,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	tsk->task_frag.page = NULL;
>  	tsk->wake_q.next = NULL;
>  
> -	account_kernel_stack(stack, 1);
> +	account_kernel_stack(tsk, 1);
>  
>  	kcov_task_init(tsk);
>  
>  	return tsk;
>  
>  free_stack:
> -	free_thread_stack(stack);
> +	free_thread_stack(tsk);
>  free_tsk:
>  	free_task_struct(tsk);
>  	return NULL;
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 0/3] virtually mapped stacks
  2016-08-11  9:35 [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-08-11  9:35 ` [PATCH v6 3/3] x86/mm/64: Enable " Andy Lutomirski
@ 2016-08-20 18:06 ` Andy Lutomirski
  2016-08-24 10:12   ` Ingo Molnar
  3 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-20 18:06 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst

On Thu, Aug 11, 2016 at 2:35 AM, Andy Lutomirski <luto@kernel.org> wrote:
> Andy Lutomirski (3):
>   fork: Add generic vmalloced stack support
>   dma-api: Teach the "DMA-from-stack" check about vmapped stacks
>   x86/mm/64: Enable vmapped stacks


Hi Ingo-

Is this a good format for this series?  Is there anything I need to do?


* Re: [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
  2016-08-15 11:55   ` Michal Hocko
@ 2016-08-24 10:03   ` Ingo Molnar
  2016-08-24 16:11     ` Dmitry Vyukov
  2016-08-24 13:02   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2016-08-24 16:51   ` [PATCH v6 1/3] " Josh Cartwright
  3 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2016-08-24 10:03 UTC (permalink / raw)
  To: Andy Lutomirski, Andrey Ryabinin, Alexander Potapenko, Dmitry Vyukov
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Oleg Nesterov,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra


* Andy Lutomirski <luto@kernel.org> wrote:

> +config VMAP_STACK
> +	default y
> +	bool "Use a virtually-mapped stack"
> +	depends on HAVE_ARCH_VMAP_STACK && !KASAN
> +	---help---
> +	  Enable this if you want to use virtually-mapped kernel stacks
> +	  with guard pages.  This causes kernel stack overflows to be
> +	  caught immediately rather than causing difficult-to-diagnose
> +	  corruption.
> +
> +	  This is presently incompatible with KASAN because KASAN expects
> +	  the stack to map directly to the KASAN shadow map using a formula
> +	  that is incorrect if the stack is in vmalloc space.

Btw., is this KASAN limitation fundamental?

As x86 is going to enable this feature by default, this probably limits KASAN 
utility rather significantly.

Thanks,

	Ingo


* Re: [PATCH v6 0/3] virtually mapped stacks
  2016-08-20 18:06 ` [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
@ 2016-08-24 10:12   ` Ingo Molnar
  2016-08-24 15:27     ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2016-08-24 10:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Aug 11, 2016 at 2:35 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > Andy Lutomirski (3):
> >   fork: Add generic vmalloced stack support
> >   dma-api: Teach the "DMA-from-stack" check about vmapped stacks
> >   x86/mm/64: Enable vmapped stacks
> 
> 
> Hi Ingo-
> 
> Is this a good format for this series?  Is there anything I need to do?

It's all looking good to me - just needed a quiet moment to line it all up, as 
it's a bit risky all around.

Thanks,

	Ingo


* [tip:x86/asm] fork: Add generic vmalloced stack support
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
  2016-08-15 11:55   ` Michal Hocko
  2016-08-24 10:03   ` Ingo Molnar
@ 2016-08-24 13:02   ` tip-bot for Andy Lutomirski
  2016-08-24 16:51   ` [PATCH v6 1/3] " Josh Cartwright
  3 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-08-24 13:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mhocko, mingo, torvalds, bp, jpoimboe, peterz, oleg, luto,
	dvlasenk, linux-kernel, hpa, brgerst, tglx, glider, aryabinin,
	dvyukov

Commit-ID:  ba14a194a434ccc8f733e263ad2ce941e35e5787
Gitweb:     http://git.kernel.org/tip/ba14a194a434ccc8f733e263ad2ce941e35e5787
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Thu, 11 Aug 2016 02:35:21 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:11:41 +0200

fork: Add generic vmalloced stack support

If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
__vmalloc_node_range().

Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
for a long time.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/Kconfig                        | 34 +++++++++++++
 arch/ia64/include/asm/thread_info.h |  2 +-
 include/linux/sched.h               | 15 ++++++
 kernel/fork.c                       | 96 +++++++++++++++++++++++++++++--------
 4 files changed, 126 insertions(+), 21 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index e9c9334..9ecf9f6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -707,4 +707,38 @@ config ARCH_NO_COHERENT_DMA_MMAP
 config CPU_NO_EFFICIENT_FFS
 	def_bool n
 
+config HAVE_ARCH_VMAP_STACK
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stacks
+	  in vmalloc space.  This means:
+
+	  - vmalloc space must be large enough to hold many kernel stacks.
+	    This may rule out many 32-bit architectures.
+
+	  - Stacks in vmalloc space need to work reliably.  For example, if
+	    vmap page tables are created on demand, either this mechanism
+	    needs to work while the stack points to a virtual address with
+	    unpopulated page tables or arch code (switch_to() and switch_mm(),
+	    most likely) needs to ensure that the stack's page table entries
+	    are populated before running on a possibly unpopulated stack.
+
+	  - If the stack overflows into a guard page, something reasonable
+	    should happen.  The definition of "reasonable" is flexible, but
+	    instantly rebooting without logging anything would be unfriendly.
+
+config VMAP_STACK
+	default y
+	bool "Use a virtually-mapped stack"
+	depends on HAVE_ARCH_VMAP_STACK && !KASAN
+	---help---
+	  Enable this if you want to use virtually-mapped kernel stacks
+	  with guard pages.  This causes kernel stack overflows to be
+	  caught immediately rather than causing difficult-to-diagnose
+	  corruption.
+
+	  This is presently incompatible with KASAN because KASAN expects
+	  the stack to map directly to the KASAN shadow map using a formula
+	  that is incorrect if the stack is in vmalloc space.
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index 29bd597..c702642 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -56,7 +56,7 @@ struct thread_info {
 #define alloc_thread_stack_node(tsk, node)	((unsigned long *) 0)
 #define task_thread_info(tsk)	((struct thread_info *) 0)
 #endif
-#define free_thread_stack(ti)	/* nothing */
+#define free_thread_stack(tsk)	/* nothing */
 #define task_stack_page(tsk)	((void *)(tsk))
 
 #define __HAVE_THREAD_FUNCTIONS
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..20f9f47 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1923,6 +1923,9 @@ struct task_struct {
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
 #endif
+#ifdef CONFIG_VMAP_STACK
+	struct vm_struct *stack_vm_area;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -1939,6 +1942,18 @@ extern int arch_task_struct_size __read_mostly;
 # define arch_task_struct_size (sizeof(struct task_struct))
 #endif
 
+#ifdef CONFIG_VMAP_STACK
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return t->stack_vm_area;
+}
+#else
+static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
+{
+	return NULL;
+}
+#endif
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d..9b85f6b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -158,19 +158,39 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
  * kmemcache based allocator.
  */
-# if THREAD_SIZE >= PAGE_SIZE
-static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
-						  int node)
+# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
+static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
+#ifdef CONFIG_VMAP_STACK
+	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
+					   VMALLOC_START, VMALLOC_END,
+					   THREADINFO_GFP | __GFP_HIGHMEM,
+					   PAGE_KERNEL,
+					   0, node,
+					   __builtin_return_address(0));
+
+	/*
+	 * We can't call find_vm_area() in interrupt context, and
+	 * free_thread_stack() can be called in interrupt context,
+	 * so cache the vm_struct.
+	 */
+	if (stack)
+		tsk->stack_vm_area = find_vm_area(stack);
+	return stack;
+#else
 	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
+#endif
 }
 
-static inline void free_thread_stack(unsigned long *stack)
+static inline void free_thread_stack(struct task_struct *tsk)
 {
-	__free_pages(virt_to_page(stack), THREAD_SIZE_ORDER);
+	if (task_stack_vm_area(tsk))
+		vfree(tsk->stack);
+	else
+		__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
@@ -181,9 +201,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
 	return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
 }
 
-static void free_thread_stack(unsigned long *stack)
+static void free_thread_stack(struct task_struct *tsk)
 {
-	kmem_cache_free(thread_stack_cache, stack);
+	kmem_cache_free(thread_stack_cache, tsk->stack);
 }
 
 void thread_stack_cache_init(void)
@@ -213,24 +233,47 @@ struct kmem_cache *vm_area_cachep;
 /* SLAB cache for mm_struct structures (tsk->mm) */
 static struct kmem_cache *mm_cachep;
 
-static void account_kernel_stack(unsigned long *stack, int account)
+static void account_kernel_stack(struct task_struct *tsk, int account)
 {
-	/* All stack pages are in the same zone and belong to the same memcg. */
-	struct page *first_page = virt_to_page(stack);
+	void *stack = task_stack_page(tsk);
+	struct vm_struct *vm = task_stack_vm_area(tsk);
+
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
+
+	if (vm) {
+		int i;
+
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
+
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
+			mod_zone_page_state(page_zone(vm->pages[i]),
+					    NR_KERNEL_STACK_KB,
+					    PAGE_SIZE / 1024 * account);
+		}
 
-	mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
-			    THREAD_SIZE / 1024 * account);
+		/* All stack pages belong to the same memcg. */
+		memcg_kmem_update_page_stat(vm->pages[0], MEMCG_KERNEL_STACK_KB,
+					    account * (THREAD_SIZE / 1024));
+	} else {
+		/*
+		 * All stack pages are in the same zone and belong to the
+		 * same memcg.
+		 */
+		struct page *first_page = virt_to_page(stack);
+
+		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
+				    THREAD_SIZE / 1024 * account);
 
-	memcg_kmem_update_page_stat(
-		first_page, MEMCG_KERNEL_STACK_KB,
-		account * (THREAD_SIZE / 1024));
+		memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB,
+					    account * (THREAD_SIZE / 1024));
+	}
 }
 
 void free_task(struct task_struct *tsk)
 {
-	account_kernel_stack(tsk->stack, -1);
+	account_kernel_stack(tsk, -1);
 	arch_release_thread_stack(tsk->stack);
-	free_thread_stack(tsk->stack);
+	free_thread_stack(tsk);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -342,6 +385,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 {
 	struct task_struct *tsk;
 	unsigned long *stack;
+	struct vm_struct *stack_vm_area;
 	int err;
 
 	if (node == NUMA_NO_NODE)
@@ -354,11 +398,23 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (!stack)
 		goto free_tsk;
 
+	stack_vm_area = task_stack_vm_area(tsk);
+
 	err = arch_dup_task_struct(tsk, orig);
+
+	/*
+	 * arch_dup_task_struct() clobbers the stack-related fields.  Make
+	 * sure they're properly initialized before using any stack-related
+	 * functions again.
+	 */
+	tsk->stack = stack;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = stack_vm_area;
+#endif
+
 	if (err)
 		goto free_stack;
 
-	tsk->stack = stack;
 #ifdef CONFIG_SECCOMP
 	/*
 	 * We must handle setting up seccomp filters once we're under
@@ -390,14 +446,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 
-	account_kernel_stack(stack, 1);
+	account_kernel_stack(tsk, 1);
 
 	kcov_task_init(tsk);
 
 	return tsk;
 
 free_stack:
-	free_thread_stack(stack);
+	free_thread_stack(tsk);
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;


* [tip:x86/asm] dma-api: Teach the "DMA-from-stack" check about vmapped stacks
  2016-08-11  9:35 ` [PATCH v6 2/3] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
@ 2016-08-24 13:02   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-08-24 13:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, bp, torvalds, mingo, tglx, dvlasenk, brgerst, luto, hpa,
	arnd, jpoimboe, linux-kernel, akpm

Commit-ID:  b4a0f533e5976cb1a79f31d6152e1d322d79b7f1
Gitweb:     http://git.kernel.org/tip/b4a0f533e5976cb1a79f31d6152e1d322d79b7f1
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Thu, 11 Aug 2016 02:35:22 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:11:41 +0200

dma-api: Teach the "DMA-from-stack" check about vmapped stacks

If we're using CONFIG_VMAP_STACK=y and we manage to point an sg entry
at the stack, then either the sg page will be in highmem or sg_virt()
will return the direct-map alias.  In neither case will the existing
check_for_stack() implementation realize that it's a stack page.

Fix it by explicitly checking for stack pages.

This has no effect by itself.  It's broken out for ease of review.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/448460622731312298bf19dcbacb1606e75de7a9.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 lib/dma-debug.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index fcfa193..06f02f6 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -22,6 +22,7 @@
 #include <linux/stacktrace.h>
 #include <linux/dma-debug.h>
 #include <linux/spinlock.h>
+#include <linux/vmalloc.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
 #include <linux/export.h>
@@ -1164,11 +1165,32 @@ static void check_unmap(struct dma_debug_entry *ref)
 	put_hash_bucket(bucket, &flags);
 }
 
-static void check_for_stack(struct device *dev, void *addr)
+static void check_for_stack(struct device *dev,
+			    struct page *page, size_t offset)
 {
-	if (object_is_on_stack(addr))
-		err_printk(dev, NULL, "DMA-API: device driver maps memory from "
-				"stack [addr=%p]\n", addr);
+	void *addr;
+	struct vm_struct *stack_vm_area = task_stack_vm_area(current);
+
+	if (!stack_vm_area) {
+		/* Stack is direct-mapped. */
+		if (PageHighMem(page))
+			return;
+		addr = page_address(page) + offset;
+		if (object_is_on_stack(addr))
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [addr=%p]\n", addr);
+	} else {
+		/* Stack is vmalloced. */
+		int i;
+
+		for (i = 0; i < stack_vm_area->nr_pages; i++) {
+			if (page != stack_vm_area->pages[i])
+				continue;
+
+			addr = (u8 *)current->stack + i * PAGE_SIZE + offset;
+			err_printk(dev, NULL, "DMA-API: device driver maps memory from stack [probable addr=%p]\n", addr);
+			break;
+		}
+	}
 }
 
 static inline bool overlap(void *addr, unsigned long len, void *start, void *end)
@@ -1291,10 +1313,11 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
 	if (map_single)
 		entry->type = dma_debug_single;
 
+	check_for_stack(dev, page, offset);
+
 	if (!PageHighMem(page)) {
 		void *addr = page_address(page) + offset;
 
-		check_for_stack(dev, addr);
 		check_for_illegal_area(dev, addr, size);
 	}
 
@@ -1386,8 +1409,9 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		entry->sg_call_ents   = nents;
 		entry->sg_mapped_ents = mapped_ents;
 
+		check_for_stack(dev, sg_page(s), s->offset);
+
 		if (!PageHighMem(sg_page(s))) {
-			check_for_stack(dev, sg_virt(s));
 			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
 		}
 


* [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
  2016-08-11  9:35 ` [PATCH v6 3/3] x86/mm/64: Enable " Andy Lutomirski
@ 2016-08-24 13:03   ` tip-bot for Andy Lutomirski
  2016-10-21 12:32       ` Matt Fleming
  0 siblings, 1 reply; 27+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-08-24 13:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, torvalds, mingo, bp, luto, peterz, dvlasenk,
	brgerst, jpoimboe, nadav.amit, tglx

Commit-ID:  e37e43a497d5a8b7c0cc1736d56986f432c394c9
Gitweb:     http://git.kernel.org/tip/e37e43a497d5a8b7c0cc1736d56986f432c394c9
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Thu, 11 Aug 2016 02:35:23 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Aug 2016 12:11:42 +0200

x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)

This allows x86_64 kernels to enable vmapped stacks by setting
HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y
high level Kconfig option.

There are a couple of interesting bits:

First, x86 lazily faults in top-level paging entries for the vmalloc
area.  This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.

Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.

I didn't enable it on x86_32.  We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.

This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack.  Specifically, we'll get #PF and make
it to no_context and them oops without reliably triggering a
double-fault, and no_context doesn't know about stack overflows.
The next patch will improve that case.

Thank you to Nadav and Brian for helping me pay enough attention to
the SDM to hopefully get this right.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c88f3e2920b18e6cc621d772a04a62c06869037e.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig                 |  1 +
 arch/x86/include/asm/switch_to.h | 28 +++++++++++++++++-
 arch/x86/kernel/traps.c          | 61 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                | 15 ++++++++++
 4 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c580d8c..21a6d0e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -94,6 +94,7 @@ config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_EBPF_JIT			if X86_64
+	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1..14e4b20 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,28 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+/* This runs on the previous thread's stack. */
+static inline void prepare_switch_to(struct task_struct *prev,
+				     struct task_struct *next)
+{
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we switch to a stack that has a top-level paging entry
+	 * that is not present in the current mm, the resulting #PF
+	 * will be promoted to a double-fault and we'll panic.  Probe
+	 * the new stack now so that vmalloc_fault can fix up the page
+	 * tables if needed.  This can only happen if we use a stack
+	 * in vmap space.
+	 *
+	 * We assume that the stack is aligned so that it never spans
+	 * more than one top-level paging entry.
+	 *
+	 * To minimize cache pollution, just follow the stack pointer.
+	 */
+	READ_ONCE(*(unsigned char *)next->thread.sp);
+#endif
+}
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
@@ -39,6 +61,8 @@ do {									\
 	 */								\
 	unsigned long ebx, ecx, edx, esi, edi;				\
 									\
+	prepare_switch_to(prev, next);					\
+									\
 	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
 		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
 		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
@@ -103,7 +127,9 @@ do {									\
  * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
  * has no effect.
  */
-#define switch_to(prev, next, last) \
+#define switch_to(prev, next, last)					  \
+	prepare_switch_to(prev, next);					  \
+									  \
 	asm volatile(SAVE_CONTEXT					  \
 	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
 	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b70ca12..907b4e4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",	segment_not_present)
 DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",		stack_segment)
 DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",		alignment_check)
 
+#ifdef CONFIG_VMAP_STACK
+static void __noreturn handle_stack_overflow(const char *message,
+					     struct pt_regs *regs,
+					     unsigned long fault_address)
+{
+	printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
+		 (void *)fault_address, current->stack,
+		 (char *)current->stack + THREAD_SIZE - 1);
+	die(message, regs, 0);
+
+	/* Be absolutely certain we don't return. */
+	panic(message);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 {
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
+#ifdef CONFIG_VMAP_STACK
+	unsigned long cr2;
+#endif
 
 #ifdef CONFIG_X86_ESPFIX64
 	extern unsigned char native_irq_return_iret[];
@@ -332,6 +350,49 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
+#ifdef CONFIG_VMAP_STACK
+	/*
+	 * If we overflow the stack into a guard page, the CPU will fail
+	 * to deliver #PF and will send #DF instead.  Similarly, if we
+	 * take any non-IST exception while too close to the bottom of
+	 * the stack, the processor will get a page fault while
+	 * delivering the exception and will generate a double fault.
+	 *
+	 * According to the SDM (footnote in 6.15 under "Interrupt 14 -
+	 * Page-Fault Exception (#PF)"):
+	 *
+	 *   Processors update CR2 whenever a page fault is detected. If a
+	 *   second page fault occurs while an earlier page fault is being
+	 *   delivered, the faulting linear address of the second fault will
+	 *   overwrite the contents of CR2 (replacing the previous
+	 *   address). These updates to CR2 occur even if the page fault
+	 *   results in a double fault or occurs during the delivery of a
+	 *   double fault.
+	 *
+	 * The logic below has a small possibility of incorrectly diagnosing
+	 * some errors as stack overflows.  For example, if the IDT or GDT
+	 * gets corrupted such that #GP delivery fails due to a bad descriptor
+	 * causing #GP and we hit this condition while CR2 coincidentally
+	 * points to the stack guard page, we'll think we overflowed the
+	 * stack.  Given that we're going to panic one way or another
+	 * if this happens, this isn't necessarily worth fixing.
+	 *
+	 * If necessary, we could improve the test by only diagnosing
+	 * a stack overflow if the saved RSP points within 47 bytes of
+	 * the bottom of the stack: if RSP == tsk_stack + 48 and we
+	 * take an exception, the stack is already aligned and there
+	 * will be enough room for SS, RSP, RFLAGS, CS, RIP, and a
+	 * possible error code, so a stack overflow would *not* double
+	 * fault.  With any less space left, exception delivery could
+	 * fail, and, as a practical matter, we've overflowed the
+	 * stack even if the actual trigger for the double fault was
+	 * something else.
+	 */
+	cr2 = read_cr2();
+	if ((unsigned long)task_stack_page(tsk) - 1 - cr2 < PAGE_SIZE)
+		handle_stack_overflow("kernel stack overflow (double-fault)", regs, cr2);
+#endif
+
 #ifdef CONFIG_DOUBLEFAULT
 	df_debug(regs, error_code);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4dbe656..a7655f6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -77,10 +77,25 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
+			/*
+			 * If our current stack is in vmalloc space and isn't
+			 * mapped in the new pgd, we'll double-fault.  Forcibly
+			 * map it.
+			 */
+			unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
+
+			pgd_t *pgd = next->pgd + stack_pgd_index;
+
+			if (unlikely(pgd_none(*pgd)))
+				set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+		}
+
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
 #endif
+
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
 		/*

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 0/3] virtually mapped stacks
  2016-08-24 10:12   ` Ingo Molnar
@ 2016-08-24 15:27     ` Andy Lutomirski
  2016-08-24 17:16       ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-24 15:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Linus Torvalds

On Wed, Aug 24, 2016 at 3:12 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Thu, Aug 11, 2016 at 2:35 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> > Andy Lutomirski (3):
>> >   fork: Add generic vmalloced stack support
>> >   dma-api: Teach the "DMA-from-stack" check about vmapped stacks
>> >   x86/mm/64: Enable vmapped stacks
>>
>>
>> Hi Ingo-
>>
>> Is this a good format for this series?  Is there anything I need to do?
>
> It's all looking good to me - just needed a quiet moment to line it all up, as
> it's a bit risky all around.
>

Thanks!

I've rebased the rest of the pile onto tip:x86/asm (Brian's switch_to
changes conflicted a bit).  I'll test it a bit more, let the 0-day bot
chew on it a bit, and I'll email it out in a couple of days.

* Re: [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-24 10:03   ` Ingo Molnar
@ 2016-08-24 16:11     ` Dmitry Vyukov
  0 siblings, 0 replies; 27+ messages in thread
From: Dmitry Vyukov @ 2016-08-24 16:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Andrey Ryabinin, Alexander Potapenko, x86,
	Borislav Petkov, LKML, Brian Gerst, Oleg Nesterov,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra

On Wed, Aug 24, 2016 at 3:03 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> +config VMAP_STACK
>> +     default y
>> +     bool "Use a virtually-mapped stack"
>> +     depends on HAVE_ARCH_VMAP_STACK && !KASAN
>> +     ---help---
>> +       Enable this if you want the use of virtually-mapped kernel stacks
>> +       with guard pages.  This causes kernel stack overflows to be
>> +       caught immediately rather than causing difficult-to-diagnose
>> +       corruption.
>> +
>> +       This is presently incompatible with KASAN because KASAN expects
>> +       the stack to map directly to the KASAN shadow map using a formula
>> +       that is incorrect if the stack is in vmalloc space.
>
> Btw., is this KASAN limitation fundamental?
>
> As x86 is going to enable this feature by default, this probably limits KASAN
> utility rather significantly.


No, it's not fundamental.

KASAN has shadow memory for the vmalloc range, but currently we map a
single read-only zero page there (which means "this memory is good").
Stack instrumentation tries to write to that read-only page, which
causes a crash.

Andrey proposed that we can map some real writable pages in the shadow
range when we allocate a vmalloc-ed stack:
https://groups.google.com/d/msg/kasan-dev/0YxqFs9r0V8/OKoGHQL8BAAJ

* Re: [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
                     ` (2 preceding siblings ...)
  2016-08-24 13:02   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
@ 2016-08-24 16:51   ` Josh Cartwright
  2016-08-30 22:01     ` Andy Lutomirski
  3 siblings, 1 reply; 27+ messages in thread
From: Josh Cartwright @ 2016-08-24 16:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Oleg Nesterov

Hey Andy-

Small non-critical/potential future optimization comment below:

On Thu, Aug 11, 2016 at 02:35:21AM -0700, Andy Lutomirski wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.
> 
> grsecurity has had a similar feature (called
> GRKERNSEC_KSTACKOVERFLOW) for a long time.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
[..]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 52e725d4a866..05f7ef796fb4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -158,19 +158,39 @@ void __weak arch_release_thread_stack(unsigned long *stack)
>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
>   * kmemcache based allocator.
>   */
> -# if THREAD_SIZE >= PAGE_SIZE
> -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
> -						  int node)
> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
> +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
>  {
> +#ifdef CONFIG_VMAP_STACK
> +	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
> +					   VMALLOC_START, VMALLOC_END,
> +					   THREADINFO_GFP | __GFP_HIGHMEM,
> +					   PAGE_KERNEL,
> +					   0, node,
> +					   __builtin_return_address(0));
> +
> +	/*
> +	 * We can't call find_vm_area() in interrupt context, and
> +	 * free_thread_stack can be called in interrupt context, so cache
> +	 * the vm_struct.
> +	 */
> +	if (stack)
> +		tsk->stack_vm_area = find_vm_area(stack);

This is annoying: we end up having to walk the vm_area tree twice (once
for the allocation, then again here to get a handle on the area).

Perhaps it's time the vmalloc code learned an allocation API that
returned the vm_area handle as well.

  Josh

* Re: [PATCH v6 0/3] virtually mapped stacks
  2016-08-24 15:27     ` Andy Lutomirski
@ 2016-08-24 17:16       ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2016-08-24 17:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Aug 24, 2016 at 3:12 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> On Thu, Aug 11, 2016 at 2:35 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >> > Andy Lutomirski (3):
> >> >   fork: Add generic vmalloced stack support
> >> >   dma-api: Teach the "DMA-from-stack" check about vmapped stacks
> >> >   x86/mm/64: Enable vmapped stacks
> >>
> >>
> >> Hi Ingo-
> >>
> >> Is this a good format for this series?  Is there anything I need to do?
> >
> > It's all looking good to me - just needed a quiet moment to line it all up, as
> > it's a bit risky all around.
> >
> 
> Thanks!
> 
> I've rebased the rest of the pile onto tip:x86/asm (Brian's switch_to
> changes conflicted a bit).  I'll test it a bit more, let the 0-day bot
> chew on it a bit, and I'll email it out in a couple of days.

Sounds good to me!

Thanks,

	Ingo

* Re: [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-24 16:51   ` [PATCH v6 1/3] " Josh Cartwright
@ 2016-08-30 22:01     ` Andy Lutomirski
  2016-08-30 22:26       ` Josh Cartwright
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2016-08-30 22:01 UTC (permalink / raw)
  To: Josh Cartwright
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Oleg Nesterov

On Wed, Aug 24, 2016 at 9:51 AM, Josh Cartwright <joshc@ni.com> wrote:
> Hey Andy-
>
> Small non-critical/potential future optimization comment below:
>
> On Thu, Aug 11, 2016 at 02:35:21AM -0700, Andy Lutomirski wrote:
>> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
>> vmalloc_node.
>>
>> grsecurity has had a similar feature (called
>> GRKERNSEC_KSTACKOVERFLOW) for a long time.
>>
>> Cc: Oleg Nesterov <oleg@redhat.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
> [..]
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 52e725d4a866..05f7ef796fb4 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -158,19 +158,39 @@ void __weak arch_release_thread_stack(unsigned long *stack)
>>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
>>   * kmemcache based allocator.
>>   */
>> -# if THREAD_SIZE >= PAGE_SIZE
>> -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
>> -                                               int node)
>> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
>> +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
>>  {
>> +#ifdef CONFIG_VMAP_STACK
>> +     void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
>> +                                        VMALLOC_START, VMALLOC_END,
>> +                                        THREADINFO_GFP | __GFP_HIGHMEM,
>> +                                        PAGE_KERNEL,
>> +                                        0, node,
>> +                                        __builtin_return_address(0));
>> +
>> +     /*
>> +      * We can't call find_vm_area() in interrupt context, and
>> +      * free_thread_stack can be called in interrupt context, so cache
>> +      * the vm_struct.
>> +      */
>> +     if (stack)
>> +             tsk->stack_vm_area = find_vm_area(stack);
>
> This is annoying: we end up having to walk the vm_area tree twice (once
> for the allocation, then again here to get a handle on the area).
>
> Perhaps it's time the vmalloc code learned an allocation API that
> returned the vm_area handle as well.
>

Agreed.  I may do this once everything else lands.

* Re: [PATCH v6 1/3] fork: Add generic vmalloced stack support
  2016-08-30 22:01     ` Andy Lutomirski
@ 2016-08-30 22:26       ` Josh Cartwright
  0 siblings, 0 replies; 27+ messages in thread
From: Josh Cartwright @ 2016-08-30 22:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Oleg Nesterov

On Tue, Aug 30, 2016 at 03:01:51PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 24, 2016 at 9:51 AM, Josh Cartwright <joshc@ni.com> wrote:
[..]
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 52e725d4a866..05f7ef796fb4 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -158,19 +158,39 @@ void __weak arch_release_thread_stack(unsigned long *stack)
> >>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
> >>   * kmemcache based allocator.
> >>   */
> >> -# if THREAD_SIZE >= PAGE_SIZE
> >> -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
> >> -                                               int node)
> >> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
> >> +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
> >>  {
> >> +#ifdef CONFIG_VMAP_STACK
> >> +     void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
> >> +                                        VMALLOC_START, VMALLOC_END,
> >> +                                        THREADINFO_GFP | __GFP_HIGHMEM,
> >> +                                        PAGE_KERNEL,
> >> +                                        0, node,
> >> +                                        __builtin_return_address(0));
> >> +
> >> +     /*
> >> +      * We can't call find_vm_area() in interrupt context, and
> >> +      * free_thread_stack can be called in interrupt context, so cache
> >> +      * the vm_struct.
> >> +      */
> >> +     if (stack)
> >> +             tsk->stack_vm_area = find_vm_area(stack);
> >
> > This is annoying: we end up having to walk the vm_area tree twice (once
> > for the allocation, then again here to get a handle on the area).
> >
> > Perhaps it's time the vmalloc code learned an allocation API that
> > returned the vm_area handle as well.
> 
> Agreed.  I may do this once everything else lands.

There are at least a few other places that could benefit from this,
doing a quick scan of find_vm_area() callers: vmalloc_{32_,}_user() and
kasan_module_alloc().

  Josh

* Re: [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
@ 2016-10-21 12:32       ` Matt Fleming
  0 siblings, 0 replies; 27+ messages in thread
From: Matt Fleming @ 2016-10-21 12:32 UTC (permalink / raw)
  To: jpoimboe, nadav.amit, brgerst, tglx, torvalds, hpa, mingo,
	linux-kernel, dvlasenk, luto, peterz, bp
  Cc: linux-tip-commits, Ard Biesheuvel, linux-efi

On Wed, 24 Aug, at 06:03:04AM, tip-bot for Andy Lutomirski wrote:
> Commit-ID:  e37e43a497d5a8b7c0cc1736d56986f432c394c9
> Gitweb:     http://git.kernel.org/tip/e37e43a497d5a8b7c0cc1736d56986f432c394c9
> Author:     Andy Lutomirski <luto@kernel.org>
> AuthorDate: Thu, 11 Aug 2016 02:35:23 -0700
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 24 Aug 2016 12:11:42 +0200
> 
> x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
> 
> This allows x86_64 kernels to enable vmapped stacks by setting
> HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y
> high level Kconfig option.
> 
> There are a couple of interesting bits:

This commit broke booting EFI mixed mode kernels. Here's what I've got
queued up to fix it.

---
>From acf11e55bfcef7a1dca7d1735f4a780e0cdb1c89 Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@codeblueprint.co.uk>
Date: Thu, 20 Oct 2016 22:17:21 +0100
Subject: [PATCH] x86/efi: Prevent mixed mode boot corruption with
 CONFIG_VMAP_STACK

Booting an EFI mixed mode kernel has been crashing since commit:

  e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)")

The user-visible effect in my test setup was the kernel being unable
to find the root file system ramdisk. This was likely caused by silent
memory or page table corruption.

Enabling CONFIG_DEBUG_VIRTUAL immediately flagged the thunking code as
abusing virt_to_phys() because it was passing addresses that were not
part of the kernel direct mapping.

Use the slow version instead, which correctly handles all memory
regions by performing a page table walk.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/platform/efi/efi_64.c | 59 +++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 58b0f801f66f..e3569a00885b 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -211,6 +211,17 @@ void efi_sync_low_kernel_mappings(void)
 	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 }
 
+/*
+ * Wrapper for slow_virt_to_phys() that handles NULL addresses.
+ */
+static inline phys_addr_t virt_to_phys_or_null(void *va)
+{
+	if (!va)
+		return 0;
+
+	return slow_virt_to_phys(va);
+}
+
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	unsigned long pfn, text;
@@ -251,7 +262,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	if (!page)
 		panic("Unable to allocate EFI runtime stack < 4GB\n");
 
-	efi_scratch.phys_stack = virt_to_phys(page_address(page));
+	efi_scratch.phys_stack = virt_to_phys_or_null(page_address(page));
 	efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
 
 	npages = (_etext - _text) >> PAGE_SHIFT;
@@ -494,8 +505,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
-	phys_tc = virt_to_phys(tc);
+	phys_tm = virt_to_phys_or_null(tm);
+	phys_tc = virt_to_phys_or_null(tc);
 
 	status = efi_thunk(get_time, phys_tm, phys_tc);
 
@@ -511,7 +522,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_time, phys_tm);
 
@@ -529,9 +540,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending,
 
 	spin_lock(&rtc_lock);
 
-	phys_enabled = virt_to_phys(enabled);
-	phys_pending = virt_to_phys(pending);
-	phys_tm = virt_to_phys(tm);
+	phys_enabled = virt_to_phys_or_null(enabled);
+	phys_pending = virt_to_phys_or_null(pending);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(get_wakeup_time, phys_enabled,
 			     phys_pending, phys_tm);
@@ -549,7 +560,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_wakeup_time, enabled, phys_tm);
 
@@ -567,11 +578,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_attr;
 	u32 phys_data_size, phys_data;
 
-	phys_data_size = virt_to_phys(data_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
-	phys_attr = virt_to_phys(attr);
-	phys_data = virt_to_phys(data);
+	phys_data_size = virt_to_phys_or_null(data_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null(name);
+	phys_attr = virt_to_phys_or_null(attr);
+	phys_data = virt_to_phys_or_null(data);
 
 	status = efi_thunk(get_variable, phys_name, phys_vendor,
 			   phys_attr, phys_data_size, phys_data);
@@ -586,9 +597,9 @@ efi_thunk_set_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_data;
 	efi_status_t status;
 
-	phys_name = virt_to_phys(name);
-	phys_vendor = virt_to_phys(vendor);
-	phys_data = virt_to_phys(data);
+	phys_name = virt_to_phys_or_null(name);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_data = virt_to_phys_or_null(data);
 
 	/* If data_size is > sizeof(u32) we've got problems */
 	status = efi_thunk(set_variable, phys_name, phys_vendor,
@@ -605,9 +616,9 @@ efi_thunk_get_next_variable(unsigned long *name_size,
 	efi_status_t status;
 	u32 phys_name_size, phys_name, phys_vendor;
 
-	phys_name_size = virt_to_phys(name_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
+	phys_name_size = virt_to_phys_or_null(name_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null(name);
 
 	status = efi_thunk(get_next_variable, phys_name_size,
 			   phys_name, phys_vendor);
@@ -621,7 +632,7 @@ efi_thunk_get_next_high_mono_count(u32 *count)
 	efi_status_t status;
 	u32 phys_count;
 
-	phys_count = virt_to_phys(count);
+	phys_count = virt_to_phys_or_null(count);
 	status = efi_thunk(get_next_high_mono_count, phys_count);
 
 	return status;
@@ -633,7 +644,7 @@ efi_thunk_reset_system(int reset_type, efi_status_t status,
 {
 	u32 phys_data;
 
-	phys_data = virt_to_phys(data);
+	phys_data = virt_to_phys_or_null(data);
 
 	efi_thunk(reset_system, reset_type, status, data_size, phys_data);
 }
@@ -661,9 +672,9 @@ efi_thunk_query_variable_info(u32 attr, u64 *storage_space,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	phys_storage = virt_to_phys(storage_space);
-	phys_remaining = virt_to_phys(remaining_space);
-	phys_max = virt_to_phys(max_variable_size);
+	phys_storage = virt_to_phys_or_null(storage_space);
+	phys_remaining = virt_to_phys_or_null(remaining_space);
+	phys_max = virt_to_phys_or_null(max_variable_size);
 
 	status = efi_thunk(query_variable_info, attr, phys_storage,
 			   phys_remaining, phys_max);
-- 
2.10.0

* Re: [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
@ 2016-10-22  0:18         ` Andy Lutomirski
  0 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2016-10-22  0:18 UTC (permalink / raw)
  To: Matt Fleming
  Cc: linux-efi, Brian Gerst, linux-tip-commits, Thomas Gleixner,
	Ingo Molnar, Linus Torvalds, Ard Biesheuvel, Josh Poimboeuf,
	Denys Vlasenko, H. Peter Anvin, linux-kernel, Nadav Amit,
	Borislav Petkov, Peter Zijlstra

On Oct 21, 2016 5:32 AM, "Matt Fleming" <matt@codeblueprint.co.uk> wrote:
>
> On Wed, 24 Aug, at 06:03:04AM, tip-bot for Andy Lutomirski wrote:
> > Commit-ID:  e37e43a497d5a8b7c0cc1736d56986f432c394c9
> > Gitweb:     http://git.kernel.org/tip/e37e43a497d5a8b7c0cc1736d56986f432c394c9
> > Author:     Andy Lutomirski <luto@kernel.org>
> > AuthorDate: Thu, 11 Aug 2016 02:35:23 -0700
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Wed, 24 Aug 2016 12:11:42 +0200
> >
> > x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
> >
> > This allows x86_64 kernels to enable vmapped stacks by setting
> > HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y
> > high level Kconfig option.
> >
> > There are a couple of interesting bits:
>
> This commit broke booting EFI mixed mode kernels. Here's what I've got
> queued up to fix it.
>
> ---
> From acf11e55bfcef7a1dca7d1735f4a780e0cdb1c89 Mon Sep 17 00:00:00 2001
> From: Matt Fleming <matt@codeblueprint.co.uk>
> Date: Thu, 20 Oct 2016 22:17:21 +0100
> Subject: [PATCH] x86/efi: Prevent mixed mode boot corruption with
>  CONFIG_VMAP_STACK
>
> Booting an EFI mixed mode kernel has been crashing since commit:
>
>   e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)")
>
> The user-visible effect in my test setup was the kernel being unable
> to find the root file system ramdisk. This was likely caused by silent
> memory or page table corruption.
>
> Enabling CONFIG_DEBUG_VIRTUAL immediately flagged the thunking code as
> abusing virt_to_phys() because it was passing addresses that were not
> part of the kernel direct mapping.
>
> Use the slow version instead, which correctly handles all memory
> regions by performing a page table walk.
>
> Suggested-by: Andy Lutomirski <luto@amacapital.net>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  arch/x86/platform/efi/efi_64.c | 59 +++++++++++++++++++++++++-----------------
>  1 file changed, 35 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 58b0f801f66f..e3569a00885b 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -211,6 +211,17 @@ void efi_sync_low_kernel_mappings(void)
>         memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
>  }
>
> +/*
> + * Wrapper for slow_virt_to_phys() that handles NULL addresses.
> + */
> +static inline phys_addr_t virt_to_phys_or_null(void *va)
> +{
> +       if (!va)
> +               return 0;
> +
> +       return slow_virt_to_phys(va);
> +}
> +

This is asking for trouble if any of the variable-length parameters
are on the stack.  How about adding a "size_t size" parameter and
doing:

if (!va) {
  return 0;
} else if (virt_addr_valid(va)) {
  return virt_to_phys(va);
} else {
  /*
   * A fully aligned variable on the stack is guaranteed not to
   * cross a page boundary.
   */
  WARN_ON(!IS_ALIGNED((uintptr_t)va, size) || size > PAGE_SIZE);
  return slow_virt_to_phys(va);
}

* Re: [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
@ 2016-10-24 13:09           ` Matt Fleming
  0 siblings, 0 replies; 27+ messages in thread
From: Matt Fleming @ 2016-10-24 13:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-efi, Brian Gerst, linux-tip-commits, Thomas Gleixner,
	Ingo Molnar, Linus Torvalds, Ard Biesheuvel, Josh Poimboeuf,
	Denys Vlasenko, H. Peter Anvin, linux-kernel, Nadav Amit,
	Borislav Petkov, Peter Zijlstra

On Fri, 21 Oct, at 05:18:30PM, Andy Lutomirski wrote:
> 
> This is asking for trouble if any of the variable length parameters
> are on the stack.  How about adding a "size_t size" parameter and
> doing:
> 
> if (!va) {
>   return 0;
> } else if (virt_addr_valid(va)) {
>   return virt_to_phys(va);
> } else {
>   /* A fully aligned variable on the stack is guaranteed not to cross
> a page boundary. */
>   WARN_ON(!IS_ALIGNED((uintptr_t)va, size) || size > PAGE_SIZE);
>   return slow_virt_to_phys(va);
> }

Ah, good catch. Something like this?

---

From d2c17f46686076677da3bf04caa2f69d654f8d8a Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@codeblueprint.co.uk>
Date: Thu, 20 Oct 2016 22:17:21 +0100
Subject: [PATCH] x86/efi: Prevent mixed mode boot corruption with
 CONFIG_VMAP_STACK

Booting an EFI mixed mode kernel has been crashing since commit:

  e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)")

The user-visible effect in my test setup was the kernel being unable
to find the root file system ramdisk. This was likely caused by silent
memory or page table corruption.

Enabling CONFIG_DEBUG_VIRTUAL immediately flagged the thunking code as
abusing virt_to_phys() because it was passing addresses that were not
part of the kernel direct mapping.

Use the slow version instead, which correctly handles all memory
regions by performing a page table walk.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/platform/efi/efi_64.c | 79 +++++++++++++++++++++++++++++-------------
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 58b0f801f66f..010544293dda 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -31,6 +31,7 @@
 #include <linux/io.h>
 #include <linux/reboot.h>
 #include <linux/slab.h>
+#include <linux/ucs2_string.h>
 
 #include <asm/setup.h>
 #include <asm/page.h>
@@ -211,11 +212,36 @@ void efi_sync_low_kernel_mappings(void)
 	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 }
 
+/*
+ * Wrapper for slow_virt_to_phys() that handles NULL addresses.
+ */
+static inline phys_addr_t
+virt_to_phys_or_null_size(void *va, unsigned long size)
+{
+	if (!va)
+		return 0;
+
+	if (virt_addr_valid(va))
+		return virt_to_phys(va);
+
+	/*
+	 * A fully aligned variable on the stack is guaranteed not to
+	 * cross a page boundary.
+	 */
+	WARN_ON(!IS_ALIGNED((unsigned long)va, size) || size > PAGE_SIZE);
+
+	return slow_virt_to_phys(va);
+}
+
+#define virt_to_phys_or_null(addr)				\
+	virt_to_phys_or_null_size((addr), sizeof(*(addr)))
+
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	unsigned long pfn, text;
 	struct page *page;
 	unsigned npages;
+	void *addr;
 	pgd_t *pgd;
 
 	if (efi_enabled(EFI_OLD_MEMMAP))
@@ -251,7 +277,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 	if (!page)
 		panic("Unable to allocate EFI runtime stack < 4GB\n");
 
-	efi_scratch.phys_stack = virt_to_phys(page_address(page));
+	addr = page_address(page);
+	efi_scratch.phys_stack = virt_to_phys_or_null_size(addr, PAGE_SIZE);
 	efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
 
 	npages = (_etext - _text) >> PAGE_SHIFT;
@@ -494,8 +521,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
-	phys_tc = virt_to_phys(tc);
+	phys_tm = virt_to_phys_or_null(tm);
+	phys_tc = virt_to_phys_or_null(tc);
 
 	status = efi_thunk(get_time, phys_tm, phys_tc);
 
@@ -511,7 +538,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_time, phys_tm);
 
@@ -529,9 +556,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending,
 
 	spin_lock(&rtc_lock);
 
-	phys_enabled = virt_to_phys(enabled);
-	phys_pending = virt_to_phys(pending);
-	phys_tm = virt_to_phys(tm);
+	phys_enabled = virt_to_phys_or_null(enabled);
+	phys_pending = virt_to_phys_or_null(pending);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(get_wakeup_time, phys_enabled,
 			     phys_pending, phys_tm);
@@ -549,7 +576,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_wakeup_time, enabled, phys_tm);
 
@@ -558,6 +585,10 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 	return status;
 }
 
+static unsigned long efi_name_size(efi_char16_t *name)
+{
+	return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
+}
 
 static efi_status_t
 efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
@@ -567,11 +598,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_attr;
 	u32 phys_data_size, phys_data;
 
-	phys_data_size = virt_to_phys(data_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
-	phys_attr = virt_to_phys(attr);
-	phys_data = virt_to_phys(data);
+	phys_data_size = virt_to_phys_or_null(data_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null_size(name, efi_name_size(name));
+	phys_attr = virt_to_phys_or_null(attr);
+	phys_data = virt_to_phys_or_null_size(data, *data_size);
 
 	status = efi_thunk(get_variable, phys_name, phys_vendor,
 			   phys_attr, phys_data_size, phys_data);
@@ -586,9 +617,9 @@ efi_thunk_set_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_data;
 	efi_status_t status;
 
-	phys_name = virt_to_phys(name);
-	phys_vendor = virt_to_phys(vendor);
-	phys_data = virt_to_phys(data);
+	phys_name = virt_to_phys_or_null_size(name, efi_name_size(name));
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_data = virt_to_phys_or_null_size(data, data_size);
 
 	/* If data_size is > sizeof(u32) we've got problems */
 	status = efi_thunk(set_variable, phys_name, phys_vendor,
@@ -605,9 +636,9 @@ efi_thunk_get_next_variable(unsigned long *name_size,
 	efi_status_t status;
 	u32 phys_name_size, phys_name, phys_vendor;
 
-	phys_name_size = virt_to_phys(name_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
+	phys_name_size = virt_to_phys_or_null(name_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null_size(name, *name_size);
 
 	status = efi_thunk(get_next_variable, phys_name_size,
 			   phys_name, phys_vendor);
@@ -621,7 +652,7 @@ efi_thunk_get_next_high_mono_count(u32 *count)
 	efi_status_t status;
 	u32 phys_count;
 
-	phys_count = virt_to_phys(count);
+	phys_count = virt_to_phys_or_null(count);
 	status = efi_thunk(get_next_high_mono_count, phys_count);
 
 	return status;
@@ -633,7 +664,7 @@ efi_thunk_reset_system(int reset_type, efi_status_t status,
 {
 	u32 phys_data;
 
-	phys_data = virt_to_phys(data);
+	phys_data = virt_to_phys_or_null_size(data, data_size);
 
 	efi_thunk(reset_system, reset_type, status, data_size, phys_data);
 }
@@ -661,9 +692,9 @@ efi_thunk_query_variable_info(u32 attr, u64 *storage_space,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	phys_storage = virt_to_phys(storage_space);
-	phys_remaining = virt_to_phys(remaining_space);
-	phys_max = virt_to_phys(max_variable_size);
+	phys_storage = virt_to_phys_or_null(storage_space);
+	phys_remaining = virt_to_phys_or_null(remaining_space);
+	phys_max = virt_to_phys_or_null(max_variable_size);
 
 	status = efi_thunk(query_variable_info, attr, phys_storage,
 			   phys_remaining, phys_max);
-- 
2.10.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
  2016-10-24 13:09           ` Matt Fleming
@ 2016-10-30 16:21             ` Andy Lutomirski
  -1 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2016-10-30 16:21 UTC (permalink / raw)
  To: Matt Fleming
  Cc: linux-efi, Brian Gerst, linux-tip-commits, Thomas Gleixner,
	Ingo Molnar, Linus Torvalds, Ard Biesheuvel, Josh Poimboeuf,
	Denys Vlasenko, H. Peter Anvin, linux-kernel, Nadav Amit,
	Borislav Petkov, Peter Zijlstra

On Mon, Oct 24, 2016 at 6:09 AM, Matt Fleming <matt@codeblueprint.co.uk> wrote:
>
> From d2c17f46686076677da3bf04caa2f69d654f8d8a Mon Sep 17 00:00:00 2001
> From: Matt Fleming <matt@codeblueprint.co.uk>
> Date: Thu, 20 Oct 2016 22:17:21 +0100
> Subject: [PATCH] x86/efi: Prevent mixed mode boot corruption with
>  CONFIG_VMAP_STACK
>
> Booting an EFI mixed mode kernel has been crashing since commit:
>

Looks reasonable.  It's certainly a considerable improvement over the
status quo.  Minor comments below, mainly of the form "do you need all
these changes":

>  int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>  {
>         unsigned long pfn, text;
>         struct page *page;
>         unsigned npages;
> +       void *addr;
>         pgd_t *pgd;
>
>         if (efi_enabled(EFI_OLD_MEMMAP))
> @@ -251,7 +277,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>         if (!page)
>                 panic("Unable to allocate EFI runtime stack < 4GB\n");
>
> -       efi_scratch.phys_stack = virt_to_phys(page_address(page));
> +       addr = page_address(page);
> +       efi_scratch.phys_stack = virt_to_phys_or_null_size(addr, PAGE_SIZE);

This can't be on the stack -- you just allocated it with alloc_page().

>         efi_scratch.phys_stack += PAGE_SIZE; /* stack grows down */
>
>         npages = (_etext - _text) >> PAGE_SHIFT;
> @@ -494,8 +521,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc)
>
>         spin_lock(&rtc_lock);
>
> -       phys_tm = virt_to_phys(tm);
> -       phys_tc = virt_to_phys(tc);
> +       phys_tm = virt_to_phys_or_null(tm);
> +       phys_tc = virt_to_phys_or_null(tc);

Seems okay.

> @@ -511,7 +538,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm)
>
>         spin_lock(&rtc_lock);
>
> -       phys_tm = virt_to_phys(tm);
> +       phys_tm = virt_to_phys_or_null(tm);

Seems okay.

> @@ -529,9 +556,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending,
>
>         spin_lock(&rtc_lock);
>
> -       phys_enabled = virt_to_phys(enabled);
> -       phys_pending = virt_to_phys(pending);
> -       phys_tm = virt_to_phys(tm);
> +       phys_enabled = virt_to_phys_or_null(enabled);
> +       phys_pending = virt_to_phys_or_null(pending);
> +       phys_tm = virt_to_phys_or_null(tm);

All seem okay.

>
>         status = efi_thunk(get_wakeup_time, phys_enabled,
>                              phys_pending, phys_tm);
> @@ -549,7 +576,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
>
>         spin_lock(&rtc_lock);
>
> -       phys_tm = virt_to_phys(tm);
> +       phys_tm = virt_to_phys_or_null(tm);

Seems reasonable.

> +static unsigned long efi_name_size(efi_char16_t *name)
> +{
> +       return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
> +}

If this is really dynamic, I'm surprised that anything ends up
aligned.  Can't this be a number like 6?  It might pay to extend that
warning to also check that, if the variable is on the stack, its size
is a power of two.  But maybe none of the users of this are on the
stack, in which case it might pay to try to prevent a new user on the
stack from showing up.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
  2016-10-30 16:21             ` Andy Lutomirski
@ 2016-11-07 12:32               ` Matt Fleming
  -1 siblings, 0 replies; 27+ messages in thread
From: Matt Fleming @ 2016-11-07 12:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-efi, Brian Gerst, linux-tip-commits, Thomas Gleixner,
	Ingo Molnar, Linus Torvalds, Ard Biesheuvel, Josh Poimboeuf,
	Denys Vlasenko, H. Peter Anvin, linux-kernel, Nadav Amit,
	Borislav Petkov, Peter Zijlstra

On Sun, 30 Oct, at 09:21:29AM, Andy Lutomirski wrote:
> > @@ -251,7 +277,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> >         if (!page)
> >                 panic("Unable to allocate EFI runtime stack < 4GB\n");
> >
> > -       efi_scratch.phys_stack = virt_to_phys(page_address(page));
> > +       addr = page_address(page);
> > +       efi_scratch.phys_stack = virt_to_phys_or_null_size(addr, PAGE_SIZE);
> 
> This can't be on the stack -- you just allocated it with alloc_page().
 
Oh good point. I was too eager with search and replace.

> > +static unsigned long efi_name_size(efi_char16_t *name)
> > +{
> > +       return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
> > +}
> 
> If this is really dynamic, I'm surprised that anything ends up
> aligned.  Can't this be a number like 6?  It might pay to extend that
> warning to also check that, if the variable is on the stack, its size
> is a power of two.  But maybe none of the users of this are on the
> stack, in which case it might pay to try to prevent a new user on the
> stack from showing up.

I can't find any existing users that place strings on the stack, no.

OK, let's try this one.

---

From 3f8a1ec209a458a9e4b82313186fbde86082696b Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@codeblueprint.co.uk>
Date: Thu, 20 Oct 2016 22:17:21 +0100
Subject: [PATCH] x86/efi: Prevent mixed mode boot corruption with
 CONFIG_VMAP_STACK

Booting an EFI mixed mode kernel has been crashing since commit:

  e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)")

The user-visible effect in my test setup was the kernel being unable
to find the root file system ramdisk. This was likely caused by silent
memory or page table corruption.

Enabling CONFIG_DEBUG_VIRTUAL immediately flagged the thunking code as
abusing virt_to_phys() because it was passing addresses that were not
part of the kernel direct mapping.

Use the slow version instead, which correctly handles all memory
regions by performing a page table walk.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/platform/efi/efi_64.c | 80 ++++++++++++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 23 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 58b0f801f66f..319148bd4b05 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -31,6 +31,7 @@
 #include <linux/io.h>
 #include <linux/reboot.h>
 #include <linux/slab.h>
+#include <linux/ucs2_string.h>
 
 #include <asm/setup.h>
 #include <asm/page.h>
@@ -211,6 +212,35 @@ void efi_sync_low_kernel_mappings(void)
 	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 }
 
+/*
+ * Wrapper for slow_virt_to_phys() that handles NULL addresses.
+ */
+static inline phys_addr_t
+virt_to_phys_or_null_size(void *va, unsigned long size)
+{
+	bool bad_size;
+
+	if (!va)
+		return 0;
+
+	if (virt_addr_valid(va))
+		return virt_to_phys(va);
+
+	/*
+	 * A fully aligned variable on the stack is guaranteed not to
+	 * cross a page bounary. Try to catch strings on the stack by
+	 * checking that 'size' is a power of two.
+	 */
+	bad_size = size > PAGE_SIZE || !is_power_of_2(size);
+
+	WARN_ON(!IS_ALIGNED((unsigned long)va, size) || bad_size);
+
+	return slow_virt_to_phys(va);
+}
+
+#define virt_to_phys_or_null(addr)				\
+	virt_to_phys_or_null_size((addr), sizeof(*(addr)))
+
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	unsigned long pfn, text;
@@ -494,8 +524,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
-	phys_tc = virt_to_phys(tc);
+	phys_tm = virt_to_phys_or_null(tm);
+	phys_tc = virt_to_phys_or_null(tc);
 
 	status = efi_thunk(get_time, phys_tm, phys_tc);
 
@@ -511,7 +541,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_time, phys_tm);
 
@@ -529,9 +559,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending,
 
 	spin_lock(&rtc_lock);
 
-	phys_enabled = virt_to_phys(enabled);
-	phys_pending = virt_to_phys(pending);
-	phys_tm = virt_to_phys(tm);
+	phys_enabled = virt_to_phys_or_null(enabled);
+	phys_pending = virt_to_phys_or_null(pending);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(get_wakeup_time, phys_enabled,
 			     phys_pending, phys_tm);
@@ -549,7 +579,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_wakeup_time, enabled, phys_tm);
 
@@ -558,6 +588,10 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 	return status;
 }
 
+static unsigned long efi_name_size(efi_char16_t *name)
+{
+	return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
+}
 
 static efi_status_t
 efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
@@ -567,11 +601,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_attr;
 	u32 phys_data_size, phys_data;
 
-	phys_data_size = virt_to_phys(data_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
-	phys_attr = virt_to_phys(attr);
-	phys_data = virt_to_phys(data);
+	phys_data_size = virt_to_phys_or_null(data_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null_size(name, efi_name_size(name));
+	phys_attr = virt_to_phys_or_null(attr);
+	phys_data = virt_to_phys_or_null_size(data, *data_size);
 
 	status = efi_thunk(get_variable, phys_name, phys_vendor,
 			   phys_attr, phys_data_size, phys_data);
@@ -586,9 +620,9 @@ efi_thunk_set_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_data;
 	efi_status_t status;
 
-	phys_name = virt_to_phys(name);
-	phys_vendor = virt_to_phys(vendor);
-	phys_data = virt_to_phys(data);
+	phys_name = virt_to_phys_or_null_size(name, efi_name_size(name));
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_data = virt_to_phys_or_null_size(data, data_size);
 
 	/* If data_size is > sizeof(u32) we've got problems */
 	status = efi_thunk(set_variable, phys_name, phys_vendor,
@@ -605,9 +639,9 @@ efi_thunk_get_next_variable(unsigned long *name_size,
 	efi_status_t status;
 	u32 phys_name_size, phys_name, phys_vendor;
 
-	phys_name_size = virt_to_phys(name_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
+	phys_name_size = virt_to_phys_or_null(name_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null_size(name, *name_size);
 
 	status = efi_thunk(get_next_variable, phys_name_size,
 			   phys_name, phys_vendor);
@@ -621,7 +655,7 @@ efi_thunk_get_next_high_mono_count(u32 *count)
 	efi_status_t status;
 	u32 phys_count;
 
-	phys_count = virt_to_phys(count);
+	phys_count = virt_to_phys_or_null(count);
 	status = efi_thunk(get_next_high_mono_count, phys_count);
 
 	return status;
@@ -633,7 +667,7 @@ efi_thunk_reset_system(int reset_type, efi_status_t status,
 {
 	u32 phys_data;
 
-	phys_data = virt_to_phys(data);
+	phys_data = virt_to_phys_or_null_size(data, data_size);
 
 	efi_thunk(reset_system, reset_type, status, data_size, phys_data);
 }
@@ -661,9 +695,9 @@ efi_thunk_query_variable_info(u32 attr, u64 *storage_space,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	phys_storage = virt_to_phys(storage_space);
-	phys_remaining = virt_to_phys(remaining_space);
-	phys_max = virt_to_phys(max_variable_size);
+	phys_storage = virt_to_phys_or_null(storage_space);
+	phys_remaining = virt_to_phys_or_null(remaining_space);
+	phys_max = virt_to_phys_or_null(max_variable_size);
 
 	status = efi_thunk(query_variable_info, attr, phys_storage,
 			   phys_remaining, phys_max);
-- 
2.10.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)
@ 2016-11-07 12:32               ` Matt Fleming
  0 siblings, 0 replies; 27+ messages in thread
From: Matt Fleming @ 2016-11-07 12:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-efi, Brian Gerst, linux-tip-commits, Thomas Gleixner,
	Ingo Molnar, Linus Torvalds, Ard Biesheuvel, Josh Poimboeuf,
	Denys Vlasenko, H. Peter Anvin, linux-kernel, Nadav Amit,
	Borislav Petkov, Peter Zijlstra

On Sun, 30 Oct, at 09:21:29AM, Andy Lutomirski wrote:
> > @@ -251,7 +277,8 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
> >         if (!page)
> >                 panic("Unable to allocate EFI runtime stack < 4GB\n");
> >
> > -       efi_scratch.phys_stack = virt_to_phys(page_address(page));
> > +       addr = page_address(page);
> > +       efi_scratch.phys_stack = virt_to_phys_or_null_size(addr, PAGE_SIZE);
> 
> This can't be on the stack -- you just allocated it with alloc_page().
 
Oh good point. I was too eager with search and replace.

> > +static unsigned long efi_name_size(efi_char16_t *name)
> > +{
> > +       return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
> > +}
> 
> If this is really dynamic, I'm surprised that anything ends up
> aligned.  Can't this be a number like 6?  It might pay to extend that
> warning to also check that, if the variable is on the stack, its size
> is a power of two.  But maybe none of the users of this are on the
> stack, in which case it might pay to try to prevent a new user on the
> stack from showing up.

I can't find any existing users that place strings on the stack, no.

OK, let's try this one.

---

From 3f8a1ec209a458a9e4b82313186fbde86082696b Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@codeblueprint.co.uk>
Date: Thu, 20 Oct 2016 22:17:21 +0100
Subject: [PATCH] x86/efi: Prevent mixed mode boot corruption with
 CONFIG_VMAP_STACK

Booting an EFI mixed mode kernel has been crashing since commit:

  e37e43a497d5 ("x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y)")

The user-visible effect in my test setup was the kernel being unable
to find the root file system ramdisk. This was likely caused by silent
memory or page table corruption.

Enabling CONFIG_DEBUG_VIRTUAL immediately flagged the thunking code as
abusing virt_to_phys() because it was passing addresses that were not
part of the kernel direct mapping.

Use slow_virt_to_phys() instead, which correctly handles all memory
regions by performing a page table walk.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 arch/x86/platform/efi/efi_64.c | 80 ++++++++++++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 23 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 58b0f801f66f..319148bd4b05 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -31,6 +31,7 @@
 #include <linux/io.h>
 #include <linux/reboot.h>
 #include <linux/slab.h>
+#include <linux/ucs2_string.h>
 
 #include <asm/setup.h>
 #include <asm/page.h>
@@ -211,6 +212,35 @@ void efi_sync_low_kernel_mappings(void)
 	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 }
 
+/*
+ * Wrapper for slow_virt_to_phys() that handles NULL addresses.
+ */
+static inline phys_addr_t
+virt_to_phys_or_null_size(void *va, unsigned long size)
+{
+	bool bad_size;
+
+	if (!va)
+		return 0;
+
+	if (virt_addr_valid(va))
+		return virt_to_phys(va);
+
+	/*
+	 * A fully aligned variable on the stack is guaranteed not to
+	 * cross a page boundary. Try to catch strings on the stack by
+	 * checking that 'size' is a power of two.
+	 */
+	bad_size = size > PAGE_SIZE || !is_power_of_2(size);
+
+	WARN_ON(!IS_ALIGNED((unsigned long)va, size) || bad_size);
+
+	return slow_virt_to_phys(va);
+}
+
+#define virt_to_phys_or_null(addr)				\
+	virt_to_phys_or_null_size((addr), sizeof(*(addr)))
+
 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
 	unsigned long pfn, text;
@@ -494,8 +524,8 @@ static efi_status_t efi_thunk_get_time(efi_time_t *tm, efi_time_cap_t *tc)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
-	phys_tc = virt_to_phys(tc);
+	phys_tm = virt_to_phys_or_null(tm);
+	phys_tc = virt_to_phys_or_null(tc);
 
 	status = efi_thunk(get_time, phys_tm, phys_tc);
 
@@ -511,7 +541,7 @@ static efi_status_t efi_thunk_set_time(efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_time, phys_tm);
 
@@ -529,9 +559,9 @@ efi_thunk_get_wakeup_time(efi_bool_t *enabled, efi_bool_t *pending,
 
 	spin_lock(&rtc_lock);
 
-	phys_enabled = virt_to_phys(enabled);
-	phys_pending = virt_to_phys(pending);
-	phys_tm = virt_to_phys(tm);
+	phys_enabled = virt_to_phys_or_null(enabled);
+	phys_pending = virt_to_phys_or_null(pending);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(get_wakeup_time, phys_enabled,
 			     phys_pending, phys_tm);
@@ -549,7 +579,7 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 
 	spin_lock(&rtc_lock);
 
-	phys_tm = virt_to_phys(tm);
+	phys_tm = virt_to_phys_or_null(tm);
 
 	status = efi_thunk(set_wakeup_time, enabled, phys_tm);
 
@@ -558,6 +588,10 @@ efi_thunk_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
 	return status;
 }
 
+static unsigned long efi_name_size(efi_char16_t *name)
+{
+	return ucs2_strsize(name, EFI_VAR_NAME_LEN) + 1;
+}
 
 static efi_status_t
 efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
@@ -567,11 +601,11 @@ efi_thunk_get_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_attr;
 	u32 phys_data_size, phys_data;
 
-	phys_data_size = virt_to_phys(data_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
-	phys_attr = virt_to_phys(attr);
-	phys_data = virt_to_phys(data);
+	phys_data_size = virt_to_phys_or_null(data_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null_size(name, efi_name_size(name));
+	phys_attr = virt_to_phys_or_null(attr);
+	phys_data = virt_to_phys_or_null_size(data, *data_size);
 
 	status = efi_thunk(get_variable, phys_name, phys_vendor,
 			   phys_attr, phys_data_size, phys_data);
@@ -586,9 +620,9 @@ efi_thunk_set_variable(efi_char16_t *name, efi_guid_t *vendor,
 	u32 phys_name, phys_vendor, phys_data;
 	efi_status_t status;
 
-	phys_name = virt_to_phys(name);
-	phys_vendor = virt_to_phys(vendor);
-	phys_data = virt_to_phys(data);
+	phys_name = virt_to_phys_or_null_size(name, efi_name_size(name));
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_data = virt_to_phys_or_null_size(data, data_size);
 
 	/* If data_size is > sizeof(u32) we've got problems */
 	status = efi_thunk(set_variable, phys_name, phys_vendor,
@@ -605,9 +639,9 @@ efi_thunk_get_next_variable(unsigned long *name_size,
 	efi_status_t status;
 	u32 phys_name_size, phys_name, phys_vendor;
 
-	phys_name_size = virt_to_phys(name_size);
-	phys_vendor = virt_to_phys(vendor);
-	phys_name = virt_to_phys(name);
+	phys_name_size = virt_to_phys_or_null(name_size);
+	phys_vendor = virt_to_phys_or_null(vendor);
+	phys_name = virt_to_phys_or_null_size(name, *name_size);
 
 	status = efi_thunk(get_next_variable, phys_name_size,
 			   phys_name, phys_vendor);
@@ -621,7 +655,7 @@ efi_thunk_get_next_high_mono_count(u32 *count)
 	efi_status_t status;
 	u32 phys_count;
 
-	phys_count = virt_to_phys(count);
+	phys_count = virt_to_phys_or_null(count);
 	status = efi_thunk(get_next_high_mono_count, phys_count);
 
 	return status;
@@ -633,7 +667,7 @@ efi_thunk_reset_system(int reset_type, efi_status_t status,
 {
 	u32 phys_data;
 
-	phys_data = virt_to_phys(data);
+	phys_data = virt_to_phys_or_null_size(data, data_size);
 
 	efi_thunk(reset_system, reset_type, status, data_size, phys_data);
 }
@@ -661,9 +695,9 @@ efi_thunk_query_variable_info(u32 attr, u64 *storage_space,
 	if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
 		return EFI_UNSUPPORTED;
 
-	phys_storage = virt_to_phys(storage_space);
-	phys_remaining = virt_to_phys(remaining_space);
-	phys_max = virt_to_phys(max_variable_size);
+	phys_storage = virt_to_phys_or_null(storage_space);
+	phys_remaining = virt_to_phys_or_null(remaining_space);
+	phys_max = virt_to_phys_or_null(max_variable_size);
 
 	status = efi_thunk(query_variable_info, attr, phys_storage,
 			   phys_remaining, phys_max);
-- 
2.10.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-11-07 12:58 UTC | newest]

Thread overview: 27+ messages
2016-08-11  9:35 [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
2016-08-11  9:35 ` [PATCH v6 1/3] fork: Add generic vmalloced stack support Andy Lutomirski
2016-08-15 11:55   ` Michal Hocko
2016-08-24 10:03   ` Ingo Molnar
2016-08-24 16:11     ` Dmitry Vyukov
2016-08-24 13:02   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
2016-08-24 16:51   ` [PATCH v6 1/3] " Josh Cartwright
2016-08-30 22:01     ` Andy Lutomirski
2016-08-30 22:26       ` Josh Cartwright
2016-08-11  9:35 ` [PATCH v6 2/3] dma-api: Teach the "DMA-from-stack" check about vmapped stacks Andy Lutomirski
2016-08-24 13:02   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
2016-08-11  9:35 ` [PATCH v6 3/3] x86/mm/64: Enable " Andy Lutomirski
2016-08-24 13:03   ` [tip:x86/asm] x86/mm/64: Enable vmapped stacks (CONFIG_HAVE_ARCH_VMAP_STACK=y) tip-bot for Andy Lutomirski
2016-10-21 12:32     ` Matt Fleming
2016-10-21 12:32       ` Matt Fleming
2016-10-22  0:18       ` Andy Lutomirski
2016-10-22  0:18         ` Andy Lutomirski
2016-10-24 13:09         ` Matt Fleming
2016-10-24 13:09           ` Matt Fleming
2016-10-30 16:21           ` Andy Lutomirski
2016-10-30 16:21             ` Andy Lutomirski
2016-11-07 12:32             ` Matt Fleming
2016-11-07 12:32               ` Matt Fleming
2016-08-20 18:06 ` [PATCH v6 0/3] virtually mapped stacks Andy Lutomirski
2016-08-24 10:12   ` Ingo Molnar
2016-08-24 15:27     ` Andy Lutomirski
2016-08-24 17:16       ` Ingo Molnar
